rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
https://rikyoz.github.io/bit7z
Mozilla Public License 2.0
650 stars 117 forks source link

[Question] Extract files/folders matching complex pattern? #24

Closed levicki closed 5 years ago

levicki commented 5 years ago

Before describing my issue please allow me to thank you for this great, well written, and simple to use library which I just discovered, and which you are so generously providing free of charge -- I really appreciate it.

I would like to extract files from an archive using a complex pattern. From the documentation it seems that BitExtractor::extractMatching() method should be used, but its pattern support does not seem to be documented.

I checked the code in BitExtractor.cpp and to me it seems that it supports filesystem wildcards such as ? and *. Sadly, that is not adequate for my use case (I am unpacking NVIDIA drivers), because I want to use regex to match multiple variants of paths inside an archive.

I see that there is also a callback support (FileCallback) but to me it seems that it doesn't allow to cancel either the current file or the whole operation so that appears to be a dead-end too as far as I am concerned.

There are at least two approaches to solving this issue:

  1. Provide a BitExtractor::extractMatchingRegex() method which accepts regex string instead of filesystem wildcard pattern.
  2. Provide a callback which is receiving current item path (not just name), and which allows skipping an item and/or ending the archive operation based on its return value.

Personally, I would prefer the option 1 and I believe it would be simple and fast to implement.

An example of my regex filter for extraction of NVIDIA drivers:

std::wstring DriverExtractRegex = LR"(^(Display\.(Driver|Optimus)|HDAudio|MSVCRT|NGXCore|NVI2|PhysX|PPC).*$|^GFExperience\\(PrivacyPolicy\\PrivacyPolicy_[a-z]{2}\-[A-Z]{2,3}\.htm|EULA.html|FunctionalConsent_[a-z]{2}\-[A-Z]{2,3}\.txt)$|^(ListDevices|license|EULA)\.txt$|^setup\.(cfg|exe)$)";

An example of archive for extracting (542.8 MB download): http://us.download.nvidia.com/Windows/431.60/431.60-desktop-win10-64bit-international-whql.exe

I am looking forward to your thoughts on this issue.

levicki commented 5 years ago

Here is my current workaround:

#include <vector>
#include <string>
#include <regex>

#include "bit7z.hpp"

using namespace std;
using namespace bit7z;

int main()
{
    static const wstring LibraryPath = LR"(7za.dll)";
    static const wstring ArchivePath = LR"(431.70-desktop-win10-64bit-international-nsd-whql.exe)";
    static const wstring OutputPath = LR"(C:\NVIDIA)";
    static const wstring DriverExtractRegexPattern = LR"(^(Display\.(Driver|Optimus)|HDAudio|MSVCRT|NGXCore|NVI2|PhysX|PPC).*$|^GFExperience\\(PrivacyPolicy\\PrivacyPolicy_[a-z]{2}\-[A-Z]{2,3}\.htm|EULA.html|FunctionalConsent_[a-z]{2}\-[A-Z]{2,3}\.txt)$|^(ListDevices|license|EULA)\.txt$|^setup\.(cfg|exe)$)";
    static const std::wregex DriverExtractRegex(DriverExtractRegexPattern, regex::ECMAScript | regex::optimize);

    Bit7zLibrary Library(LibraryPath);
    BitArchiveInfo Info(Library, ArchivePath, BitFormat::SevenZip);
    BitExtractor Extractor(Library, BitFormat::SevenZip);

    auto Items = Info.items();

    vector<unsigned int> ItemIndices;

    for (auto& Item : Items) {
        if (regex_match(Item.path(), DriverExtractRegex)) {
            ItemIndices.push_back(Item.index());
        }
    }

    if (!ItemIndices.empty()) {
        Extractor.extractItems(ArchivePath, ItemIndices, OutputPath);
    }
}

I am not sure if built-in implementation would be any faster and/or more efficient, but I would still prefer having a single method to call.

rikyoz commented 5 years ago

Hi! First of all, thank you for using bit7z and for having submitted your issues!

Before describing my issue please allow me to thank you for this great, well written, and simple to use library which I just discovered, and which you are so generously providing free of charge -- I really appreciate it.

You are welcome and thank you, I'm glad my work is appreciated, it means a lot to me!

I would like to extract files from an archive using a complex pattern. From the documentation it seems that BitExtractor::extractMatching() method should be used, but its pattern support does not seem to be documented.

Yeah, I'm sorry, I must definitely document the pattern system supported by BitExtractor::extractMatching(). I've planned to fix this in the next release!

I checked the code in BitExtractor.cpp and to me it seems that it supports filesystem wildcards such as ? and *. Sadly, that is not adequate for my use case (I am unpacking NVIDIA drivers), because I want to use regex to match multiple variants of paths inside an archive.

Yes, unfortunately it supports only wildcards matching. Actually, when I was working for the matching extraction feature (to close issue #12), my first idea was indeed to use regexes! However, at that time I was struggling to maintain the support to MSVC 2010, which had some problem with regexes and they were still in the tr1. Hence I decided to stick to a simpler wildcards matching algorithm. Now, after some time I decided to drop the support to MSVC 2010, hence there should be no problem in implementing matching using regexes!

I see that there is also a callback support (FileCallback) but to me it seems that it doesn't allow to cancel either the current file or the whole operation so that appears to be a dead-end too as far as I am concerned.

Yeh, you're right, it is an output only callback, you cannot use it to cancel an operation (even though it is a possibility that sooner or later I would like to provide to users, e.g. to handle overwriting of files and to allow the user to make a choice about the action to take).

There are at least two approaches to solving this issue:

  1. Provide a BitExtractor::extractMatchingRegex() method which accepts regex string instead of filesystem wildcard pattern.
  2. Provide a callback which is receiving current item path (not just name), and which allows skipping an item and/or ending the archive operation based on its return value.

Personally, I would prefer the option 1 and I believe it would be simple and fast to implement.

I agree with you, I think that the first option is the right choice, no doubt! At the moment I'm working on implementing the support to standard streams as input/output for compression and extraction. Once completed, and after some code cleaning, I will implement this and it should be available in the next minor release (v3.1). Unfortunately, I can't give you precise completion dates since I'm also busy with other projects, however I hope to release it asap.

An example of my regex filter for extraction of NVIDIA drivers:

std::wstring DriverExtractRegex = LR"(^(Display\.(Driver|Optimus)|HDAudio|MSVCRT|NGXCore|NVI2|PhysX|PPC).*$|^GFExperience\\(PrivacyPolicy\\PrivacyPolicy_[a-z]{2}\-[A-Z]{2,3}\.htm|EULA.html|FunctionalConsent_[a-z]{2}\-[A-Z]{2,3}\.txt)$|^(ListDevices|license|EULA)\.txt$|^setup\.(cfg|exe)$)";

An example of archive for extracting (542.8 MB download): http://us.download.nvidia.com/Windows/431.60/431.60-desktop-win10-64bit-international-whql.exe

Perfect, I will use these for testing! Thank you!

rikyoz commented 5 years ago

Here is my current workaround:

#include <vector>
#include <string>
#include <regex>

#include "bit7z.hpp"

using namespace std;
using namespace bit7z;

int main()
{
  static const wstring LibraryPath = LR"(7za.dll)";
  static const wstring ArchivePath = LR"(431.70-desktop-win10-64bit-international-nsd-whql.exe)";
  static const wstring OutputPath = LR"(C:\NVIDIA)";
  static const wstring DriverExtractRegexPattern = LR"(^(Display\.(Driver|Optimus)|HDAudio|MSVCRT|NGXCore|NVI2|PhysX|PPC).*$|^GFExperience\\(PrivacyPolicy\\PrivacyPolicy_[a-z]{2}\-[A-Z]{2,3}\.htm|EULA.html|FunctionalConsent_[a-z]{2}\-[A-Z]{2,3}\.txt)$|^(ListDevices|license|EULA)\.txt$|^setup\.(cfg|exe)$)";
  static const std::wregex DriverExtractRegex(DriverExtractRegexPattern, regex::ECMAScript | regex::optimize);

  Bit7zLibrary Library(LibraryPath);
  BitArchiveInfo Info(Library, ArchivePath, BitFormat::SevenZip);
  BitExtractor Extractor(Library, BitFormat::SevenZip);

  auto Items = Info.items();

  vector<unsigned int> ItemIndices;

  for (auto& Item : Items) {
      if (regex_match(Item.path(), DriverExtractRegex)) {
          ItemIndices.push_back(Item.index());
      }
  }

  if (!ItemIndices.empty()) {
      Extractor.extractItems(ArchivePath, ItemIndices, OutputPath);
  }
}

I am not sure if built-in implementation would be any faster and/or more efficient, but I would still prefer having a single method to call.

The algorithm can be improved even without a built-in implementation. The main problem lies in the BitArchiveInfo::items() method: in fact, this reads all the properties of all the items in the archive (its purpose is to allow reading the items' properties even after the BitArchiveInfo object is destroyed and hence the archive closed). However, the only item property needed by the algorithm is the path and reading all the properties is not useful and a waste! Probably a for cycle using BitArchiveInfo::itemsCount() and BitArchiveInfo::getItemProperty(...) would be more efficient, since you would only request the BitProperty::Path. A built-in implementation, in this latter case, would not be any faster but, as you, I prefer to provide a single method in order to make the user code more clean!

Thank you again for your feedback!

levicki commented 5 years ago

Hi, you are welcome. Thanks for considering suggestions for improvements.

Also, thank you for letting me know about getItemProperty, I will definitely try that now. My initial goal was just to make it work :-)

levicki commented 5 years ago

This works too:

for (uint32_t i = 0; i < Info.itemsCount(); i++) {
    BitPropVariant Prop = Info.getItemProperty(i, BitProperty::Path);
    if (!Prop.isEmpty() && Prop.isString()) {
        if (regex_match(Prop.getString(), ExtractorRegex)) {
            ItemIndices.push_back(i);
        }
    }
}

Thanks for suggestion.

rikyoz commented 5 years ago

This works too:

for (uint32_t i = 0; i < Info.itemsCount(); i++) {
  BitPropVariant Prop = Info.getItemProperty(i, BitProperty::Path);
  if (!Prop.isEmpty() && Prop.isString()) {
      if (regex_match(Prop.getString(), ExtractorRegex)) {
          ItemIndices.push_back(i);
      }
  }
}

Thanks for suggestion.

Yeah, this is the algorithm I was thinking, you are welcome! As a side note: you only need to check for isString(), since a string variant cannot be an empty variant (but it can be an empty string).

levicki commented 5 years ago

The documentation for getItemProperty() says "Returns the current value of the item property or an empty BitPropVariant if no value is specified." To me it is not quite clear what "no value is specified" means. If it means that the archive item for some reason can have no Path value (e.g. corruption of some sort) then !IsEmpty() shoud guard against that. On the other hand, if it means that it returns empty variant if I specify NoProperty then it is safe to remove the check.

Perhaps the wording could be improved?

rikyoz commented 5 years ago

The documentation for getItemProperty() says "Returns the current value of the item property or an empty BitPropVariant if no value is specified." To me it is not quite clear what "no value is specified" means. If it means that the archive item for some reason can have no Path value (e.g. corruption of some sort) then !IsEmpty() shoud guard against that. On the other hand, if it means that it returns empty variant if I specify NoProperty then it is safe to remove the check.

Perhaps the wording could be improved?

Yeah, I definitely need to improve the wording. It should be "or an empty BitPropVariant if the item has no value for the property", i.e. the first case you suggested. Anyway, isEmpty() is still not needed in this case: isString() checks for the internal type of the variant and if the variant is empty it returns false (an empty variant is not a string variant). Hence if for some reason there is no path value, the method will return an empty variant and isString() will return false.

levicki commented 5 years ago

Thanks for clarifying.