rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
https://rikyoz.github.io/bit7z
Mozilla Public License 2.0
602 stars 110 forks source link

[Possible Bug]: BitArchiveReader::items is slow? #213

Closed kleuter closed 1 month ago

kleuter commented 1 month ago

bit7z version

4.0.x

Compilation options

No response

7-zip version

v23.01

7-zip shared library used

7z.dll / 7z.so

Compilers

MSVC

Compiler versions

No response

Architecture

x86_64

Operating system

Windows

Operating system versions

Windows 10

Bug description

Hi,

Trying to open this ZIP archive (1.38 GB, over 350.000 files).

Calling BitArchiveReader::items takes good 6 seconds (Release) while official 7zFM.exe opens it in no time. Ideas?

PS.

for ( uint32_t i = 0; i < itemsCount(); ++i )

I don't think it's a good idea to call itemsCount over 350.000 times (for this particular archive).

const uint32_t numberOfItems = itemsCount();

would be better, though it doesn't help anyways in the matter.

Steps to reproduce

No response

Expected behavior

No response

Relevant compilation output

No response

Code of Conduct

rikyoz commented 1 month ago

Hi!

Calling BitArchiveReader::items takes good 6 seconds while official 7zFM.exe opens it in no time. Ideas?

Part of the reason is that items() is a vector of whole item objects (a.k.a. BitArchiveItemInfo). This means that each BitArchiveItemInfo object stores all the properties about that item that bit7z could get via 7-Zip; these properties are stored in a map BitProperty (the property) -> BitPropVariant (the value of the property).

Unfortunately, 7-Zip doesn't provide a method to know which item properties are present or not (as far as I know). So what bit7z does is to try to get all of them (~95 iirc); if it fails, the property is not present, otherwise it is stored in the map.

Now, if you do all this for ~350000 items, it can have a significant overhead, obviously. Bit7z does this because it doesn't know how the user wants to use the vector returned by items(). Specifically, the user might want to use the content of the vector after the BitArchiveReader went out of scope, and the archive was closed. So bit7z has to store every information about the items.

However, there's an alternative to items(): you can directly iterate over a BitArchiveReader. This way, you pretty much avoid the overhead of reading and storing all the properties of all the items. BitArchiveReader has the begin()/cbegin()/end()/cend() methods that allow, for example, to iterate over it with a for each loop. An iterator of BitArchiveReader points to a BitArchiveItemOffset object, which is basically just an index of an item within the archive. BitArchiveItemOffset has the same exact interface of BitArchiveItemInfo, but doesn't store any property, it only stores the index of the item. When you ask for something like the path() of the item, bit7z will directly ask 7-Zip for its value.

For example:

BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
const auto items = reader.items();
for ( const auto& item : items ) {
    std::cout << item.index() << ") " << item.path() << '\n';
}

becomes

BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
for ( const auto& item : reader ) {
    std::cout << item.index() << ") " << item.path() << '\n';
}

Sorry for the wall-of-text, I just wanted to give some insight on why items() works like that and might have a significant overhead in some cases like yours.

rikyoz commented 1 month ago

PS.

for ( uint32_t i = 0; i < itemsCount(); ++i )

I don't think it's a good idea to call itemsCount over 350.000 times (for this particular archive).

const uint32_t numberOfItems = itemsCount();

would be better, though it doesn't help anyways in the matter.

Nice catch! I think the compiler might be able to optimise it out of the loop condition, but I'm not sure if it can in this particular case. I'll fix it in the next minor release! Thanks for pointing it out!

rikyoz commented 1 month ago

I've done some quick benchmarks (without any rigorous statistical analysis):

// Optimizing std::cout on MSVC
setvbuf(stdout, nullptr, _IOLBF, 4096);
std::ios::sync_with_stdio(false);

// Printing the content of the archive
BitArchiveReader reader{ lib, "qt-everywhere-src-6.7.0.zip", BitFormat::Zip };
const auto start_items = std::chrono::high_resolution_clock::now();
const auto items = reader.items();
for ( const auto& item : items ) {
    std::cout << item.path() << '\n';
}
const auto end_items = std::chrono::high_resolution_clock::now();

const auto start_it = std::chrono::high_resolution_clock::now();
for ( const auto& item : reader ) {
    std::cout << item.path() << '\n';
}
const auto end_it = std::chrono::high_resolution_clock::now();

std::cout << std::endl;

std::cout << "Time taken\n";
std::cout << "  using items(): " << std::chrono::duration_cast<std::chrono::seconds>(end_items - start_items) << '\n';
std::cout << "  using iterators: " << std::chrono::duration_cast<std::chrono::seconds>(end_it - start_it) << '\n';

Results (release build, MSVC 2022):

Time taken
  using items(): 7s
  using iterators: 3s

Obviously, the more information you print, the slower the loop, in any case. For example, if you want to imitate the l subcommand of 7z.exe

void print_item(const BitArchiveItem& item) {
    const auto date = item.itemProperty(BitProperty::MTime);
    const auto attr = item.isDir() ? "D...." : ".....";
    std::cout << date.getTimePoint() << ' ' <<
                 attr << ' ' <<
                 std::setw(8) << item.size() << ' ' <<
                 std::setw(8) << item.packSize() << ' ' <<
                 item.path() << '\n';
}
// use print_item(item); in the for loops

The results are:

Time taken
  using items(): 11s
  using iterators: 6s

In general, the version using iterators is always faster than the one using the items() function. In my tests, iterators were consistently ~4s faster.

The 7z.exe is significantly slower. However, the comparison is a bit unfair, as it does more things, like parsing the CLI arguments, open the archive (in my program I didn't include this), etc.

>cmake -E time 7z l qt-everywhere-src-6.7.0.zip
<7z output>
Elapsed time (seconds): 67.6411

My guess is that 7zFMT.exe is even faster simply because it doesn't have to immediately display all the paths. For example, if we non-recursively print only the items inside the root folder:

Time taken
  using items(): 4s
  using iterators: 0s
kleuter commented 1 month ago

thanks a lot for such a detailed explanation, it works comparable now!

rikyoz commented 1 month ago

You're welcome!

rikyoz commented 1 month ago

Just for reference, besides calling itemsCount() only once, I found out that the items() method could be further optimized.

The biggest improvement was achieved by calling std::vector::reserve, which is possible since we already know the number of items that the vector will contain.

Another (smaller) improvement was achieved by moving the BitArchiveItemInfo objects that are pushed into the vector instead of copying them.

From my tests, after all the changes, itemsCount() is about 25% faster on average than before. For example, with the same test program I used for this thread to emulate the 7z.exe program, items() is now only ~2s slower than the iterator version:

 Time taken
  using items(): 8s
  using iterators: 6s

The changes are available in the latest v4.0.7.