Avoid the initial scan over the entire folder

mstange commented 8 months ago

We were reading information about all data blocks of a folder when creating the FolderReader. Then, once we actually proceeded to do the decompression by reading from the FolderReader, we were reading the same parts of the cab file again.

This commit changes the implementation so that we only read each data block when we decompress it, not upfront. This means that we only do a single pass over the folder data in the common case, and that we don't need to read the data at the end of the folder before we can start the decompression work.
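As a rough illustration of the lazy approach (not the actual rust-cab code), here is a minimal sketch. The `LazyBlockReader` type, its fields, and the toy block layout (a 2-byte little-endian length followed by an uncompressed payload) are all assumptions made for this example; real CAB data blocks carry more header fields and are usually compressed.

```rust
use std::io::{self, Read};

/// Toy block layout (an assumption, not the real CAB format):
/// a 2-byte little-endian payload length, followed by the payload bytes.
struct LazyBlockReader<R: Read> {
    inner: R,
    current: Vec<u8>,   // payload of the block currently being served
    pos: usize,         // read position inside `current`
    blocks_left: usize, // blocks that have not been loaded yet
}

impl<R: Read> LazyBlockReader<R> {
    fn new(inner: R, num_blocks: usize) -> Self {
        // No upfront scan: constructing the reader consumes no input bytes.
        LazyBlockReader { inner, current: Vec::new(), pos: 0, blocks_left: num_blocks }
    }

    /// Reads the next block header and payload, only once they are needed.
    fn load_block(&mut self) -> io::Result<()> {
        let mut header = [0u8; 2];
        self.inner.read_exact(&mut header)?;
        let len = u16::from_le_bytes(header) as usize;
        self.current.resize(len, 0);
        self.inner.read_exact(&mut self.current)?;
        self.pos = 0;
        self.blocks_left -= 1;
        Ok(())
    }
}

impl<R: Read> Read for LazyBlockReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        // Load blocks until one has unread bytes, or report end of data.
        while self.pos == self.current.len() {
            if self.blocks_left == 0 {
                return Ok(0);
            }
            self.load_block()?;
        }
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

With this shape, reading the first byte only pulls the first block from the underlying reader, so the tail of the input does not have to be available yet.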

The eagerly-read data block information was used for two purposes:

- For computing the total uncompressed size.
- When seeking to a new spot in the uncompressed data, for mapping the uncompressed offset to the corresponding data block and its start offset in the compressed data.

The total size was used in the Seek implementation to throw an error when seeking beyond the file end, and to compute the right offset when seeking relative to the end of the file (SeekFrom::End).
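To make the dependency on the total size concrete, a general seek has to resolve every `SeekFrom` variant to an absolute offset, roughly as in the sketch below. The `resolve_seek` helper and its exact bounds check are hypothetical and only illustrate the arithmetic; they are not taken from the crate.

```rust
use std::io::{self, SeekFrom};

/// Hypothetical helper: turns a `SeekFrom` into an absolute uncompressed offset.
/// Both the `End` variant and the range check need `total_size`.
fn resolve_seek(current: u64, total_size: u64, from: SeekFrom) -> io::Result<u64> {
    // Use a wide signed type so the additions cannot overflow.
    let target: i128 = match from {
        SeekFrom::Start(offset) => offset as i128,
        SeekFrom::Current(delta) => current as i128 + delta as i128,
        SeekFrom::End(delta) => total_size as i128 + delta as i128,
    };
    if target < 0 || target > total_size as i128 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "seek out of range"));
    }
    Ok(target as u64)
}
```

An internal caller that always knows the absolute offset it wants only ever needs the `Start` case, which requires no total size at all.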

However, FolderReader is not exposed in the public API; it is only used internally to implement the (public) FileReader type, which always knows the absolute offset it wants. This patch therefore replaces the general Seek implementation of FolderReader with a seek_to_uncompressed_offset method.

As for "mapping the uncompressed offset to the corresponding data block", we weren't actually taking advantage of the fact that we could know the offset without re-reading the bytes. The seek implementation was calling load_block() for all blocks between the current position and the seeked-to position anyway, decompressing all the bytes on the way. We still do that; the only difference is that load_block() now also reads the data block information whenever needed.
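The "decompress everything on the way" behaviour can be sketched as a forward-only seek that reads and discards bytes until it reaches the target. This is an illustration under assumptions, not the crate's implementation: `ForwardSeeker` is a hypothetical type, it works on an already-decompressed stream instead of calling load_block(), and rejecting backward seeks and seeks past the end are choices made for this sketch only.

```rust
use std::io::{self, Read};

/// Hypothetical forward-only reader that tracks its uncompressed position.
struct ForwardSeeker<R: Read> {
    inner: R,
    pos: u64,
}

impl<R: Read> ForwardSeeker<R> {
    fn new(inner: R) -> Self {
        ForwardSeeker { inner, pos: 0 }
    }

    /// Moves to `target` by reading and discarding the bytes in between.
    fn seek_to_uncompressed_offset(&mut self, target: u64) -> io::Result<()> {
        if target < self.pos {
            // Assumption for this sketch: backward seeks are simply rejected.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "backward seek not supported in this sketch",
            ));
        }
        let to_skip = target - self.pos;
        // Read at most `to_skip` bytes and throw them away.
        let skipped = io::copy(&mut self.inner.by_ref().take(to_skip), &mut io::sink())?;
        self.pos += skipped;
        if skipped < to_skip {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "seek past end of data"));
        }
        Ok(())
    }
}

impl<R: Read> Read for ForwardSeeker<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.pos += n as u64;
        Ok(n)
    }
}
```

Because the skipped bytes have to be produced anyway, a precomputed offset table would not save any work in this scheme, which matches the observation above.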

I have a use case where I decompress a file that's streaming in from the network. This change allows me to do the decompression incrementally rather than having to wait for the entire file to be downloaded before I can start decompressing it. For large files this eliminates about 3 seconds of wait time at the end of the download.

mdsteele commented 8 months ago

Thanks! (for both the PR and the detailed writeup)

mstange commented 6 months ago

I would appreciate a release with this fix.

mdsteele commented 6 months ago

Sure, just published as v0.6.0.

mstange commented 6 months ago

Awesome, thank you!

mdsteele / rust-cab

Avoid the initial scan over the entire folder #28