We were reading information about all data blocks of a folder when
creating the FolderReader. Then, once we actually proceeded to do the
decompression by reading from the FolderReader, we were reading the same
parts of the cab file again.

This commit changes the implementation so that we only read the data
blocks once we decompress them, and not upfront. This means that we only
do a single pass over the folder data in the common case, and it means
we don't need to read the data at the end of the folder before we can
start the decompression work.

The eagerly-read data block information was used for two purposes:
 1. For computing the total uncompressed size.
 2. When seeking to a new spot in the uncompressed data, for mapping the
    uncompressed offset to the corresponding data block and its start
    offset in the compressed data.

The total size was used in the Seek implementation to throw an error when
seeking beyond the file end, and to compute the right offset when
seeking relative to the end of the file (SeekFrom::End).

However, FolderReader is not exposed in the public API; it is only used
internally for the implementation of the (public) FileReader type.
This patch therefore replaces the general Seek implementation of
FolderReader with a seek_to_uncompressed_offset method, which does not
need the total size.

As for "mapping the uncompressed offset to the corresponding data
block", we weren't actually taking advantage of the fact that we could
know the offset without re-reading the bytes. The seek implementation
was calling load_block() for all blocks between the current position
and the seeked-to position anyway, decompressing all the bytes on the
way. So we still do that, except that load_block() now also reads the
data block information whenever needed.

I have a use case where I decompress a file that's streaming in from the
network. This change allows me to do the decompression incrementally
rather than having to wait for the entire file to be downloaded before
I can start decompressing it. For large files this eliminates about 3
seconds of wait time at the end of the download.
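
The lazy block loading and forward seek described above can be sketched
roughly as follows. This is a minimal illustration, not the code in this
patch: LazyFolderReader and its fields are hypothetical names, the block
layout (a 2-byte little-endian length followed by the payload) is a
made-up stand-in for the real cab data block format, payloads are stored
uncompressed, and backward seeks are simply rejected.

```rust
use std::io::{self, Read};

/// Hypothetical, simplified stand-in for FolderReader.
/// Assumed block layout: u16 little-endian payload length, then the payload.
/// A real reader would decompress the payload; this sketch stores it as-is.
struct LazyFolderReader<R: Read> {
    inner: R,
    block: Vec<u8>,      // contents of the currently loaded block
    block_start: u64,    // uncompressed offset at which `block` begins
    pos_in_block: usize, // read position inside `block`
}

impl<R: Read> LazyFolderReader<R> {
    fn new(inner: R) -> Self {
        // No upfront pass over the block headers happens here.
        LazyFolderReader { inner, block: Vec::new(), block_start: 0, pos_in_block: 0 }
    }

    /// Reads the next block header and payload on demand.
    /// Returns Ok(false) when the input ends where a header would start.
    fn load_block(&mut self) -> io::Result<bool> {
        let mut header = [0u8; 2];
        match self.inner.read_exact(&mut header) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(false),
            Err(e) => return Err(e),
        }
        let len = u16::from_le_bytes(header) as usize;
        let mut payload = vec![0u8; len];
        self.inner.read_exact(&mut payload)?;
        // The new block starts where the previous one ended.
        self.block_start += self.block.len() as u64;
        self.block = payload;
        self.pos_in_block = 0;
        Ok(true)
    }

    /// Moves forward to `offset` by loading every block on the way.
    fn seek_to_uncompressed_offset(&mut self, offset: u64) -> io::Result<()> {
        if offset < self.block_start {
            // Backward seeks are outside the scope of this sketch.
            return Err(io::Error::new(io::ErrorKind::Unsupported, "backward seek"));
        }
        while offset >= self.block_start + self.block.len() as u64 {
            if !self.load_block()? {
                return Err(io::Error::new(io::ErrorKind::InvalidInput, "offset at or past end"));
            }
        }
        self.pos_in_block = (offset - self.block_start) as usize;
        Ok(())
    }
}

impl<R: Read> Read for LazyFolderReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Load the next block only once the current one is exhausted.
        while self.pos_in_block == self.block.len() {
            if !self.load_block()? {
                return Ok(0);
            }
        }
        let available = &self.block[self.pos_in_block..];
        let n = available.len().min(buf.len());
        buf[..n].copy_from_slice(&available[..n]);
        self.pos_in_block += n;
        Ok(n)
    }
}
```

Because each header is read only when its block is needed, a reader like
this can start producing output as soon as the first block has arrived,
which is the property the streaming use case relies on.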