mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Possible optimization? #12

Closed Vadiml1024 closed 1 year ago

Vadiml1024 commented 1 year ago

I've stumbled on the following scenario:

I'm mounting a .zip archive with ratarmount. The archive contains 10 .tar files, each 10G in size. I have a 3rd party antivirus program which scans the mount point and which is EXTREMELY slow (relative to others), so I analyzed its behavior with strace. It seems that it tries to determine the file size using the following (or similar) code:

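       long pos = ftell(fp);          /* remember the current position */
       fseek(fp, 0, SEEK_END);        /* jump to the end of the file */
       long size = ftell(fp);         /* the end offset is the file size */
       fseek(fp, pos, SEEK_SET);      /* restore the original position */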

Of course the first fseek causes ratarmount to fully decompress the member of the .zip file, which takes a LOT of time. So I wonder: is it possible to make ratarmount postpone the actual seek until the next read or write operation? Could the seek(..., SEEK_END) call simply set a virtual offset to the size retrieved from the associated struct stat?
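
To make the idea concrete, here is a minimal Python sketch of such a deferred seek. It is only an illustration under stated assumptions, not ratarmount's code: the class name LazySeekReader is hypothetical, and the file size is assumed to be known in advance (for example from stat or from the archive metadata).

```python
import io


class LazySeekReader:
    """Hypothetical wrapper: seek() only records the target offset; the
    expensive seek on the underlying stream happens on the next read()."""

    def __init__(self, raw, size):
        self._raw = raw      # underlying file object that is slow to seek
        self._size = size    # assumed known in advance, e.g. from stat
        self._virtual = 0    # offset the caller believes it is at
        self._real = 0       # offset the underlying stream is actually at

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._virtual + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if target < 0:
            raise ValueError("negative seek position")
        # Only bookkeeping happens here; the underlying stream is untouched.
        self._virtual = target
        return self._virtual

    def tell(self):
        return self._virtual

    def read(self, size=-1):
        # Perform the real (possibly expensive) seek only when data is needed.
        if self._real != self._virtual:
            self._raw.seek(self._virtual)
            self._real = self._virtual
        data = self._raw.read(size)
        self._real += len(data)
        self._virtual = self._real
        return data
```

With a wrapper like this, the ftell / fseek(SEEK_END) / fseek(SEEK_SET) sequence above never touches the underlying stream, because the position ends up where it started before any read happens.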

mxmlnkn commented 1 year ago

I think this is a duplicate of https://github.com/mxmlnkn/ratarmount/issues/105. Pragzip is not yet used for zip files. Your case should work after it has been integrated.

Vadiml1024 commented 1 year ago

IMHO it is related but not the same. In the .tar.gz case, ratarmount has to inflate the whole file to build its index (btw, it might be interesting to implement some sort of lazy mode in which the whole file is not inflated when mounting, so that the mount is fast and inflating is postponed until the actual read/write/readdir).

In the case of a .zip file there is no need at all to inflate anything upon mount....

mxmlnkn commented 1 year ago

But didn't you say you were mounting a ".zip archive"?

For .tar.gz, I don't see any way around inflating the whole file. That's because the metadata for each file can be anywhere inside the TAR, so I have to go over it once to collect all file names. Not inflating the whole file would mean that some file names would be missing in the mount point. I am simply skipping over the file contents during metadata gathering, but gzip does not allow skipping data. Seeking inside the gzip stream also needs an index in the first place. I could try to start decoding in the middle of a gzip file, but it would never be guaranteed that this would work and I wouldn't even know which decompressed offset I am currently at. I need to know all the data before it to determine that.
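
As an illustration of why one full pass is needed, here is a minimal sketch using Python's standard tarfile module in its forward-only streaming mode. It is not ratarmount's indexing code, and the function name collect_tar_gz_metadata is hypothetical.

```python
import io
import tarfile


def collect_tar_gz_metadata(fileobj):
    """Stream through a .tar.gz once and record where each member's data lives."""
    entries = []
    # "r|gz" is tarfile's forward-only streaming mode: it cannot seek, so all
    # compressed data in front of a TAR header is inflated to reach that header.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            # offset_data is the position of the member's contents in the
            # decompressed TAR stream.
            entries.append((member.name, member.offset_data, member.size))
    return entries
```

The last file name is only known once the loop has consumed the entire stream, which is why stopping early would leave files missing from the mount point.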

In the case of a .zip file there is no need at all to inflate anything upon mount....

That is correct but in your original post you were talking about seeking to the end of zip members ...

mxmlnkn commented 1 year ago

I was not able to reproduce your observed behavior. I have tried:

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/10k-1MiB-files.tar", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

Getting the file size like this completes in 19 ms. This indicates that it does not actually decompress the whole member; it simply seeks to the end and returns the size without any decompression.
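
For context on why the size can be available without decompression: a zip archive records each member's uncompressed size in its central directory. The following is a minimal sketch with Python's standard zipfile module, not ratarmount's implementation; the function name member_sizes is hypothetical.

```python
import io
import zipfile


def member_sizes(zip_file):
    """Return {name: (uncompressed size, compressed size)} for a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        # infolist() only uses the central directory stored at the end of the
        # archive; no member data is decompressed here.
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in archive.infolist()
        }
```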

I'm closing it for now.

Please provide a bash script to reproduce the issue. It should also probably be filed as an issue in the ratarmount repository rather than here in the pragzip repository.

Are you by chance using ratarmount --lazy --recursive? In that case, it would make some sense that the index is built when accessing the file. To be precise, it should be built when accessing the parent directory.

Vadiml1024 commented 1 year ago

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/10k-1MiB-files.tar", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

Are you sure this is the script you used for testing? As quoted, it cannot work: the file mounted/10k-1MiB-files.tar will not be present in large.zip.

mxmlnkn commented 1 year ago

Yeah, sorry. I wanted to make the script more generic and reproducible by replacing 10k-1MiB-files.tar with a simple base64 file, but forgot to adjust the last line. The 10k-1MiB-files.tar also contains only base64 data, so it is similarly compressible and is also added as a compressed member. I checked that with zipinfo. Here is the adjusted script:

base64 /dev/urandom | head -c $(( 8 * 1024 * 1024 * 1024 )) > large
zip large.zip large
ratarmount large.zip mounted
python3 -c 'import io; file=open("mounted/large", "rb"); file.seek(0, io.SEEK_END); print(file.tell())'

The result is the same. It takes ~20ms.