pauldmccarthy / indexed_gzip

Fast random access of gzip files in Python
Other
99 stars 26 forks source link

Decompression performance #89

Closed mxmlnkn closed 2 years ago

mxmlnkn commented 2 years ago

Hello,

I noticed that indexed_gzip seems to be 10 times slower than archivemount and fuse-archive when decoding large gzipped TARs with ratarmount. I would think that both use zlib, which makes this surprising to me. Are you compiling with -O3? If not, would that help? Or maybe are there other bottlenecks like storing the windows? I just think that this 10x difference would be really helpful to improve upon. I'm pretty sure that it isn't ratarmount who is at fault here but some benchmarks using only indexed_gzip or even the C backend directly still might be helpful for performance "debugging".

archivemount-comparison

mxmlnkn commented 2 years ago

Archivemount does not use zlib directly but libarchive, hence the name. But libarchive seems to use zlib, which makes some of the timing differences a bit weirder.

Here are some more low-level timings for decompressing 397MiB gzipped data to to 512 MiB base64-encoded random data (so basically pure Huffman-encoding with only a few LZ77 backreferences):

base64 /dev/urandom | head -c $(( 512*1024*1024 )) > small
gzip -k small
tar -cf small.tar small
gzip small.tar

Running my self-written benchmark inside the indexed_bzip2 repository:

Decompressing with zlib: min: 2.19763s max: 2.20379 s,  min: 189.079 MB/s max: 189.608 MB/s
Decompressing with libarchive: min: 1.97612s max: 1.98285 s,  min: 210.147 MB/s max: 210.863 MB/s
Decompressing with custom decoder (pragzip): min: 8.78324s max: 9.04052 s,  min: 46.0913 MB/s max: 47.4414 MB/s

Some benchmarks via the command line:

python3 -c 'import indexed_gzip as igz; import time; t0 = time.time(); igz.open("small.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 4.594 s
python3 -c 'import gzip; import time; t0 = time.time(); gzip.open("small.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 3.069 s
time python3 -m pgzip -o - small.gz | wc -c
    real    0m3.289s
time archivemount small.tar.gz mountpoint/
    real    0m1.657s
time ( fuse-archive small.tar.gz mountpoint && stat mountpoint &>/dev/null; )
    real    0m1.697s
time gzip -d -k -c small.gz | wc -c
    real    0m3.510s
time pigz -d -k -c small.gz | wc -c
    real    0m2.225s
time bgzip -d -c small.gz | wc -c
    real    0m2.548s
time igzip -d -c small.gz | wc -c
    real    0m1.523s
time minigzip-zlib-ng -d -k -c small.gz | wc -c
    real    0m1.903s
time gunzip-libdeflate -d -k -c small.gz | wc -c
    real    0m1.841s

time pugz small.gz | wc -c
    real    0m2.372s
time pugz -t 4 small.gz | wc -c
    using 4 threads for decompression (experimental)
    real    0m0.975s
time pugz -t $( nproc ) small.gz | wc -c
    using 24 threads for decompression (experimental)
    real    0m0.985s

Some observations:

Repeat benchmarks with tarred and gzipped Silesia corpus like lzbench uses.

make benchmarkGzip && tests/benchmarkGzip silesia.tar.gz
    Decompressing 68238807 B to 211957760 B with zlib: min: 0.643727s max: 0.652675 s,  min: 104.553 MB/s max: 106.006 MB/s
    Decompressing 68238807 B to 211957760 B with libarchive: min: 0.545091s max: 0.554049 s,  min: 123.164 MB/s max: 125.188 MB/s
    Decompressing 68238807 B to 211957760 B with custom decoder (pragzip): min: 0.127799s max: 0.129439 s,  min: 527.189 MB/s max: 533.956 MB/s

python3 -c 'import indexed_gzip as igz; import time; t0 = time.time(); igz.open("silesia.tar.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 1.249 s
python3 -c 'import gzip; import time; t0 = time.time(); gzip.open("silesia.tar.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 0.908 s
time python3 -m pgzip -o - silesia.tar.gz | wc -c
    real    0m1.837s
time archivemount small.tar.gz mountpoint/
    real    0m1.663s
time ( fuse-archive small.tar.gz mountpoint && stat mountpoint &>/dev/null; )
    real    0m1.629s
time gzip -d -k -c silesia.tar.gz | wc -c
    real    0m1.000s
time pigz -d -k -c silesia.tar.gz | wc -c
    real    0m0.622s
time bgzip -d -c silesia.tar.gz | wc -c
    real    0m0.700s
time igzip -d -c silesia.tar.gz | wc -c
    real    0m0.357s
time pugz silesia.tar.gz | wc -c
    terminate called after throwing an instance of 'gzip_error'
      what():  INVALID_LITERAL
    0

    real    0m0.078s
    user    0m0.005s
    sys 0m0.006s

time minigzip-zlib-ng -d -k -c silesia.tar.gz | wc -c
    real    0m0.443s
time gunzip-libdeflate -d -k -c silesia.tar.gz | wc -c
    real    0m0.403s
time crc32 silesia.tar
    78e42bf0
    real    0m0.236s

Some observations on Silesia decompression:

pauldmccarthy commented 2 years ago

Hi @mxmlnkn, I haven't spent any time optimising the indexed_gzip codebase, so I'm sure performance could be improved. The core of indexed_gzip is essentially an adaptation of the original zran.c example that ships with the zlib source code. It bears little resemblance to the original version, but it does share the same fundamental design for reading/inflation.

The conda-forge packages are built with -O2. I think GHA has deleted the logs for the pypi builds, so I'm not sure about them.

However, using indexed_gzip purely for a one-off decompress isn't intended usage, so I'm not sure if these comparisons makes a lot of sense. The point of indexed_gzip is for improved random seek time, after an index has been built.

(edited - added example code)

If you are interested in "raw" performance, you could adjust your benchmark code to disable the CRC32 check, increased read buffer sizes, and with a seek point spacing larger than the size of the uncompressed data, which would effectively disable creation of the index. e.g.:

import indexed_gzip as igzip
IndexedGzipFile.open('file.gz', spacing=2**30, readbuf_size=2**30, buffer_size=2**30).read()

This should cause the read to be performed via a single call to the zlib inflate function.

mxmlnkn commented 2 years ago

I get some very weird results when testing out these parameters. When I increase the buffers or even just the spacing, then it takes a lot more time. This might explain why I observed the 10x slower speed with ratarmount! I think, I changed the default spacing to 16 MiB.

python3 -m pip install --user pgzip indexed_gzip
python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz").read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 4.666 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.IndexedGzipFile("small.gz", spacing=2**30, readbuf_size=2**30, buffer_size=2**30).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Floating point exception

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(1*1024**2), readbuf_size=int(1*1024**2), buffer_size=int(1*1024**2)).read(); 
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 4.286 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(32*1024**2), readbuf_size=int(1*1024**2), buffer_size=int(1*1024**2)).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 7.407 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(128*1024**2), readbuf_size=int(1024**2), buffer_size=int(128*1024**2)).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 13.290 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.IndexedGzipFile("small.gz", spacing=16*1024**2).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 6.403 s

Note also the weird floating point exception when I try the large buffer / spacing sizes you recommended.

Also, I think there might be a problem with the buffers because even 512 MiB buffers should not slow down a program that much. Except maybe if you are clearing them to zeros without using memset or something like that..

pauldmccarthy commented 2 years ago

That does sound strange - I'll look into it when I get a minute...

mxmlnkn commented 2 years ago

So I did try to look at the code and did some benchmarks and I arrived at the conclusion that the underlying problem is that zran_read always tries to seek from an index point even if no seek was actually applied since the last call to zran_read. But I don't feel up to the task of fixing it because that would mean checking when seeking is actually necessary and also storing the zlib state of the last call. I'm not sure whether this information is already available or needs to be added somewhere.

As a quick workaround, zran_read should only be called to decode buffers much longer than the index spacing, which might be easily doable in Python by wrapping IndexedGzipFile in an io.BufferedReader with buffer size ~ 2-10x index spacing.

The above explanation does still not explain why runtimes get worse for larger readbuf_size. But, I saw a lot of heuristics like data_size = 4 * spacing in zran_expand_index, which shouldn't even be necessary, a hardcoded data_size = 128 * 1024 should also work and be faster and more memory-efficient. Similarly, there were a lot of usages of calloc, where I think malloc should have sufficed. I'm not sure what the performance difference between these two are, but calloc sounds slower.

Here are some benchmarks I did for varying configurations. It can be seen that calls with 4 KiB even slows down Python's GzipFile, but everything above 32 KiB should be sufficiently large reads for optimal speed. This is a stark difference to IndexedGzipFile which might lead to performance bugs like I observed with ratarmount. With large enough reads (in the limit, reading the whole sample file at once), the speeds are good though.

1.086  s Read 4      KiB chunks from GzipFile                                                                            
0.817  s Read 32     KiB chunks from GzipFile                                                                            
0.774  s Read 128    KiB chunks from GzipFile                                                                            
0.760  s Read 1024   KiB chunks from GzipFile                                                                            
0.811  s Read 32768  KiB chunks from GzipFile                                                                            
0.811  s Read 131072 KiB chunks from GzipFile                                                                            

1.722  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.487  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.497  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.470  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.032  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
0.739  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       

1.788  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.467  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.454  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.450  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.036  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
0.739  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      

9.044  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.957  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.902  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.854  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
1.413  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
0.733  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  

9.056  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.824  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.834  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.886  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
1.415  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
0.736  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 

13.596 s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.536 s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.491 s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.500 s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
1.697  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
0.787  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  

13.745 s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.518 s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.451 s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.468 s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
1.669  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
0.770  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
mxmlnkn commented 2 years ago

I saw that you are already using a BufferedReader and that there even is a buffer_size argument for IndexedGzipFile. I added a better default in #90 if it has not been specified, which increases performance by ~10x. Here are the results for buffer-size = 4 * spacing.


1.157 s Read      4 KiB chunks from GzipFile
0.816 s Read     32 KiB chunks from GzipFile
0.781 s Read    128 KiB chunks from GzipFile
0.773 s Read   1024 KiB chunks from GzipFile
0.811 s Read  32768 KiB chunks from GzipFile
0.812 s Read 131072 KiB chunks from GzipFile

1.369 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.181 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.162 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.167 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.026 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
0.728 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)

1.321 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.169 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.150 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.157 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.040 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
0.723 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)

1.028 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.761 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.741 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.736 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.783 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.734 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)

1.044 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.766 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.743 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.735 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.780 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.732 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)

1.090 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.807 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.783 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.781 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.814 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.758 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)

1.056 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.803 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.775 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.773 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.825 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.775 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)