Firestar99 opened 2 months ago
Hi, and thanks for the report!

This is indeed quite surprising!

Note that the bulk API is intended to re-use a `Compressor` (or `Decompressor`) between calls; the module-level methods create a (De)compressor every time. Though `zstd::stream::copy_decode` also creates a new context on every call, so it shouldn't be that different...
Background: I use zstd to decompress "block compressed images" (BCn) which have additionally been compressed by zstd before being written to disk, yielding a 33% size reduction. I have 331 images, all 2048x2048 pixels in size and exactly 4MiB large when decompressed, each compressed individually without a dictionary. Some have high variance, others very regular patterns.

When I run my application, I first load the entire binary blob containing all the images from file into memory, before starting to decompress them from a `&[u8]` slice of that buffer into another fixed-size `&mut [u8]` slice allocated beforehand, so neither IO nor allocation should affect the results. For profiling I'm using the `profiling` crate with the `puffin` backend, everything in release ofc, and `puffin_http` to send the profiling results to the external `puffin_viewer`. Tested on a 6900HS.

Using `bulk::decompress_to_buffer` on a single thread takes about 52.8s total, of which 51.2s are taken up by this method. But if I switch to `stream::copy_decode`, it only takes 2.9s, of which 1.3s are spent on decompression.

Just looking at total time spent decompressing, that's a 39.3x speedup! I would honestly have expected the bulk API to be faster in this case, as it's specifically made to deal with slices and having all data present in memory. Any idea what could cause the speed difference?