gyscos / zstd-rs

A Rust binding for the zstd compression library.

`bulk::decompress_to_buffer` is 40 times slower than `stream::copy_decode` #291

Open · Firestar99 opened this issue 2 months ago

Firestar99 commented 2 months ago

Background: I use zstd to decompress "block compressed images" (BCn) which have additionally been compressed with zstd before being written to disk, yielding a 33% size reduction. I have 331 images, all of them 2048x2048 pixels in size and exactly 4MiB when decompressed, each compressed individually without a dictionary. Some have high variance, others very regular patterns.

When I run my application, I first load the entire binary blob containing all the images from file into memory, then decompress each image from a &[u8] slice of that buffer into a fixed-size &mut [u8] slice allocated beforehand, so neither IO nor allocation should affect the results. For profiling I'm using the profiling crate with the puffin backend (everything in release ofc) and puffin_http to send the profiling results to the external puffin_viewer. Tested on a 6900HS.
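
Roughly, the decode loop looks like the sketch below (simplified for illustration; ImageEntry, decode_all, DECODED_SIZE and the layout fields are placeholders, not my real types):

use std::fs;
use std::io;

/// Placeholder describing where one compressed image sits inside the blob.
struct ImageEntry {
    offset: usize,
    compressed_len: usize,
}

/// Every image is exactly 4 MiB once decompressed.
const DECODED_SIZE: usize = 4 * 1024 * 1024;

fn decode_all(blob_path: &str, entries: &[ImageEntry]) -> io::Result<()> {
    // The whole binary blob is read into memory up front...
    let blob = fs::read(blob_path)?;
    // ...and the destination buffer is allocated once, so neither IO nor
    // allocation shows up in the per-image decompression timing.
    let mut dst = vec![0u8; DECODED_SIZE];

    for entry in entries {
        let src = &blob[entry.offset..entry.offset + entry.compressed_len];
        // This call is what gets swapped between the bulk and stream variants below.
        let written = zstd::bulk::decompress_to_buffer(src, &mut dst[..])?;
        assert_eq!(written, dst.len(), "all bytes written");
    }
    Ok(())
}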

Using bulk::decompress_to_buffer on a single thread takes about 52.8s in total, of which 51.2s are spent in this method:

#[profiling::function]
fn decode_bcn_zstd_into(&self, src: &[u8], dst: &mut [u8]) -> io::Result<()> {
    let written = zstd::bulk::decompress_to_buffer(src, dst)?;
    assert_eq!(written, dst.len(), "all bytes written");
    Ok(())
}

But if I switch to stream::copy_decode, it only takes 2.9s, of which 1.3s are spent on decompression:

#[profiling::function]
fn decode_bcn_zstd_into(&self, src: &[u8], mut dst: &mut [u8]) -> io::Result<()> {
    zstd::stream::copy_decode(src, &mut dst)?;
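    // The Write impl for &mut [u8] advances the slice as bytes are written,
    // so an empty remainder means the buffer was filled exactly.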
    assert_eq!(0, dst.len(), "all bytes written");
    Ok(())
}
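
For comparison, copy_decode should be roughly equivalent to driving a streaming decoder by hand, something like this (sketch; I haven't checked the crate's exact internals):

fn decode_streaming_explicit(src: &[u8], mut dst: &mut [u8]) -> std::io::Result<()> {
    // Wrap the compressed slice in a streaming Decoder and copy it into the
    // destination writer; writing to a &mut [u8] advances the slice.
    let mut decoder = zstd::stream::read::Decoder::new(src)?;
    std::io::copy(&mut decoder, &mut dst)?;
    assert_eq!(0, dst.len(), "all bytes written");
    Ok(())
}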

Just looking at total time spent decompressing, that's a 39.3x speedup! I would honestly have expected the bulk API to be faster in this case, as it's specifically made for slices with all the data already present in memory. Any idea what could cause the speed difference?

gyscos commented 1 month ago

Hi, and thanks for the report!

This is indeed quite surprising! Note that the bulk API is intended to re-use a Compressor (or Decompressor) between calls; the module-level functions create a (De)compressor on every call. Then again, zstd::stream::copy_decode also creates a new context each time, so that alone shouldn't make this much of a difference...
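
For reference, reusing a single context would look roughly like this (untested sketch; the function and parameter names are made up):

use std::io;

fn decode_all_with_reuse(images: &[&[u8]], dst: &mut [u8]) -> io::Result<()> {
    // Create the decompression context once...
    let mut decompressor = zstd::bulk::Decompressor::new()?;
    for &src in images {
        // ...and reuse it for every image instead of rebuilding it per call.
        let written = decompressor.decompress_to_buffer(src, &mut *dst)?;
        assert_eq!(written, dst.len(), "all bytes written");
    }
    Ok(())
}

Reusing the Decompressor avoids recreating the underlying zstd context for every image, though as noted above that alone probably doesn't explain a 40x gap.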