luben / zstd-jni

JNI binding for Zstd

Zstd.decompress is extremely slow with a too-big originalSize parameter #239

Closed Okapist closed 1 year ago

Okapist commented 1 year ago

Example:

Get some random data and compress it:

  var data = new byte[100_000];
  Random r = new Random();
  r.nextBytes(data);
  byte[] compressedData = Zstd.compress(data, 6);

And run benchmarks:

    @Benchmark
    public void testOne(Blackhole blackhole) {
        blackhole.consume(Zstd.decompress(compressedData, 1_000_000));
    }

    @Benchmark
    public void testTen(Blackhole blackhole) {
        blackhole.consume(Zstd.decompress(compressedData, 10_000_000));
    }

    @Benchmark
    public void testNo(Blackhole blackhole) {
        blackhole.consume(Zstd.decompress(compressedData, (int)Zstd.decompressedSize(compressedData)));
    }

Result:

Benchmark                               Mode  Cnt        Score   Error  Units
BenchmarkZstd.testNo                    avgt         12918,663          ns/op
BenchmarkZstd.testOne                   avgt         76446,863          ns/op
BenchmarkZstd.testTen                   avgt        562683,985          ns/op

Maybe fix it with something like this?

    public static byte[] decompress(byte[] src, int originalSize) {
        ZstdDecompressCtx ctx = new ZstdDecompressCtx();
        try {
            int realSize = (int)Zstd.decompressedSize(src);
            if (originalSize > 1000 && realSize < (originalSize >> 1)) {
                byte[] smallResult = ctx.decompress(src, realSize);
                return Arrays.copyOfRange(smallResult, 0, originalSize);
            } else {
                return ctx.decompress(src, originalSize);
            }
        } finally {
            ctx.close();
        }
    }
luben commented 1 year ago

I guess the overhead comes from allocating the big array and later copying out just the decompressed part when the originalSize turns out to be wrong. I don't think this is a failure of Zstd itself; it's mostly down to supplying a wrong originalSize.
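
For illustration, a rough sketch of that allocate-then-copy pattern (not the library's exact code; it assumes ZstdDecompressCtx.decompress(byte[] dst, byte[] src) returns the number of bytes actually written):

    import java.util.Arrays;
    import com.github.luben.zstd.ZstdDecompressCtx;

    // Sketch only: a too-large originalSize forces a big zeroed allocation up front,
    // and then a second allocation plus a copy once the real size turns out smaller.
    static byte[] decompressWithHint(byte[] compressed, int originalSize) {
        ZstdDecompressCtx ctx = new ZstdDecompressCtx();
        try {
            byte[] dst = new byte[originalSize];            // e.g. 10 MB zeroed for ~100 KB of data
            int written = ctx.decompress(dst, compressed);  // bytes actually written
            return written == originalSize
                    ? dst
                    : Arrays.copyOfRange(dst, 0, written);  // extra allocation + copy
        } finally {
            ctx.close();
        }
    }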

Using Zstd.decompressedSize is interesting, but it may also be limited, as not all zstd-compressed payloads have their original size embedded in the header, e.g. if the data was compressed with ZstdOutputStream.
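
For example (a sketch; the value decompressedSize reports when the frame header carries no content size is assumed here to be 0, which is worth double-checking against the javadoc):

    import java.io.ByteArrayOutputStream;
    import com.github.luben.zstd.Zstd;
    import com.github.luben.zstd.ZstdOutputStream;

    // One-shot compression embeds the content size in the frame header,
    // while stream compression generally does not.
    byte[] data = new byte[100_000];

    byte[] oneShot = Zstd.compress(data, 6);
    System.out.println(Zstd.decompressedSize(oneShot));   // 100000

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZstdOutputStream zos = new ZstdOutputStream(bos)) {
        zos.write(data);
    }
    byte[] streamed = bos.toByteArray();
    System.out.println(Zstd.decompressedSize(streamed));  // assumed 0, i.e. unknown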

Okapist commented 1 year ago

What is the best practice for my case? I think it's a typical use case.

I have some compressed data. The typical size is about 100 KB, but sometimes it's several megabytes. I want to decompress all of this data, and I want some zip-bomb protection. If it's more than 20 MB, I don't need the data; an exception or null is fine.

I wrote something like this:

    public byte[] decompress(byte[] data) {
        long size = -1;
        try {
            size = Zstd.decompressedSize(data); // possible exception, I guess, on data without the size in the header
        } catch (Exception ex) {
            return Zstd.decompress(data, 20_000_000);
        }
        if (size < 1 || size >= 20_000_000)
            return Zstd.decompress(data, 20_000_000);
        else
            return Zstd.decompress(data, (int) size);
    }

Maybe add this code to the library? For people who don't know the size and simply want to decompress with zip-bomb protection.

mortengrouleff commented 1 year ago

I think best practice is to include the exact size of the uncompressed payload along with the compressed data, and then apply that size when decompressing. You probably also want a checksum on the compressed payload as well as, inside it, on the uncompressed data, to avoid issues with broken payloads… I think you want to do this “outside” of the zstd library, in your own application.
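
For what it's worth, a minimal sketch of that kind of framing (the layout, names, and the CRC32 choice are all hypothetical, not anything zstd-jni provides): [int uncompressed length][long CRC32 of uncompressed data][zstd payload].

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;
    import com.github.luben.zstd.Zstd;

    // Pack: prefix the zstd payload with the exact uncompressed length and a checksum.
    public static byte[] pack(byte[] raw, int level) {
        CRC32 crc = new CRC32();
        crc.update(raw);
        byte[] compressed = Zstd.compress(raw, level);
        return ByteBuffer.allocate(12 + compressed.length)
                .putInt(raw.length)
                .putLong(crc.getValue())
                .put(compressed)
                .array();
    }

    // Unpack: enforce the size limit before decompressing, then verify the checksum.
    public static byte[] unpack(byte[] framed, int maxSize) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int originalSize = buf.getInt();
        long expectedCrc = buf.getLong();
        if (originalSize < 0 || originalSize > maxSize)
            throw new IllegalArgumentException("refusing to decompress " + originalSize + " bytes");
        byte[] compressed = new byte[buf.remaining()];
        buf.get(compressed);
        byte[] raw = Zstd.decompress(compressed, originalSize); // exact size, so no oversized buffer
        CRC32 crc = new CRC32();
        crc.update(raw);
        if (crc.getValue() != expectedCrc)
            throw new IllegalStateException("corrupt payload");
        return raw;
    }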


Okapist commented 1 year ago

Thanks. I'll change my data store scheme.

Maybe add some comments to the library code, noting that an originalSize much larger than the real uncompressed data size dramatically drops performance? It's unintuitive behavior.

luben commented 1 year ago

Yes, I think a warning in the API docs would be helpful.
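
For example, a note in the Zstd.decompress javadoc could say something along these lines (wording is only a suggestion):

    /**
     * ...
     * Note: originalSize should be the exact uncompressed size (or a close upper
     * bound). Passing a value much larger than the real size forces a large zeroed
     * allocation plus an extra copy, which can dominate the decompression cost.
     */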