oleg-st / ZstdSharp

Port of zstd compression library to c#
MIT License
200 stars 29 forks source link

Inconsistent output with different nbWorkers values #31

Open johnsanc314 opened 5 months ago

johnsanc314 commented 5 months ago

I noticed that different nbWorkers parameter values can result in slightly different output for the compressed data.

I cannot find a pattern to when this happens, but when it does, it appears to be consistent for that number of nbWorkers for a particular input. It looks like an additional 3 bytes of "01 00 00" is added at the very end of the compressed stream. This is always the only difference. In my testing thus far, this extra 3 bytes appears to only occur with higher nbWorker values (22+), but some other higher values also omit these 3 bytes.

Any idea why this might be happening?

Example: 12 vs 32 nbWorkers for compressing an ISO file that is originally 3,469,070,336 bytes, packed into a zip archive: image

johnsanc314 commented 5 months ago

I just found this which is the exact same thing I observed: https://github.com/klauspost/compress/issues/277

This seems like a minor thing, but for my particular use case, consistently reproducible archives are important. This appears to be the only issue with consistency I have found. (The multi-threading works wonderful by the way!)

oleg-st commented 5 months ago

Could you provide code and input samples?

gjefferyes commented 5 months ago

compressTest.txt

Attached is a sample program that demonstrates the original problem as described above. It turn out that if you have multiple works, and then you call Flush(), before you Close() the compression stream, then sometimes these 3 extra bytes are added to the output.

Quick solution was to remove the Flush(), and just keep the Close(). (Flush is not needed anyway, as far as I can tell.)

   using (MemoryStream sOut = new MemoryStream())
   {
       Compressor c = new Compressor(1);
       c.SetParameter(ZSTD_cParameter.ZSTD_c_nbWorkers, numberOfWorkers);
       using (var zOut = new ZstdSharp.CompressionStream(sOut, c))
       {
           zOut.Write(mIn, 0, mIn.Length);

           // Remove this Flush and the problem goes away.
           zOut.Flush();
           zOut.Close();
       }
       byte[] outArray=sOut.ToArray();
}

So may well be a non-issue, just need to know not to call Flush before close.

(I can provide a link to the test file we are using, if you want to look at this any further.)

Thanks

oleg-st commented 5 months ago

01 00 00 is an empty last block, if the compressor has no data at the end it outputs an empty last block, if there is data from which a block can be formed, it becomes the last block and the empty last block becomes unnecessary. Flush forces a block to form, which increases the chance of being left with no data at the end.

Changed Flush so that it outputs all the data.

oleg-st commented 5 months ago

@johnsanc314 @gjefferyes Could you check it with master? (with Flush before Dispose)

gjefferyes commented 4 months ago

So I can confirm that with the latest changes if I call flush() before close() then I always get the 01 00 00 on the end of the stream.

If I call zOut.Flush() followed by zOut.Close() I always now get 01 00 00 If I just call zOut.Close() I get the exact same stream without the 01 00 00 on the end.

I guess my next question would be which is 'correct' should a zstd stream have the 01 00 00 on the end of it, or is this not needed? I know the length of the file I am decompressing, so I don't run into any decompression issues with the 01 00 00 not being there, so I don't feel like it is needed that I actually call Flush at all?

(It would appear that the 7z zstd fork does not have 01 00 00 on the end of its streams)

oleg-st commented 4 months ago

Both situations are correct, if at Close compressor has no data for the last block, then there will be an empty last block 01 00 00 (which says that compression stream is finished). To always have this situation you should make Flush before Close. In .NET Streams, there is no way to specify that a block is last in the Write method.