tukaani-project / xz

XZ Utils
https://tukaani.org/xz/

[Bug?]: Default option. Non determinism in multithreads? #85

Closed: milkylainen closed this issue 9 months ago

milkylainen commented 9 months ago

Well, this is not really a bug, but more a question about the default and about helpful information. Please correct me if I'm wrong here. I've always lived with the assumption that using xz with more than one thread isn't deterministic, i.e., that the compressed output varies with the number of threads. With the default changed, various users, such as packagers and embedded build environments, may start to see different results when building on different machines.

Has anything changed in the determinism department, or should users simply expect the defaults to produce varying results?

Version

5.6

Operating System

Linux

Relevant log output

No response

thesamesam commented 9 months ago

Please see https://github.com/tukaani-project/xz/commit/6daa4d0ea46a8441f21f609149f3633158bf4704:

  • Output from single-threaded and multi-threaded compressors differ but such changes could happen for other reasons too (they just haven't happened since 5.0.0).

I believe (although see if a maintainer confirms) that the threaded compressor is deterministic - it doesn't depend on the thread count and so on, so even with 1 thread, the threaded compressor output is the same. I believe the only difference is that it's chunked / includes sizes so it can be decompressed in parallel.

It's just that it's different compared to the non-threaded compressor.

But the non-threaded compressor could've changed output at some point anyway, it just didn't.

xz(1) also says:

No size information is stored in block headers, thus files created in single-threaded mode won’t be identical to files created in multi-threaded mode.

In multi-threaded mode the sizes of the blocks are stored in the block headers. This isn’t done in single-threaded mode, so the encoded output won’t be identical to that of the multi-threaded mode.

The single-threaded and multi-threaded compressors produce different output. Single-threaded compressor will give the smallest file size but only the output from the multi-threaded compressor can be decompressed using multiple threads. Setting threads to 1 will use the single-threaded mode. Setting threads to any other value, including 0, will use the multi-threaded compressor even if the system supports only one hardware thread. (xz 5.2.x used single-threaded mode in this situation.)
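
To make the block header difference concrete, here is a rough sketch in Python that inspects the first block header of an .xz stream. It assumes the layout from the .xz file format specification (a 12-byte stream header, then a block header whose second byte holds the flags, with 0x40 and 0x80 marking the compressed and uncompressed size fields). The helper name `first_block_size_flags` is made up for illustration. Python's `lzma` module uses the single-threaded encoder, so it is a handy source of single-threaded output.

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"   # first 6 bytes of every .xz stream
STREAM_HEADER_SIZE = 12      # magic (6) + stream flags (2) + CRC32 (4)


def first_block_size_flags(xz_bytes):
    """Return (compressed_size_present, uncompressed_size_present)
    for the first block of an .xz stream (hypothetical helper)."""
    if xz_bytes[:6] != XZ_MAGIC:
        raise ValueError("not an .xz stream")
    # A 0x00 byte here is the index indicator, i.e. the stream has no blocks.
    if xz_bytes[STREAM_HEADER_SIZE] == 0x00:
        raise ValueError("stream contains no blocks")
    flags = xz_bytes[STREAM_HEADER_SIZE + 1]
    return bool(flags & 0x40), bool(flags & 0x80)


if __name__ == "__main__":
    # lzma.compress() goes through the single-threaded encoder.
    single = lzma.compress(b"example payload " * 1024)
    print(first_block_size_flags(single))
```

For single-threaded output both flags should be unset. Running the same helper on a file produced by the multi-threaded compressor (for example `xz -T2`) should report both as set, which is the size information the man page refers to.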

milkylainen commented 9 months ago

Oh. Looks like I've been mistaken and using -T0 should always result in deterministic results. Hopefully a maintainer can confirm. @thesamesam, appreciate the hint!

JiaT75 commented 9 months ago

Hello!

I have seen this misconception before and I can understand where it comes from. The short answer is that multi-threaded compression mode is in fact deterministic.

It does not matter how many threads are used. @thesamesam is correct: multi-threaded encoding mode with 1 thread will produce the same output as with 10 threads. The output is different from that of single-threaded mode, which is where the confusion arises. Single-threaded mode does not put the block sizes in the headers and will put all of the data in a single block by default.

Setting -T0 will always use multi-threaded mode. In the past, if only one thread was used in -T0 mode, xz would operate in single-threaded mode and thus produce single-threaded output. So I believe that is where the belief in non-determinism originated, since the output would sometimes be different when using -T0. This is no longer the case.
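
If you want to check this on your own machine, the following is a minimal sketch in Python that pipes the same input through the `xz` command line tool with different thread counts and compares SHA-256 digests of the output. It assumes an `xz` binary with threading support is on your PATH. The helper names `xz_digest` and `digests_by_thread_count` are made up for illustration, and `--block-size=1MiB` is only there so that a small input is split into several blocks.

```python
import hashlib
import shutil
import subprocess


def xz_digest(data, threads, xz_path="xz"):
    """Compress data with `xz -T<threads>` and return the SHA-256
    hex digest of the resulting .xz stream (hypothetical helper)."""
    result = subprocess.run(
        [xz_path, "-c", f"-T{threads}", "--block-size=1MiB"],
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return hashlib.sha256(result.stdout).hexdigest()


def digests_by_thread_count(data, counts=(2, 4, 8)):
    """Map each thread count to the digest of its compressed output."""
    return {n: xz_digest(data, n) for n in counts}


if __name__ == "__main__":
    if shutil.which("xz") is None:
        raise SystemExit("xz not found on PATH")
    payload = bytes(range(256)) * 16384  # 4 MiB of sample input
    for n, digest in digests_by_thread_count(payload).items():
        print(n, digest)
    # Per the man page, -T1 selects the single-threaded compressor,
    # so its digest is expected to differ from the ones above.
    print(1, xz_digest(payload, 1))
```

All multi-threaded counts are expected to print the same digest. Only `-T1`, which selects the single-threaded compressor, should stand apart.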

I hope this helps!

milkylainen commented 9 months ago

Thanks for clearing things up!