dennisvang commented 7 months ago

Bug report

description

Using gzip.compress() with mtime=0 in 3.8<=cpython<=3.10, the OS byte, i.e. the 10th byte in the GZIP header, is set to 255 "unknown" (also see e.g. #83302):

https://github.com/python/cpython/blob/dc0adb44d8d4a33121deaad398f24b5d8ae36d19/Lib/gzip.py#L599

However, in cpython 3.11 and 3.12, the OS byte is suddenly set to a "known" value, e.g. 3 ("Unix") on Ubuntu.

This is not mentioned in the changelog for Python 3.11.

This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

how to reproduce

Here's an example, where byte 10 is \xff in python 3.10 and \x03 in python 3.11:

~ $ python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

~ $ pyenv shell 3.11
~ $ python
Python 3.11.6 (main, Nov 23 2023, 17:30:16) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

cause

I guess this is caused by python 3.11 delegating the gzip.compress() call to zlib if mtime=0, as mentioned in the docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion. Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

and source:

https://github.com/python/cpython/blob/89ddea4886942b0c27a778a0ad3f0d5ac5f518f0/Lib/gzip.py#L609-L612

Apparently zlib does set the OS byte.

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, macOS, Windows

Linked PRs

gh-114116
gh-120480
gh-120486
gh-120563
gh-120612
gh-120613
gh-120614

dennisvang commented 7 months ago

In itself I think it may be a good thing that the OS byte is properly set.

The problem is just that the change is not explicitly documented, as far as I know.

rhpvorderman commented 5 months ago

Hi @dennisvang . This is my fault, as I delegated gzip.compress(mtime=0) to zlib.compress, incorrectly assuming this was the same. The reason is that zlib.compress is faster. But if it leads to behavioral changes, that is not acceptable.

I believe this can easily be remedied by removing the codepath.

rhpvorderman commented 5 months ago

I have made a PR. Just now and put Bugfix in the name. Now I hope it will get attention.

dennisvang commented 5 months ago

@rhpvorderman Thanks for picking this up.

I wonder, if this is the only side-effect, and if the performance gain from using zlib.compress is worth it, perhaps you could just keep delegating to zlib and change byte 10 back to \xff afterwards?

rhpvorderman commented 5 months ago

Well, as mentioned in the PR, keeping two separate code paths caused issues before. It is best to keep one codepath. There is a mention in the documentation about zlib.compress so users who need the performance can use it themselves.

dennisvang commented 5 months ago

@rhpvorderman You're right, that makes sense.

rhpvorderman commented 2 months ago

ping This bug and fix have been lingering for a while.

serhiy-storchaka commented 3 weeks ago

For reference, this feature was added in bpo-43613 (gh-87779). It included more optimizations, the only issue with delegating the whole compression to zlib, when mtime is 0.

The fix looks correct and it still preserves some speed up. An alternate solution could be to call zlib.compress() (even if mtime is not 0) and then patch the result for mtime and the OS byte, but I do not know how reliable is it and whether method is faster.

gpshead commented 3 weeks ago

Is this really a big deal? We won't be able to backport this to 3.11 as that is in security fix only mode.

In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

zlib cannot be presumed to produce canonical output. There are many different zlib implementations.

Decompressing gzipped data that you did not produce and recompressing it without using identical software is already not guaranteed to produce the same compressed output.

From a reproducible build perspective I suggest always patching irrelevant fields such as gzip header mtime and OS fields to constant values as part of the build.

dennisvang commented 3 weeks ago

Is this really a big deal? ...

@gpshead Probably not for most people.

As commented above, I was only hoping for a small note to be added to the changelog (or documentation).

Just in case someone does rely on the OS byte in the gzip header, in whatever context.

zlib cannot be presumed to produce canonical output. There are many different zlib implementations. ... From a reproducible build perspective I suggest always patching irrelevant fields ...

Very true

I just mentioned the reproducible build example to provide some context, although it was an edge case involving files created on the same machine, with exact same zlib implementation, but a different python version.

serhiy-storchaka commented 3 weeks ago

This may be not such big deal, but it is still a problem. mtime=0 is used to produce more reproducible output, but currently it is less reproducible than for non-zero mtime. We can discuss whether it should be backported to 3.11 or even to 3.12, but this is a bug that should be fixed in new releases.

Since mtime=0 is used to produce more reproducible output, we have the following options:

Always set the OS field to "unknown". This is the behavior before 3.11.
Set the OS field to "unknown" if mtime is zero, and to the OS specific value otherwise.
Always set the OS field to the OS specific value.

The current behavior is not included in reasonable options.

There are also two implementation options:

Use zlib.compress() only for the raw data and generate the header and the trailer in Python.
Use zlib.compress() to produce the full data and patch it (the mtime and the OS fields afterward).

This should be decided based on relative timing of these two methods.

For now, #114116 looks like a simple and safe option, but you are welcome to bikeshedding.

rhpvorderman commented 3 weeks ago

This should be decided based on relative timing of these two methods.

The later is definitely quicker as the crc calculation also happens in one go. Using zlib.compress with wbits 31 and then always patching the header for more consistent results and should be a faster default path.

rhpvorderman commented 3 weeks ago

I did some benchmarking and special-casing mtime=0 does not provide much benefit:

./python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): gzip.compress(data, compresslevel=1, mtime=0)'
50 loops, best of 5: 5.45 msec per loop

/python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): zlib.compress(data, level=1)'
50 loops, best of 5: 5.41 msec per loop

The problem is that I did not separately benchmark this codepath at the time, as it seemed to me that doing everything in C is obviously faster than using struct.pack in combination with building new bytes objects in memory. However, DEFLATE compression is apparently so expensive, even on level 1 that this does not matter.

I also made a new PR. That makes zlib always write the trailer and the header is simply replacing parts of the zlib header. The speed is the same, but it simplifies the code a lot, and always guarantees the OS byte being set to 255. There is no separate codepath for mtime=0. The xfl byte is now set by zlib, as it ideally should be. The resulting code should be easier to maintain going forward.

gpshead commented 3 weeks ago

Always set the OS field to "unknown". This is the behavior before 3.11.

This is what I'd call our ideal behavior.

Where we can't do that, documenting that the field may be set to different values in different situations is worthwhile which my draft docs PR does.

I like the look of your #120486 change. We can back-port that to 3.13 as it is fine to make such a change during the beta period.

python / cpython

gzip.compress(..., mtime=0) in cpython 3.11+ unexpectedly sets OS byte in gzip header #112346

Bug report

description

how to reproduce

cause

CPython versions tested on:

Operating systems tested on:

Linked PRs