Open dennisvang opened 7 months ago
In itself I think it may be a good thing that the OS byte is properly set.
The problem is just that the change is not explicitly documented, as far as I know.
Hi @dennisvang. This is my fault, as I delegated gzip.compress(mtime=0) to zlib.compress, incorrectly assuming the output would be the same. The reason is that zlib.compress is faster. But if it leads to behavioral changes, that is not acceptable.
I believe this can easily be remedied by removing the codepath.
I have just made a PR and put Bugfix in the name. I hope it will now get attention.
@rhpvorderman Thanks for picking this up.
I wonder, if this is the only side-effect, and if the performance gain from using zlib.compress is worth it, perhaps you could just keep delegating to zlib and change byte 10 back to \xff afterwards?
Well, as mentioned in the PR, keeping two separate code paths caused issues before. It is best to keep one codepath. There is a mention in the documentation about zlib.compress so users who need the performance can use it themselves.
@rhpvorderman You're right, that makes sense.
ping This bug and fix have been lingering for a while.
For reference, this feature was added in bpo-43613 (gh-87779). It included more optimizations; the only issue is with delegating the whole compression to zlib when mtime is 0.
The fix looks correct and it still preserves some speed up. An alternate solution could be to call zlib.compress() (even if mtime is not 0) and then patch the result for mtime and the OS byte, but I do not know how reliable it is or whether that method is faster.
Is this really a big deal? We won't be able to backport this to 3.11 as that is in security fix only mode.
In our case, hash checking fails after decompressing and re-compressing a gzipped archive.
zlib cannot be presumed to produce canonical output. There are many different zlib implementations.
Decompressing gzipped data that you did not produce and recompressing it without using identical software is already not guaranteed to produce the same compressed output.
From a reproducible build perspective I suggest always patching irrelevant fields such as gzip header mtime and OS fields to constant values as part of the build.
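As a rough illustration of that suggestion, here is a minimal sketch of patching the header of an existing gzip stream to constant values. The function name `normalize_gzip_header` is hypothetical and not part of any library. It relies on the RFC 1952 layout (MTIME in bytes 4 to 7, OS in byte index 9), only touches the first member, and refuses streams that carry a header CRC, since patching would invalidate it.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"
FHCRC = 0x02       # FLG bit: a CRC16 of the header is present
OS_UNKNOWN = 255   # RFC 1952 value for "unknown"


def normalize_gzip_header(blob: bytes, mtime: int = 0, os_byte: int = OS_UNKNOWN) -> bytes:
    """Return a copy of a gzip stream with MTIME and OS set to fixed values.

    Hypothetical helper for illustration; only the first member is patched.
    """
    if len(blob) < 10 or blob[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    if blob[3] & FHCRC:
        raise ValueError("header CRC present; patching would invalidate it")
    patched = bytearray(blob)
    # Bytes 4..7 hold MTIME as a little-endian unsigned 32-bit integer.
    struct.pack_into("<L", patched, 4, mtime)
    # Index 9 (the 10th byte) is the OS field.
    patched[9] = os_byte
    return bytes(patched)


raw = gzip.compress(b"hello world", mtime=1234567)
fixed = normalize_gzip_header(raw)
```

The compressed payload and trailer are left untouched, so this only removes header differences. It does not make the DEFLATE output itself canonical across zlib implementations.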
Is this really a big deal? ...
@gpshead Probably not for most people.
As commented above, I was only hoping for a small note to be added to the changelog (or documentation).
Just in case someone does rely on the OS byte in the gzip header, in whatever context.
zlib cannot be presumed to produce canonical output. There are many different zlib implementations. ... From a reproducible build perspective I suggest always patching irrelevant fields ...
I just mentioned the reproducible build example to provide some context, although it was an edge case involving files created on the same machine, with the exact same zlib implementation but a different Python version.
This may not be such a big deal, but it is still a problem. mtime=0 is used to produce more reproducible output, but currently it is less reproducible than for non-zero mtime. We can discuss whether it should be backported to 3.11 or even to 3.12, but this is a bug that should be fixed in new releases.
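Since mtime=0 is used to produce more reproducible output, we have the following options:
- Always set the OS field to "unknown". This is the behavior before 3.11.
- Set the OS field to "unknown" if mtime is zero, and to the OS specific value otherwise.
The current behavior is not included in reasonable options.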
There are also two implementation options:
- Use zlib.compress() only for the raw data and generate the header and the trailer in Python.
- Use zlib.compress() to produce the full data and patch it (the mtime and the OS fields) afterward.
This should be decided based on relative timing of these two methods.
For now, #114116 looks like a simple and safe option, but you are welcome to bikeshedding.
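For readers following along, the first implementation option (raw data from zlib, header and trailer built in Python) could look roughly like the sketch below. This is not the code from any of the PRs. The function name `gzip_compress_python_header` is made up for illustration, and the XFL mapping follows RFC 1952 rather than any particular CPython version.

```python
import gzip
import struct
import zlib


def gzip_compress_python_header(data: bytes, level: int = 9, mtime: int = 0) -> bytes:
    """Illustrative only: raw DEFLATE from zlib, gzip framing built in Python."""
    # Negative wbits requests a raw DEFLATE stream without zlib/gzip wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(data) + compressor.flush()
    # XFL per RFC 1952: 2 = maximum compression, 4 = fastest, 0 otherwise.
    xfl = 2 if level == 9 else 4 if level == 1 else 0
    # magic (2 bytes), CM=8 (deflate), FLG=0, MTIME, XFL, OS=255 ("unknown")
    header = struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, 0, mtime, xfl, 255)
    # Trailer: CRC32 of the uncompressed data, then its length modulo 2**32.
    trailer = struct.pack("<LL", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer


payload = b"abc" * 100
framed = gzip_compress_python_header(payload)
```

Here the CRC32 is computed in a separate pass over the data, which is the extra work the second option avoids.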
This should be decided based on relative timing of these two methods.
The latter is definitely quicker, as the CRC calculation also happens in one go. Using zlib.compress with wbits 31 and then always patching the header gives more consistent results and should be a faster default path.
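A minimal sketch of that approach, under a few assumptions: the function name `gzip_compress_patched` is hypothetical, and it uses zlib.compressobj with wbits=31 instead of zlib.compress, so that it also runs on Python versions where zlib.compress does not accept a wbits argument. It assumes zlib writes its default 10-byte gzip header with no optional fields.

```python
import gzip
import struct
import zlib


def gzip_compress_patched(data: bytes, level: int = 9, mtime: int = 0) -> bytes:
    """Illustrative only: let zlib write the whole gzip member, then patch it."""
    # wbits=31 (16 + 15) makes zlib emit a gzip header and CRC32/length trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    out = bytearray(compressor.compress(data) + compressor.flush())
    # Overwrite MTIME (bytes 4..7, little-endian) and the OS field (index 9).
    struct.pack_into("<L", out, 4, mtime)
    out[9] = 255  # "unknown", regardless of what zlib wrote
    return bytes(out)


payload = b"abc" * 100
patched = gzip_compress_patched(payload, mtime=1700000000)
```

The XFL byte is left as zlib set it. Only the two fields discussed in this thread are overwritten.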
I did some benchmarking and special-casing mtime=0 does not provide much benefit:
./python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): gzip.compress(data, compresslevel=1, mtime=0)'
50 loops, best of 5: 5.45 msec per loop
./python -m timeit -s 'import gzip; import zlib; data=b"Some arbitrary small data of reasonable size to be in-memory compressed. JSON API responses, tweets, stuff like that. Gewoon wat uit de nek kletsen tot er een redelijke hoeveelheid data is. Dat gaat makkelijker in mijn moedertaal uiteraard."' 'for i in range(100): zlib.compress(data, level=1)'
50 loops, best of 5: 5.41 msec per loop
The problem is that I did not separately benchmark this codepath at the time, as it seemed to me that doing everything in C is obviously faster than using struct.pack in combination with building new bytes objects in memory. However, DEFLATE compression is apparently so expensive, even on level 1, that this does not matter.
I also made a new PR. It makes zlib always write the trailer, and the header is produced by simply replacing parts of the zlib header. The speed is the same, but it simplifies the code a lot, and always guarantees the OS byte being set to 255. There is no separate codepath for mtime=0. The xfl byte is now set by zlib, as it ideally should be. The resulting code should be easier to maintain going forward.
Always set the OS field to "unknown". This is the behavior before 3.11.
This is what I'd call our ideal behavior.
Where we can't do that, documenting that the field may be set to different values in different situations is worthwhile, which is what my draft docs PR does.
I like the look of your #120486 change. We can back-port that to 3.13 as it is fine to make such a change during the beta period.
Bug report
description
Using gzip.compress() with mtime=0 in 3.8<=cpython<=3.10, the OS byte, i.e. the 10th byte in the GZIP header, is set to 255 "unknown" (also see e.g. #83302):
https://github.com/python/cpython/blob/dc0adb44d8d4a33121deaad398f24b5d8ae36d19/Lib/gzip.py#L599
However, in cpython 3.11 and 3.12, the OS byte is suddenly set to a "known" value, e.g. 3 ("Unix") on Ubuntu.
This is not mentioned in the changelog for Python 3.11.
This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.
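To check which value a given interpreter writes, something along these lines can be used. This is a sketch, not the original reproduction from the report: `gzip_os_byte` and `OS_NAMES` are names made up here, and the table is only a subset of the OS codes listed in RFC 1952.

```python
import gzip

# Subset of the OS codes defined in RFC 1952.
OS_NAMES = {0: "FAT", 3: "Unix", 11: "NTFS", 255: "unknown"}


def gzip_os_byte(blob: bytes) -> int:
    """Return the OS field (10th byte, index 9) of a gzip header."""
    if len(blob) < 10 or blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return blob[9]


os_byte = gzip_os_byte(gzip.compress(b"example", mtime=0))
# The printed value depends on the Python version and platform.
print(os_byte, OS_NAMES.get(os_byte, "other"))
```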
how to reproduce
Here's an example, where byte 10 is \xff in python 3.10 and \x03 in python 3.11:
cause
I guess this is caused by python 3.11 delegating the gzip.compress() call to zlib if mtime=0, as mentioned in the docs:
and source:
https://github.com/python/cpython/blob/89ddea4886942b0c27a778a0ad3f0d5ac5f518f0/Lib/gzip.py#L609-L612
Apparently zlib does set the OS byte.
CPython versions tested on:
3.8, 3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, macOS, Windows
Linked PRs