python / cpython

Allow setting timestamp in gzip-compressed tarfiles #75707

Open 26146b4a-64b3-41be-a7ce-8c1c38641ca8 opened 7 years ago

26146b4a-64b3-41be-a7ce-8c1c38641ca8 commented 7 years ago
BPO 31526
Nosy @jonashaag, @vadmium, @randombit, @madebr, @FFY00

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['3.7', 'type-feature', 'library']
title = 'Allow setting timestamp in gzip-compressed tarfiles'
updated_at =
user = 'https://github.com/randombit'
```

bugs.python.org fields:

```python
activity =
actor = 'FFY00'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'randombit'
dependencies = []
files = []
hgrepos = []
issue_num = 31526
keywords = []
message_count = 5.0
messages = ['302590', '305915', '306065', '375849', '375850']
nosy_count = 5.0
nosy_names = ['jonash', 'martin.panter', 'randombit', 'maarten', 'FFY00']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue31526'
versions = ['Python 3.7']
```

26146b4a-64b3-41be-a7ce-8c1c38641ca8 commented 7 years ago

Context: I have a script which checks out a software release (a tagged git revision) and builds an archive to distribute to end users. One goal of this script is that the archive is reproducible, i.e., if the script is run twice (at different times, on different machines, by different people) it produces bit-for-bit identical output, and thus also has the same SHA-256 hash.

Mostly this works great, using the TarInfo feature of tarfile.py to set the uid/gid/mtime to fixed values. Except I also want to compress the archive, and tarfile calls time.time() to find out the timestamp that will be embedded in the gzip header. This breaks my carefully deterministic output.
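For readers unfamiliar with the TarInfo approach mentioned above, here is a minimal sketch of what pinning the metadata might look like for an uncompressed archive. The names `normalize` and `build_tar` are hypothetical helpers written for illustration, not part of tarfile or of the script discussed in this issue.

```python
import io
import tarfile


def normalize(info):
    """Hypothetical filter: pin the metadata that varies between machines."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    return info


def build_tar(members):
    """Build an uncompressed tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Sort the names so the member order does not depend on dict order.
        for name in sorted(members):
            info = normalize(tarfile.TarInfo(name))
            info.size = len(members[name])
            tar.addfile(info, io.BytesIO(members[name]))
    return buf.getvalue()
```

When archiving files from disk instead, the same function can be passed as `tar.add(path, filter=normalize)`. This only makes the tar layer deterministic; the gzip header timestamp described above is a separate matter.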

I would like it if tarfile accepted an additional keyword that allowed overriding the time value for the gzip header. As it is I just hack around it with:

import time

def null_time():
    return 0
# Monkeypatch the clock so tarfile writes 0 into the gzip header.
time.time = null_time

which does work but is also horrible.

Alternatively, tarfile could just always set the timestamp header to 0 and avoid having its output depend on the current clock. I doubt anyone would notice.

The script in question is here: https://github.com/randombit/botan/blob/master/src/scripts/dist.py

My script uses Python 2 for various reasons, but it seems the same problem affects even the tarfile.py in the latest Python 3. I would be willing to try writing a patch for this, if anything along these lines might be accepted.

Thanks.

ef46bb13-8e88-4488-a20f-75e542f6f274 commented 6 years ago

This affects me too.

vadmium commented 6 years ago

Perhaps you can compress the tar file using the “gzip.GzipFile” class. It accepts a custom “mtime” parameter (see bpo-4272, added in 2.7 and 3.1+):

>>> import tarfile
>>> from gzip import GzipFile
>>> from io import BytesIO
>>> gztar = BytesIO()
>>> tar = GzipFile(fileobj=gztar, mode="w", mtime=0)
>>> tarfile.open(fileobj=tar, mode="w|").close()
>>> tar.close()
>>> gztar.getvalue().hex()
'1f8b08000000000002ffedc1010d000000c2a0f74f6d0e37a00000000000000000008037039ade1d2700280000'

However, “tarfile.open” accepts a “compresslevel” argument for two of the compressors, so you could argue it is okay to add another argument to pass to the gzip compressor.
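The GzipFile workaround can be wrapped into a small helper. The following is a sketch only: `build_tar_gz` is a hypothetical name, and it assumes the archive members are already available in memory as a `{name: bytes}` mapping. Passing `filename=""` is a precaution so that no file name ends up in the gzip header.

```python
import gzip
import io
import tarfile


def build_tar_gz(members, mtime=0):
    """Return a gzip-compressed tar whose gzip header carries a fixed mtime."""
    raw = io.BytesIO()
    # GzipFile accepts mtime directly, so time.time() is never consulted.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=mtime) as gz:
        # "w|" writes an uncompressed tar stream into the gzip wrapper.
        with tarfile.open(fileobj=gz, mode="w|") as tar:
            for name in sorted(members):
                info = tarfile.TarInfo(name)  # mtime, uid and gid default to 0
                info.size = len(members[name])
                tar.addfile(info, io.BytesIO(members[name]))
    return raw.getvalue()
```

Two calls with the same input should then produce identical bytes on the same Python and zlib build, and the result can still be read back with `tarfile.open(..., mode="r:gz")`.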

b89a0587-9429-4eb4-bd56-1bb69925b367 commented 4 years ago

I have the same issue. The timestamp is inserted here: https://github.com/python/cpython/blob/802726acf6048338394a6a4750835c2cdd6a947b/Lib/tarfile.py#L419-L420

Because I noticed the timestamp was not included in the timestamp, I could zero it by doing:

# Bytes 4-7 of the gzip header hold the MTIME field; overwrite them in place.
with open(gzipped_tarball, "r+b") as f:
    f.seek(4, 0)
    f.write(b"\x00\x00\x00\x00")

b89a0587-9429-4eb4-bd56-1bb69925b367 commented 4 years ago

My previous comment should have contained:

Because I noticed the timestamp was not included in the CRC, ...
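A slightly more defensive version of this post-processing step is sketched below. It relies on the gzip header layout from RFC 1952 (two magic bytes, the compression method, the flags, then a 4-byte little-endian MTIME) and on the fact that the trailing CRC-32 covers only the uncompressed data. The helper names `read_gzip_mtime` and `zero_gzip_mtime` are hypothetical.

```python
import gzip
import os
import struct
import tempfile

GZIP_MAGIC = b"\x1f\x8b"
MTIME_OFFSET = 4  # ID1, ID2, CM, FLG come first, then the 4-byte MTIME


def read_gzip_mtime(path):
    """Return the MTIME field stored in the gzip header of `path`."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip file")
    return struct.unpack("<I", header[MTIME_OFFSET:MTIME_OFFSET + 4])[0]


def zero_gzip_mtime(path):
    """Overwrite the MTIME field in place, refusing non-gzip input."""
    with open(path, "r+b") as f:
        if f.read(2) != GZIP_MAGIC:
            raise ValueError("not a gzip file")
        f.seek(MTIME_OFFSET)
        f.write(b"\x00\x00\x00\x00")
```

Since the header is not checksummed here, the patched file should still decompress normally.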

ncoghlan commented 1 month ago

Also see #120036

(although the simplest current resolution to this reproducibility issue is to use a wrapper compression format other than gzip that doesn't add an extra timestamp in the compression header, such as xz)
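The xz suggestion can be sketched as follows, assuming the `lzma` module is available in the Python build and that the members are held in memory. `build_tar_xz` is a hypothetical helper name; the output may still differ between liblzma versions or compression presets.

```python
import io
import tarfile


def build_tar_xz(members):
    """Return an xz-compressed tar built from a {name: bytes} mapping."""
    buf = io.BytesIO()
    # "w:xz" selects the xz container via tarfile's lzma support.
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name in sorted(members):
            info = tarfile.TarInfo(name)  # mtime, uid and gid default to 0
            info.size = len(members[name])
            tar.addfile(info, io.BytesIO(members[name]))
    return buf.getvalue()
```

With fixed member metadata, repeated calls on the same machine should yield byte-identical archives without any change to the clock or any patching of the output.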