ostreedev / ostree

Operating system and container binary deployment and upgrades
https://ostreedev.github.io/ostree/
Other
1.24k stars 291 forks source link

ostree archive's mtime modification results in slower python execution #1469

Open euank opened 6 years ago

euank commented 6 years ago

System information

$ cat /etc/os-release 
NAME=Fedora
VERSION="27 (Atomic Host)"
ID=fedora
VERSION_ID=27
PRETTY_NAME="Fedora 27 (Atomic Host)"
VARIANT="Atomic Host"
VARIANT_ID=atomic.host

$ curl 169.254.169.254/latest/meta-data/instance-type; echo; curl 169.254.169.254/latest/meta-data/ami-id; echo
t2.micro
ami-4ef37336

High level problem

At a high level, the problem is that atomic and other python programs are slow by default.

$ time atomic --help >/dev/null

real    0m1.268s
user    0m1.184s
sys 0m0.059s

$ sudo ostree admin unlock

$ time sudo atomic --help > /dev/null

real    0m1.402s
user    0m1.202s
sys 0m0.136s
$ time sudo atomic --help > /dev/null

real    0m0.453s
user    0m0.375s
sys 0m0.055s

As we can clearly see, the atomic cli has a performance gain of 3x to 4x simply by letting it rewrite the pyc files. This performance issue will likely be shared by most other python software.

Why does this happen?

ostree sets a file's timestamps to 0:

https://github.com/ostreedev/ostree/blob/6bf4b3e1d8ae0173f3f97a26990a6a6341c11049/src/libostree/ostree-repo-libarchive.c#L986-L989

Unfortunately, this results in some files being considered incorrect. For example, a .pyc file, per this blog, includes information about the 'mtime' of the .py file it was sourced from.

Since ostree modifies that timestamp, the .pyc file created during the rpmbuild becomes invalid and is ignored.

We can verify this is the problem by using perf to notice where time is spent, or using strace to notice which files are changed that result in the performance speedup, e.g.:


$ sudo strace -e trace=open,openat atomic --help 2>&1 | cat | grep -Eo '/usr/lib[^"]+' | grep -E "\.py$" | wc -l
455

$ # ostree unlock and sudo atomic, as above

$ sudo strace -e trace=open,openat atomic --help 2>&1 | cat | grep -Eo '/usr/lib[^"]+' | grep -E "\.py$" | wc -l
0

$ find /var/tmp/ostree-unlock-ovl.XCJGFZ/upper/ -name "*.pyc" | wc -l
455

Additional considerations

Note that python is not unique in using mtime to store information about whether a given cache file is up to date. I don't know off-hand of other software that's definitely impacted by this choice. ~, but I wouldn't be surprised if the go toolchain had a similar problem~ (edit, nope, go's pkg directory on usr seems to be handled fine)

cgwalters commented 6 years ago

See https://github.com/ostreedev/ostree/commit/fe5ed36461a52b38cb366162bfd222ad3b326bf2 and https://mail.gnome.org/archives/ostree-list/2015-February/msg00008.html and https://github.com/flatpak/flatpak-builder/blob/d329cef84920f9ae9839bdc257710ea0e2104108/src/builder-post-process.c#L38

cgwalters commented 6 years ago

I think the ideal fix here is to change RPM to canonicalize those timestamps to 0 as well, basically bringing the libostree model there.

Alternatively, cPython could learn to stop doing timestamp checks on read-only filesystems in general, or perhaps just /usr if it's read-only.

cgwalters commented 6 years ago

I suspect in practice a bigger win though would be teaching cPython to generate a larger cache file that could be easily mmap()ed like fontconfig and ldconfig etc. do.

nanonyme commented 5 years ago

Seems like related to https://gitlab.com/freedesktop-sdk/freedesktop-sdk/issues/554 (although not identical root cause) The solution is to set SOURCE_DATE_EPOCH for all Python packages so we get invalidation method CHECKED_HASH. This works just fine with ostree since it doesn't care about timestamps. It works out of the box with Python 3.7.0 and higher.

agners commented 5 years ago

I currently looking into a related issue in meta-updater, which creates an OSTree rootfs using OpenEmbedded: https://github.com/advancedtelematic/meta-updater/issues/461

OpenEmbedded has reproducible build support. Setting the SOURCE_DATE_EPOCH to 0 seems to work. However, that caused the build to fail for a Python recipes which tried to pack files into a egg file. It seems that zip only supports timestamps newer than 1980. I think for the OpenEmbedded case it would be good if we can choose what timestamp OSTree uses, so we could choose something more recent.

hroncok commented 3 months ago

I've been debugging a related RHEL 9 issue and I found the following.

On Fedora Silverblue 40 Beta, some of the .pyc files have zeroed mtimes embedded in them.

For example:

>>> pyc = '/usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.pyc'
>>> def py_demarshal_long(b):
...   return (b[0] + (b[1] << 8) + (b[2] << 16) + (b[3] << 24))
... 
>>> with open(pyc, 'rb') as f:
...   chunk = f.read(12)
... 
>>> py_demarshal_long(chunk[8:12])
0

(The code above returns 1689897600 on my non-Silverblue Fedora 39 system.)

This is good because the /usr/lib/python3.12/site-packages/click/__init__.py mtime is zero, so this bytecode cache is valid.

However, apparently, only some of the .pyc files installed in Fedora Silverblue have been altered this way. For example, if I run rpm --verify on the package owning the aforementioned file, I get:

test@fedora:~$ rpm --verify python3-click
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/INSTALLER
.......T.  l /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/LICENSE.rst
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/METADATA
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/WHEEL
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/top_level.txt
.......T.    /usr/lib/python3.12/site-packages/click/__init__.py
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_compat.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/_compat.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_termui_impl.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_termui_impl.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_textwrap.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_textwrap.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_winconsole.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_winconsole.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/core.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/core.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/decorators.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/decorators.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/exceptions.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/exceptions.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/formatting.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/formatting.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/globals.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/globals.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/parser.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/parser.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/shell_completion.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/shell_completion.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/termui.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/termui.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/types.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/types.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/utils.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/utils.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/_compat.py
.......T.    /usr/lib/python3.12/site-packages/click/_termui_impl.py
.......T.    /usr/lib/python3.12/site-packages/click/_textwrap.py
.......T.    /usr/lib/python3.12/site-packages/click/_winconsole.py
.......T.    /usr/lib/python3.12/site-packages/click/core.py
.......T.    /usr/lib/python3.12/site-packages/click/decorators.py
.......T.    /usr/lib/python3.12/site-packages/click/exceptions.py
.......T.    /usr/lib/python3.12/site-packages/click/formatting.py
.......T.    /usr/lib/python3.12/site-packages/click/globals.py
.......T.    /usr/lib/python3.12/site-packages/click/parser.py
.......T.    /usr/lib/python3.12/site-packages/click/py.typed
.......T.    /usr/lib/python3.12/site-packages/click/shell_completion.py
.......T.    /usr/lib/python3.12/site-packages/click/termui.py
.......T.    /usr/lib/python3.12/site-packages/click/testing.py
.......T.    /usr/lib/python3.12/site-packages/click/types.py
.......T.    /usr/lib/python3.12/site-packages/click/utils.py
.......T.  d /usr/share/doc/python3-click/CHANGES.rst
.......T.  d /usr/share/doc/python3-click/README.rst
.......T.  l /usr/share/licenses/python3-click/LICENSE.rst

The T in the output means mtime does not match RPM database (well, it is zero, so I expected this). The 5 in the output means the content differs. Presumably, the .pyc files were regenerated by Python at some point when the filesystem was not read-only but the mtimes of .py files were already zero. Interestingly, only some .pyc files are "corrected", others are not.

See:

test@fedora:~$ python3 -v -c 'import click.testing' 2>&1 | grep '# bytecode is stale for' -A2
# bytecode is stale for 'click.testing'
# code object from /usr/lib/python3.12/site-packages/click/testing.py
# could not create '/usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.pyc': OSError(30, 'Read-only file system')

This is even broken for the standard library:

test@fedora:~$ python3 -v -c 'import mailbox' 2>&1 | grep '# bytecode is stale for' -A2
# bytecode is stale for 'mailbox'
# code object from /usr/lib64/python3.12/mailbox.py
# could not create '/usr/lib64/python3.12/__pycache__/mailbox.cpython-312.pyc': OSError(30, 'Read-only file system')
--
# bytecode is stale for 'urllib'
# code object from /usr/lib64/python3.12/urllib/__init__.py
# could not create '/usr/lib64/python3.12/urllib/__pycache__/__init__.cpython-312.pyc': OSError(30, 'Read-only file system')
--
# bytecode is stale for 'email.generator'
# code object from /usr/lib64/python3.12/email/generator.py
# could not create '/usr/lib64/python3.12/email/__pycache__/generator.cpython-312.pyc': OSError(30, 'Read-only file system')

This seems like a design flaw of the mtime zeroing thing.


When a Python app is running in a restricted SELinux context, any attempt to rewrite a .pyc file is reported, For example, rhsmcertd (from subscription-manager) on RHEL:

[12474.053718] audit: type=1400 audit(1712618630.878:6): avc:  denied  { write } for  pid=2924 comm="rhsmcertd-worke" name="__pycache__" dev="overlay" ino=17360017 scontext=system_u:system_r:rhsmcertd_t:s0 tcontext=system_u:object_r:bin_t:s0 tclass=dir permissive=0
dbnicholson commented 3 months ago

FWIW, ostree proper doesn't do anything with python modules besides changing the modification time like it does for all files. Recompiling must be happening at a different layer in Fedora.

cgwalters commented 3 months ago

A few things here:

At some point, we will switch to entirely relying on composefs which gives us all the advantages of an ostree-like model in terms of on-disk/page cache sharing, without the mtime canonicalization need to ensure hardlinking works.

Today with e.g. podman (without composefs) content-identical files in distinct layers are not shared on disk or memory, partially because of this issue.

Interestingly, only some .pyc files are "corrected", others are not Presumably, the .pyc files were regenerated by Python at some point when the filesystem was not read-only but the mtimes of .py files were already zero.

Hmm. That is odd. I was going to say I wasn't sure how that would be possible, but actually...on the build system, in some cases we may materialize cached filesystem trees using zeroed mtime, and if the Python code happens to be executed at build time, I think that could result in this. Python code not executed at build wouldn't rewrite its mtimes. That would explain some of the mystery here if true.

ericcurtin commented 3 months ago

I've been using composefs with ostree in CentOS Automotive Stream Distribution on a couple of devices the last few weeks and it's been working great FWIW

cgwalters commented 3 months ago

I've been using composefs with ostree in CentOS Automotive Stream Distribution on a couple of devices the last few weeks and it's been working great FWIW

But that's still running rpm-ostree which canonicalizes all mtimes to zero, right? There's a much bigger step to take in this project where we have a mode where we hard require composefs - and actually note that ostree very intentionally doesn't include mtimes in its metadata, so moving towards a model where they are included again would lead more closely to including e.g. OCI container or other metadata as source of truth.

(All of this of course heavily intersects with https://reproducible-builds.org/ - it still doesn't make sense to let mtimes just randomly "float" in images as it will cause spurious changes, even though that's what happens with most image builds (including podman/docker) by default. At least nowdays with RPM one can canonicalize by setting SOURCE_DATE_EPOCH, but that only helps installs that use it, and it's sadly all still defeated because of https://github.com/rpm-software-management/rpm/issues/2219 )

nanonyme commented 3 months ago

This is by the way why freedesktop-sdk and all derivative runtimes use hash-based bytecode, not timestamp-based bytecode. (effectively that SOURC_DATE_EPOCH approach) It works correctly with ostree.

nanonyme commented 3 months ago

Notably semantically correct use of SOURCE_DATE_EPOCH is not to set it to 0 but the timestamp when your sources have last been modified.

frenzymadness commented 3 months ago

This is by the way why freedesktop-sdk and all derivative runtimes use hash-based bytecode, not timestamp-based bytecode. (effectively that SOURC_DATE_EPOCH approach) It works correctly with ostree.

We produce the same RPM content also for standard distributions where timestamp-based invalidation of bytecode cache files is faster.

hroncok commented 3 months ago

Solution ideas:

A. Fedora Python RPM packages can switch from timestamp-based to hash-based invalidation

B. Python timestamp-based invalidation can special case mtime==0

C. ostree can regenerate all .pyc files when building

D. ostree can patch all .pyc files when building

[^1]: Except of course it will be faster than the current Silverblue, but only because this issue. Compared to other solutions presented later, it will still be slower.

travier commented 3 months ago

Thanks for the detailed options! Could option C be implemented by running python -m compileall (https://docs.python.org/3/library/compileall.html) on all .py files in /usr?

How would we do option D?

hroncok commented 3 months ago

Thanks for the detailed options! Could option C be implemented by running python -m compileall (https://docs.python.org/3/library/compileall.html) on all .py files in /usr?

You don't want to run it on all .py files, only those that already have their pyc files. There are .py files in /usr without .pyc files and we want to keep them that way. But otherwise, yes, this is correct.

  1. find all .pyc files we can understand (by path)
  2. find all corresponding sources and note down all their opt-levels we found in (1)
  3. run compileall --hardlink-dupes ... on all files from (2) grouped by opt-level combinations

How would we do option D?

  1. find all .pyc files we can understand (by path)
  2. open each such file and zero some bytes (the thrid 32bit word) if it is a timebased-invalidation .pyc (if the second 32bit word is 0)

If we edit the files in place and we never write anything if the third word is already zero, the hardlinks mentioned earlier should be preserved.


The listed implementations of C a D assume we only care for Python bytecode for the "main" Python version (e.g. 3.12 in Fedora 40). If it needs to support all Pythons available in Fedora, it might be a bit more complex.

travier commented 3 months ago

There are .py files in /usr without .pyc files and we want to keep them that way. But otherwise, yes, this is correct.

Could you explain why we should not do that? Won't those Python files not be optimized as well?

hroncok commented 3 months ago

Only some .py files are importable modules. See https://fedoraproject.org/wiki/Changes/No_more_automagic_Python_bytecompilation#Detailed_Description on why we don't blindly bytecompile them all.

travier commented 3 months ago
  • We (Fedora Python maintainers) have no idea where to plug this in.

For Fedora Atomic Desktops, we can plug this at the end of https://pagure.io/workstation-ostree-config/blob/main/f/fedora-common-ostree.yaml#_104, which is basically a script with full RW access to the system during the build of the ostree commit (image of the system).

hroncok commented 3 months ago

For Fedora Atomic Desktops, we can plug this at the end of https://pagure.io/workstation-ostree-config/blob/main/f/fedora-common-ostree.yaml#_104, which is basically a script with full RW access to the system during the build of the ostree commit (image of the system).

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

travier commented 3 months ago

Let's try to find the rough list of commands to do option C & D:

Option C:

  1. find all .pyc files we can understand (by path)
  2. find all corresponding sources and note down all their opt-levels we found in (1)
  3. run compileall --hardlink-dupes ... on all files from (2) grouped by opt-level combinations
  1. find /usr -iname "*.pyc" ? How do we filter files that can "understand"? What does this means?
  2. I found https://wiki.gentoo.org/wiki/Project:Python/Byte_compiling. Not sure how to do that
  3. Do we only pass the corresponding .py files to this command?

Option D:

  1. find all .pyc files we can understand (by path)
  2. open each such file and zero some bytes (the thrid 32bit word) if it is a timebased-invalidation .pyc (if the second 32bit word is 0)
  1. Same question as above
  2. Could we make a small (python?) script that do it?

Given the above, it feels like option D would be easier to do.

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

Good point, I don't think they are, I'll give it a try.

travier commented 3 months ago

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

I've confirmed that they don't have a 0 mtimes yet as they are not part of the ostree commit yet at this point. So option D is likely the preferred route.

hroncok commented 3 months ago

find all .pyc files we can understand (by path)

all paths matching the following path regexes:

  1. .*/__pycache__/[^/]+\.cpython-312\.pyc$
  2. .*/__pycache__/[^/]+\.cpython-312\.opt-1\.pyc$
  3. .*/__pycache__/[^/]+\.cpython-312\.opt-2\.pyc$

For option C, we would need to find them separately. For option D, we can find them all at once.

For option C, we need to find the source. To do that, we replace the .cpython-312.opt-1.pyc or a similar suffix with .py and walk one directory up (i.e. ..).

I found https://wiki.gentoo.org/wiki/Project:Python/Byte_compiling. Not sure how to do that

Assume we group the sources to the following buckets:

  1. sources paths with .pyc files for levels 0, 1, 2
  2. sources paths with .pyc files for levels 0 and 1
  3. sources paths with .pyc files for level 0 only

(Other combinations should not exist in Fedora, but if we are paranoid, we need to check.)

Now we call:

python3 -m compileall [-j8] -o 0 -o 1 -o 2 -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE FIRST BUCKET>
python3 -m compileall [-j8] -o 0 -o 1      -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE SECOND BUCKET>
python3 -m compileall [-j8] -o 0           -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE LAST BUCKET>

The -j8 option makes it run in parallel. It's optional.


Could we make a small (python?) script that do it?

Absolutely.

hroncok commented 3 months ago
MIN_MAGIC = 3390  # The first magic number supporting PEP 552
ZERO = bytes((0, 0, 0, 0))

def pyc_set_zero_mtime(pyc_path):
    with open(pyc_path, "r+b") as f:
        w = f.read(4)
        if len(w) < 4:
            return

        magic = (w[0] + (w[1] << 8) + (w[2] << 16) + (w[3] << 24)) & 0xFFFF
        if magic < MIN_MAGIC:
            invalidation = ZERO
        else:
            invalidation = f.read(4)
            if len(invalidation) < 4:
                return

        if invalidation == ZERO:
            f.write(ZERO)

Here's a Python function that implements D for one given pyc path. It silently skips invalid (short) pyc files or pyc files with hash-based invalidation modes. It supports pyc files for Python 3.4+

dbnicholson commented 3 months ago

We've been compiling all python modules at ostree build time on Endless for years. But then Debian doesn't include the compiled modules in the packages. Note the hard linking at compile time is mostly pointless here since ostree will already deduplicate the objects when committing.

cgwalters commented 3 months ago

I would still say at a high level there's a conceptual clash between image based updates and Python in this respect. (And in general, anything that is embedding mtimes in files and interpreting that at runtime)

If you think about it...it makes no sense to ship around timestamps that happen to correspond to when a package was built and write it to everyone's disk. In a more ideal world, a file could just not have an modification time at all, i.e. it'd be an Option<Time> in Rust terminology. For files in e.g. /proc the Linux kernel just has the files be the system uptime...which is obviously fine, but also not really useful data to replicate.

(Somewhat more arguably it also doesn't make a lot of sense to have dpkg/rpm write an initial modification time client side either, but...since the files are just sitting there locally mutable and often are mutated by other things (often incorrectly)...)

The other thing here is that the timestamp thing generally doesn't hurt too much for people who are shipping around disk images because they already have to take the 2x storage hit in any general case - trying to deduplicate across disk images leads to insane hacks.

Whereas, ostree is a bit more unique in being a file based system so it's easy for us to dedup.

But yes, longer term, composefs.

nanonyme commented 3 months ago

flatpak-builder contains bytecode timestamp patching code in C, we used it before switching to hash-based

nanonyme commented 3 months ago

@cgwalters it's conceptually not build timestamp. It's .py file modification timestamp. Coincidentally this ends up being build timestamp in RPM but this is coincidence only and not they purpose of the timestamp.

travier commented 3 months ago

I've adapted your script to include finding the files:

#!/usr/bin/env python3

import os
import re

MIN_MAGIC = 3390  # The first magic number supporting PEP 552
ZERO = bytes((0, 0, 0, 0))

def pyc_set_zero_mtime(pyc_path):
    with open(pyc_path, "r+b") as f:
        w = f.read(4)
        if len(w) < 4:
            return

        magic = (w[0] + (w[1] << 8) + (w[2] << 16) + (w[3] << 24)) & 0xFFFF
        if magic < MIN_MAGIC:
            invalidation = ZERO
        else:
            invalidation = f.read(4)
            if len(invalidation) < 4:
                return

        if invalidation == ZERO:
            f.write(ZERO)

if __name__ == "__main__":
    regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

    for root, dirs, files in os.walk("/usr"):
        for file in files:
            if regex.match(file):
                print(file)
                # pyc_set_zero_mtime(file)

But it fails on:

./test.py:27: SyntaxWarning: invalid escape sequence '\.'
  regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

Investigating but someone with more Python experience might find the answer faster than me :).

hroncok commented 3 months ago
-    regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")
+    regex = re.compile(r".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")
hroncok commented 3 months ago

Also, file is only the filename, not the path, so the regex won't ever match.

hroncok commented 3 months ago
if __name__ == "__main__":
    regex = re.compile(r".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

    for root, dirs, files in os.walk("/usr"):
        for file in files:
            path = os.path.join(root, file)
            if regex.match(path):
                print(path)
                # pyc_set_zero_mtime(path)
travier commented 3 months ago

Looks better now indeed. Thanks

travier commented 3 months ago

PR to add the workaround for Fedora Atomic Desktops: https://pagure.io/workstation-ostree-config/pull-request/505

nanonyme commented 3 months ago

Something also to keep in mind with long-term solutions Fedora might move to checked hash as part of https://fedoraproject.org/wiki/Changes/ReproduciblePackageBuilds. The performance hit is quite negligible price for reproducible bytecode.

hroncok commented 3 months ago

That's unlikely. We use the timestamp invalidation mode and we clamp mtimes to SOURCE_DATE_EPOCH (or whatever is that called).

cgwalters commented 2 months ago

OK so there's basically an impedance mismatch between the current container -> ostree flow. In the more medium term what I'm thinking is that we will move away from the existing ostree-container model implemented in https://github.com/ostreedev/ostree-rs-ext/ to one where the canonical storage is composefs backed. There's a lot going on on this topic, but I started on https://github.com/containers/composefs/pull/286 related to this, which would be a potential new composers-native storage for both podman and bootc/ostree.

cgwalters commented 1 month ago

xref https://github.com/keszybz/add-determinism/ which is a tool which forcibly canonicalizes python timestamps; it'd probably be good to glom onto that.

keszybz commented 2 weeks ago

At least nowdays with RPM one can canonicalize by setting SOURCE_DATE_EPOCH, but that only helps installs that use it

Well, "installs that use it" are … all installs? Fedora/Suse/anybody-else-who-cares-about-reproducible-rpm-builds set source_date_epoch_from_changelog 1. So mtimes don't "randomly float" — they are set to either the mtimes in the sources (when older than the packaging update) or clamped to the packaging timestamp. IMO this is very good outcome. (The last rebuild to test reproducibility indicates that that are some packages/packaging workflows which mess up mtimes, but those are already rather rare, and usually easily fixed.)


I came here from https://github.com/keszybz/add-determinism/pull/26.

Generally, it looks like something that can be added to add-det without too much trouble. In particular, add-det has logic to preserve hard links, which came up in the discussion. (Even if the answer is that it doesn't matter for rpm-ostree, I think it's nicer to preserve them, not least because rpm -V doesn't become unhappy.)

So let's say that we implement add-det --clear-pyc-mtimes. I'm not clear on what add-det is supposed to do in that case: just clear bytes 8..12 if mtime==0 on the file, or clear bytes 8..12 but ignore the mtime, or clear those bytes and also run the full normalization procedure. It should be pretty trivial to implement any of those modes, but I'm not sure what is wanted…

travier commented 2 weeks ago

This comes from https://gitlab.com/fedora/bootc/tracker/-/issues/3 and the current workaround code is in https://pagure.io/workstation-ostree-config/pull-request/505.

keszybz commented 2 weeks ago

OK, so after looking at the script in 505, it seems that add-det should set embedded mtime to 0 unconditionally, and ignore mtimes.

prestist commented 2 weeks ago

Well, Im not sure we want to unconditionally set mtime to 0. Other non-ostree systems likely dont need or want that. I think having it set as an argument which can be triggered by a caller who knows what they want is best. The logic has to remain fairly conditional as we use the conditional logic to verify the version of python before trying to replace any bytes since the location that the mtime is placed is version dependent.

keszybz commented 1 week ago

Right, sorry, my comment was not clear.

What I wanted to say is: when invoked by ostree for the special ostree cleanup add-det should set embedded mtime to 0 unconditionally, and ignore mtimes.

keszybz commented 1 week ago

Please see if https://github.com/keszybz/add-determinism/pull/27 would work for you: add-determinism --handler=pyc-zero-mtime /some/path/.