replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Support reproducible builds #1250

Open salrashid123 opened 1 year ago

salrashid123 commented 1 year ago

Cog currently uses docker to build the images

however, docker-based builds are not reproducible: you'll get different image hashes even with an identical config

this long-term feature request is to refactor the build system from docker to something like


some references building using kaniko and bazel
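
A rough illustration of why identical inputs can still give different image hashes: an image layer is a tar archive, and its digest covers file metadata such as mtime as well as file contents. The sketch below is a toy model only, not how docker or kaniko assemble layers; `layer_digest` is a hypothetical helper.

```python
import hashlib
import io
import tarfile


def layer_digest(files: dict[str, bytes], mtime: int) -> str:
    """Pack files into an in-memory tar and return the sha256 of the archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sorted order so the archive layout does not depend on dict ordering.
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime  # the only thing we vary between "builds"
            tar.addfile(info, io.BytesIO(files[name]))
    return hashlib.sha256(buf.getvalue()).hexdigest()


if __name__ == "__main__":
    files = {"predict.py": b"print('hello')\n"}
    # Same contents, different build time: the digests differ.
    print(layer_digest(files, 1_700_000_000) == layer_digest(files, 1_700_000_060))
    # Same contents, pinned mtime: the digests match.
    print(layer_digest(files, 0) == layer_digest(files, 0))
```

Pinning the mtime (which is roughly what SOURCE_DATE_EPOCH-style approaches aim for) only helps if the file contents themselves are also identical from build to build.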

technillogue commented 1 year ago

Hi, we've investigated this. SOURCE_DATE_EPOCH is a promising direction, and we tried approaches that reset mtime for everything. Unfortunately, pip install is fundamentally irreproducible, because it generates .pyc files that embed a timestamp. Unzipping wheels without using pip might make this possible, or I think there are some PEPs in the works that might help with this. See https://github.com/pypa/pip/issues/5648
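
To make the pyc point concrete, here is a small sketch using only the standard library. It compiles the same source twice with different source mtimes and reads the mtime back out of the pyc header (in a timestamp-based pyc, bytes 8 to 12 hold the source mtime, little-endian). The helper names are made up for illustration, and this shows only the default timestamp mode, not what pip does internally.

```python
import os
import py_compile
import tempfile
from pathlib import Path


def compile_with_mtime(src: Path, out: Path, mtime: int) -> bytes:
    """Set the source mtime, compile in TIMESTAMP mode, return the pyc bytes."""
    os.utime(src, (mtime, mtime))
    py_compile.compile(
        str(src),
        cfile=str(out),
        doraise=True,
        # Passed explicitly, because the default switches to hash-based
        # pycs when SOURCE_DATE_EPOCH is set in the environment.
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    return out.read_bytes()


def embedded_mtime(pyc: bytes) -> int:
    """Read the source mtime stored in a timestamp-based pyc header."""
    return int.from_bytes(pyc[8:12], "little")


def demo() -> tuple[bytes, bytes]:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text("ANSWER = 42\n")
        first = compile_with_mtime(src, Path(tmp) / "first.pyc", 1_000_000_000)
        second = compile_with_mtime(src, Path(tmp) / "second.pyc", 1_000_000_100)
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(embedded_mtime(first), embedded_mtime(second))
    print("identical pyc bytes:", first == second)
```

Because the mtime is inside the file, resetting filesystem timestamps after the fact does not change the pyc contents.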

salrashid123 commented 1 year ago

got it; i think esp with python it'd be difficult to do with its own toolchains.

maybe generating the docker file per https://github.com/replicate/cog/issues/1241#issuecomment-1660128528

and then chaining it to off-the-shelf kaniko would be a sufficient workaround:

docker run \
  -v `pwd`:/workspace \
  -v $HOME/.docker/config_docker.json:/kaniko/.docker/config.json:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  gcr.io/kaniko-project/executor@sha256:034f15e6fe235490e64a4173d02d0a41f61382450c314fffed9b8ca96dff66b2 \
  --dockerfile=Dockerfile \
  --reproducible \
  --destination "docker.io/salrashid123/tpmds:server" \
  --context dir:///workspace/

i realize now we're involving kaniko as well, but it may be easier to delegate it like this for now

technillogue commented 1 year ago

Would that address the pyc timestamps?

salrashid123 commented 1 year ago

i think so; as part of the kaniko reproducible builds, it takes snapshots that reset all the file times.

tried it from the getting started guide and using the generated Dockerfile seems to always reference a file like

COPY .cog/tmp/build1866459875/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl

which doesn't exist and causes kaniko to fail

cog build
cog debug > Dockerfile

docker run \
  -v `pwd`:/workspace \
  -v $HOME/.docker/config_docker.json:/kaniko/.docker/config.json:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  gcr.io/kaniko-project/executor@sha256:034f15e6fe235490e64a4173d02d0a41f61382450c314fffed9b8ca96dff66b2 \
  --dockerfile=Dockerfile \
  --reproducible \
  --destination "docker.io/salrashid123/cogdemo:server" \
  --context dir:///workspace/

salrashid123 commented 1 year ago

oh, so python embeds the timestamp inside the file...then kaniko isn't gonna help out.

...and i can't sincerely recommend going all out and investing in python-bazel builds

technillogue commented 1 year ago

as a stopgap for the debug issue, run cog build once and then interrupt it; it will place a cog wheel in .cog/tmp/whatever, and then you can edit the cog debug output to point at that wheel
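
The manual edit described here could be scripted roughly as follows. This is a sketch only: `rewrite_wheel_copy` is a hypothetical helper, not part of cog, and it assumes the `cog debug` output contains a COPY line shaped like the one quoted earlier in the thread.

```python
import re

# Matches lines like:
#   COPY .cog/tmp/build123/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
WHEEL_COPY = re.compile(
    r"^(COPY\s+)(\.cog/tmp/\S+/)(cog-\S+\.whl)(\s+\S+)$",
    re.MULTILINE,
)


def rewrite_wheel_copy(dockerfile: str, wheel_dir: str) -> str:
    """Point the COPY of the cog wheel at wheel_dir (which must end with '/')."""
    return WHEEL_COPY.sub(
        lambda m: f"{m.group(1)}{wheel_dir}{m.group(3)}{m.group(4)}",
        dockerfile,
    )


if __name__ == "__main__":
    original = (
        "FROM python:3.11\n"
        "COPY .cog/tmp/build1866459875/cog-0.0.1.dev-py3-none-any.whl"
        " /tmp/cog-0.0.1.dev-py3-none-any.whl\n"
    )
    # ".cog/tmp/build42/" stands in for whatever directory the interrupted
    # cog build actually left behind.
    print(rewrite_wheel_copy(original, ".cog/tmp/build42/"))
```

Only the source directory of the COPY is swapped; the wheel file name and the destination path are left as generated.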

does python-bazel address pyc timestamps somehow? does it just strip pyc files?

it would be incredibly helpful for us to get reproducible builds for deduplication

salrashid123 commented 1 year ago

yeah, i tried the interrupt trick suggested, but each cog+kaniko build has a different hash (which is expected, i think)

i'm unsure exactly how bazel rules_python handles pyc files but i can say you need to precisely define everything upfront and bazel uses its own sandbox to canonicalize everything.

some examples with rules_python which may help answer the question though....once it works with rules_python, stitching it with rules_docker and containers would be easy

https://github.com/bazelbuild/rules_python/tree/main/examples

charles-dyfis-net commented 1 year ago

Y'all might also investigate Nix (which provides dockerTools, an alternate build tool for Docker images) towards this end.

Nix converts all timestamps to one second past epoch, btw.

technillogue commented 1 year ago

Does rules_python generate pyc at all? https://github.com/bazelbuild/rules_python/issues/1761

Again, there's no issue with mtimes, the problem is the timestamps embedded in pyc files

charles-dyfis-net commented 1 year ago

> Does rules_python generate pyc at all? bazelbuild/rules_python#1761
>
> Again, there's no issue with mtimes, the problem is the timestamps embedded in pyc files

The NixOS install CD is fully binary reproducible. I can't imagine it not including Python, so clearly they've got that licked somehow.

Indeed, quoting:

   # Determinism: The interpreter is patched to write null timestamps when compiling Python files
   #   so Python doesn't try to update the bytecode when seeing frozen timestamps in Nix's store.
   export DETERMINISTIC_BUILD=1;

technillogue commented 1 year ago

then we would have to ship nix's patched interpreter, right? DETERMINISTIC_BUILD is not present in stock python
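
On the stock-python side: CPython 3.7+ can write hash-based pycs (PEP 552), which store a hash of the source instead of its mtime, so that particular timestamp does not need a patched interpreter. A minimal sketch, assuming bytecode is compiled in a separate step (for example after `pip install --no-compile`); the `/src` prefix passed as `ddir` and the helper names are made up for illustration. It does not address any other source of nondeterminism in an image.

```python
import compileall
import os
import py_compile
import tempfile
from pathlib import Path


def compile_tree(root: Path) -> dict[str, bytes]:
    """Compile every .py under root to hash-based pycs and return their bytes."""
    ok = compileall.compile_dir(
        str(root),
        ddir="/src",  # fixed path recorded in the bytecode instead of the real one
        force=True,
        quiet=1,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )
    if not ok:
        raise RuntimeError(f"compilation failed under {root}")
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*.pyc"))
    }


def build_once(mtime: int) -> dict[str, bytes]:
    """Simulate one build: fresh directory, same source, a given source mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "mod.py"
        src.write_text("ANSWER = 42\n")
        os.utime(src, (mtime, mtime))
        return compile_tree(root)


if __name__ == "__main__":
    early = build_once(1_000_000_000)
    late = build_once(1_700_000_000)
    print("pyc files identical across builds:", early == late)
```

The same mode is available from the command line as `python -m compileall --invalidation-mode unchecked-hash`. With unchecked hashes the interpreter will not notice if a source file changes later, which is usually acceptable inside an immutable image.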

salrashid123 commented 1 year ago

here's an end-to-end example covering building an image with bazel and serving with cog.

if precise build steps are followed, you should end up with

(i verified it on two different clean vms)

as mentioned, using bazel is really tedious, though toolchains like gazelle may help with python. (imo, in its current state, the developer friction all this introduces negates the primary ease-of-use benefits of using/building w/ cog in the first place)

[tbh, i've never used or needed cog and try to not use bazel for deterministic builds (in go there are easier ways)...this issue with cog was something i noticed and then ratholed academically.]

RyzeNGrind commented 8 months ago

I would like to add my +1 for supporting reproducible builds via Nix and NixOS as well.

technillogue commented 8 months ago

https://github.com/datakami/cognix is a project that exists and kind of works but unfortunately isn't a priority for us at this time