nlewo / nix2container

An archive-less dockerTools.buildImage implementation
Apache License 2.0

Build within nix2container image: "Digest did not match" from Skopeo #97

Closed by kolloch 9 months ago

kolloch commented 10 months ago

Problem

When building Docker images with nix2container inside an image that was itself built with nix2container, I get weird errors like this:

Running skopeo --insecure-policy copy nix:/nix/store/47g6iqca8nhldrrin16imczai5fcj929-image-nix-ci.json $@
Getting image source signatures
Copying blob 440aaf385e0f done  
Copying blob afaeec1f88a2 done  
Copying blob 59f7db87f4ab done  
Copying blob 00a004aa7f8b done  
Copying blob 20ea6e6cb56a done  
FATA[0000] writing blob: happened during read: Digest did not match, expected sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe, got sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a 
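
For readers unfamiliar with this error: it means that the bytes streamed for a blob hash to something other than the digest recorded in the image description. The following is a minimal sketch of that kind of check, not Skopeo's actual implementation; the function names `sha256_digest` and `verify_blob` are hypothetical.

```python
import hashlib
import io


def sha256_digest(stream, chunk_size=1 << 20):
    """Return the digest string ("sha256:<hex>") of a binary stream."""
    h = hashlib.sha256()
    # Read in chunks so large layers do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return "sha256:" + h.hexdigest()


def verify_blob(stream, expected):
    """Raise ValueError if the stream's digest differs from the expected one.

    Hypothetical helper: it only illustrates the comparison behind a
    "Digest did not match" error.
    """
    actual = sha256_digest(stream)
    if actual != expected:
        raise ValueError(
            f"Digest did not match, expected {expected}, got {actual}"
        )
    return actual


if __name__ == "__main__":
    blob = b"example layer bytes"  # placeholder data, not a real layer
    recorded = sha256_digest(io.BytesIO(blob))
    verify_blob(io.BytesIO(blob), recorded)  # same bytes: passes silently
    try:
        verify_blob(io.BytesIO(blob + b"!"), recorded)  # altered bytes
    except ValueError as err:
        print(err)
```

In the log above, the expected digest is the one recorded in the image JSON and the other digest is what the layer content actually hashed to when it was read.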

Reproduction

This repo contains everything to reproduce the error for me.

With docker and local copy:

  1. nix run -L .\#containers.x86_64-linux.nix-ci.copyToDockerDaemon
  2. docker run --privileged -v $PWD:/workspace -v ~/.docker:/home/user/.docker --workdir /workspace -it $(nix eval --raw ".#containers.x86_64-linux.nix-ci.imageRefUnsafe") nix run .\#containers.x86_64-linux.nix-ci.copyTo oci:oci_sample_inside

Ideas

I wonder if it is some sort of store path corruption?

I built it both with and without https://github.com/nlewo/nix2container/pull/96.

kolloch commented 10 months ago

If I compare inside/outside of the container, I get e.g. this difference:

result_store_inside/nix/store/rc4p7zzhr481wl1vfh3zmr5kgb26shyc-layers.json/layers.json
--- result_store_outside/nix/store/rc4p7zzhr481wl1vfh3zmr5kgb26shyc-layers.json/layers.json 1970-01-01 01:00:01.000000000 +0100
+++ result_store_inside/nix/store/rc4p7zzhr481wl1vfh3zmr5kgb26shyc-layers.json/layers.json  1970-01-01 01:00:01.000000000 +0100
@@ -1,8 +1,8 @@
 [
    {
-       "digest": "sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a",
+       "digest": "sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe",
        "size": 490496,
-       "diff_ids": "sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a",
+       "diff_ids": "sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe",
        "paths": [
            {
                "path": "/nix/store/x4xwf94jik4dmw15q3m2vihi7kgn7qxf-policy.json"

kolloch commented 10 months ago

And here is the diff of the whole image JSON. Interestingly, the store paths are NOT different, but the hashes are.

--- result_store_outside/nix/store/47g6iqca8nhldrrin16imczai5fcj929-image-nix-ci.json   1970-01-01 01:00:01.000000000 +0100
+++ result_store_inside/nix/store/47g6iqca8nhldrrin16imczai5fcj929-image-nix-ci.json    1970-01-01 01:00:01.000000000 +0100
@@ -229,9 +229,9 @@
            "mediatype": "application/vnd.oci.image.layer.v1.tar"
        },
        {
-           "digest": "sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a",
+           "digest": "sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe",
            "size": 490496,
-           "diff_ids": "sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a",
+           "diff_ids": "sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe",
            "paths": [
                {
                    "path": "/nix/store/x4xwf94jik4dmw15q3m2vihi7kgn7qxf-policy.json"
@@ -295,9 +295,9 @@
            "mediatype": "application/vnd.oci.image.layer.v1.tar"
        },
        {
-           "digest": "sha256:9db223ef3ca7b32de6b8bd43488490aad02c5c3a0b41a1a6be015ff1ff1cb26a",
+           "digest": "sha256:440aaf385e0f8881fc63c339b236cf9e67320b93516abaacd6c73802052bfaf8",
            "size": 8797696,
-           "diff_ids": "sha256:9db223ef3ca7b32de6b8bd43488490aad02c5c3a0b41a1a6be015ff1ff1cb26a",
+           "diff_ids": "sha256:440aaf385e0f8881fc63c339b236cf9e67320b93516abaacd6c73802052bfaf8",
            "paths": [
                {
                    "path": "/nix/store/yyp0vrv39vgir6w2riq5s4cnjnxk581p-mkDirs"
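
A comparison like the one in these diffs can also be scripted. The sketch below pairs layers by their store paths and reports those whose digest differs. It assumes the JSON is either a bare list of layers (as in layers.json) or an object with a "layers" list, and that each layer has "digest" and "paths" fields as in the excerpts above; the helper names are hypothetical.

```python
import json


def load_layers(path):
    """Load the layer list from a layers.json or image JSON file.

    Assumption: the file is either a bare list of layers or an object
    holding the list under a "layers" key.
    """
    with open(path) as f:
        data = json.load(f)
    return data if isinstance(data, list) else data.get("layers", [])


def layer_key(layer):
    """Identify a layer by the sorted store paths it contains."""
    return tuple(sorted(p["path"] for p in layer.get("paths") or []))


def changed_digests(outside_layers, inside_layers):
    """Return (paths, outside_digest, inside_digest) for layers that have
    the same store paths but a different digest."""
    inside_by_key = {layer_key(layer): layer for layer in inside_layers}
    changes = []
    for layer in outside_layers:
        other = inside_by_key.get(layer_key(layer))
        if other is not None and other["digest"] != layer["digest"]:
            changes.append((layer_key(layer), layer["digest"], other["digest"]))
    return changes


if __name__ == "__main__":
    # Values copied from the first diff hunk above.
    path = "/nix/store/x4xwf94jik4dmw15q3m2vihi7kgn7qxf-policy.json"
    outside = [{
        "digest": "sha256:17ce2732d9ad185f1ad8e3705211fd7a94dea9888d63a57c4e1e1f1f48c67c0a",
        "paths": [{"path": path}],
    }]
    inside = [{
        "digest": "sha256:20ea6e6cb56aa5dc118ae4b4fbeddbb221ddca60a5e13677fee3c44f079ee3fe",
        "paths": [{"path": path}],
    }]
    for paths, before, after in changed_digests(outside, inside):
        print(paths, before, "->", after)
```

Layers without any recorded paths all share the same key in this sketch, so they would need a different pairing strategy.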

kolloch commented 10 months ago

I compared the path values in the ValidPaths table inside the container with the contents of the /nix/store directory, and the paths were the same. (This was with the patch applied; I did not check the file contents.)

nlewo commented 10 months ago

I will take a look later, but to debug this kind of issue, I would do something like what is explained in this comment: https://github.com/nlewo/nix2container/issues/23#issuecomment-1147710161

This would allow you to compare files in the layer produced at build time to the files in the layer produced at push time.
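
Once both layer tarballs are available as files, the comparison can be sketched as follows. This is only an illustration and not the procedure from the linked comment; `tar_manifest` and `diff_manifests` are hypothetical helper names, and the file names in the usage note are placeholders.

```python
import hashlib
import tarfile


def tar_manifest(tar_path):
    """Map each member name to its metadata and content hash.

    The tuple is (type, mode, uid, gid, mtime, size, linkname, sha256),
    where sha256 is None for anything that is not a regular file.
    """
    manifest = {}
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            content_hash = None
            if member.isfile():
                data = tar.extractfile(member).read()
                content_hash = hashlib.sha256(data).hexdigest()
            manifest[member.name] = (
                member.type, member.mode, member.uid, member.gid,
                member.mtime, member.size, member.linkname, content_hash,
            )
    return manifest


def diff_manifests(a, b):
    """Return one line per member that is missing or differs between a and b."""
    lines = []
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            lines.append(f"{name}: {a.get(name)} != {b.get(name)}")
    return lines
```

For example, `diff_manifests(tar_manifest("build-time.tar"), tar_manifest("push-time.tar"))` returns an empty list when the two layers contain the same members with the same metadata and content. A difference in mode, ownership, or mtime alone is enough to change the layer digest.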

kolloch commented 10 months ago

Hi @nlewo,

Thanks for the tip! I enabled it in the test repo and also added "scripts" for reproducing the same expression in the two different docker containers.

See https://gitlab.com/nexxiot-labs/nix2container-checksum/-/blob/5c0925e48dd6ff12b5584187c26c64b88719dd76/README.md for instructions and results.

I guess it is somewhat unsurprising that the nix-database derivation is not reproducible, but the other two are weird.

(I also had a read around your code. Nice job! Unfortunately, didn't really see an issue.)

kolloch commented 10 months ago

Hi @nlewo,

Tried to debug this some more and wrote my findings in here https://gitlab.com/nexxiot-labs/nix2container-checksum/-/blob/465dd01ba20887fea1457c4205713b7ae291b99a/README.md

Everything is hopefully reproducible by executing commands such as

nix run .\#x86_64-linux.gitlab.runnables.debug-building-nix-ci-in-nix-ci

These commands also output debug info.

Maybe the most surprising result first: All layers except the top-level layer are actually the same between working and non-working builds. Only the top-level image layer also containing the nix-database differs.

If you add a package to that top-level layer (I did that with hello), then the build for that container succeeds.

I assume that some incorrect substitution happens otherwise, not sure.

The non-working build also has content-addressed hashes in the closure graph for some reason.

kolloch commented 10 months ago

I'd be happy to debug that together. I am in the CET time zone and am rather flexible.

kolloch commented 10 months ago

Hi @nlewo,

I updated my PR quite significantly:

https://github.com/nlewo/nix2container/pull/96#issuecomment-1806394908

With this PR in place, most things are deterministic.

Very weirdly, if you set reproducible=false, then the container builds. If not, the old error appears.

I stared long and hard at functions such as newLayers and at the core piece of the digest/tar implementation, TarPaths.

But they honestly look quite nicely written and good to me :shrug:

kolloch commented 10 months ago

I also updated the README/code of https://gitlab.com/nexxiot-labs/nix2container-checksum so that it should be quite straightforward to reproduce. At least for me, it is fully reproducible.

kolloch commented 10 months ago

Crazy idea: One of the differences I observed in the closure-graph.json was that in the failing case I saw content-addressed hashes.

What if the content-addressing somehow partially rewrites existing derivations and thus changes the digest when they are read later?

Serializing the file into a tar (reproducible = false) works around this issue, since the tar at least doesn't change.
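
To illustrate why that would matter: when the digest is recorded at build time but the tar is generated again at push time, anything that changes in the files in between (content or metadata) yields a different digest. The sketch below is not nix2container's TarPaths implementation; it uses Python's tarfile with normalized metadata, and the store path in the example is hypothetical.

```python
import hashlib
import io
import tarfile


def build_layer_tar(files):
    """Build an uncompressed tar from {name: (content_bytes, mode)}.

    Metadata is normalized (sorted names, mtime 0, uid/gid 0) so that the
    same input always produces the same bytes.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):
            content, mode = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(content)
            info.mode = mode
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def layer_digest(files):
    """Digest of the tar stream generated from the given files."""
    return "sha256:" + hashlib.sha256(build_layer_tar(files)).hexdigest()


if __name__ == "__main__":
    # Hypothetical store path, for illustration only.
    name = "nix/store/example-tool/bin/tool"
    at_build = {name: (b"#!/bin/sh\n", 0o555)}
    recorded = layer_digest(at_build)

    # Regenerating the tar from unchanged files gives the recorded digest.
    assert layer_digest(dict(at_build)) == recorded

    # If the file changes before the push (here only its mode), the
    # regenerated tar no longer matches the recorded digest.
    at_push = {name: (b"#!/bin/sh\n", 0o755)}
    assert layer_digest(at_push) != recorded
```

A tar that is stored once and copied as-is cannot drift in this way, which is consistent with reproducible = false working around the problem.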

kolloch commented 9 months ago

Within the nix-ci container with nix running as user, I can reproduce the issue by:

nix run .#aarch64-linux.gitlab.containers.only-stdenv.copyTo oci:test-oci

from nix2container-checksum, which is built by:

only-stdenv = nix2container.buildImage {
  name = "only-stdenv";
  copyToRoot = [nixpkgs.stdenv];
};

With my fixes from https://github.com/nlewo/nix2container/pull/96 and without stdenv, it actually works now! :)

kolloch commented 9 months ago

I cannot reproduce the issue anymore 🤷