project-stacker / stacker

Build OCI images natively from a declarative format
https://stackerbuild.io
Apache License 2.0

Bug: cache use problem with build_only layers single `--layer-type` #442

Open smoser opened 1 year ago

smoser commented 1 year ago

stacker version

v1.0.0-rc4-8e267fc

Describe the bug

This issue was first described in #431. We made a valid fix there, but it did not fix the issue here.

When using build_only: true for under-layers, stacker can fail to set up a valid container. The fact that the original docker layer was a 'tar' layer is also likely related.

The following comment at the beginning of lxcRootfsString in pkg/overlay/metadata.go is not correct for all use cases:

// find any manifest to mount: we don't care if this is tar or
// squashfs, we just need to mount something. the code that generates
// the output needs to care about this, not this code.
//
// if there are no manifests (this came from a tar layer or whatever),
// that's fine too; we just end up with two workaround directories as
// below

lxcRootfsString will iterate over the ovl.Manifests dictionary and pick the first manifest it finds. In the case where stacker is only building squashfs, a stacker file like the one below will fail if the dictionary traversal does not select 'squashfs+true' first.

minbase:
  build_only: true
  from:
    type: docker
    url: docker://busybox:latest
  run: |
    echo hello > /minbase.txt

rootfs:
  from:
    type: built
    tag: minbase
  run: |
    [ -e /minbase.txt ]

The problem can be seen in the serialized roots/minbase/overlay_metadata.json: the 'tar+false' entry is missing a layer (it has only 1, where the 'squashfs+true' entry has 2). The file below is trimmed.

{
    "Manifests": {
        "squashfs+true": {
            "schemaVersion": 2,  
            "config": {
                "mediaType": "application/vnd.oci.image.config.v1+json",
                "digest": "sha256:6f915f...3c821cd1688dc",
                "size": 576
            },
            "layers": [
                {
                    "mediaType": "application/vnd.stacker.image.layer.squashfs+zstd+verity",
                    "digest": "sha256:243c9d7...f482880",
                    "size": 2301952
                },
                {
                    "mediaType": "application/vnd.stacker.image.layer.squashfs+zstd+verity",
                    "digest": "sha256:ad18d87c6...1a58280252",
                    "size": 8192
                }
            ]
        },
        "tar+false": {
            "schemaVersion": 2,
            "config": {
                "mediaType": "application/vnd.oci.image.config.v1+json",
                "digest": "sha256:3488e6e2e...0edb4b6cc7",
                "size": 575
            },
            "layers": [
                {   
                    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
                    "digest": "sha256:1487bff95...bc5621",                
                    "size": 2592227
                }
            ]
        }
    },
...
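
To illustrate what an order-independent selection could look like, here is a minimal Go sketch. It is not stacker's code and not the fix that was applied: the `manifest` type, the `pickManifest` helper, and the layer names are hypothetical stand-ins, and the keys simply mirror the ones in the JSON above. The idea is to look up the wanted key directly and only fall back to a sorted key order, so the result never depends on map traversal.

```go
package main

import (
	"fmt"
	"sort"
)

// manifest is a simplified, hypothetical stand-in for the per-layer-type
// manifest stored under "Manifests"; only the layer list matters here.
type manifest struct {
	Layers []string
}

// pickManifest (hypothetical helper) returns the entry for the preferred key
// when it exists. Otherwise it falls back to the lexically smallest key, so
// the choice is the same on every call. ok is false for an empty map.
func pickManifest(manifests map[string]manifest, preferred string) (key string, m manifest, ok bool) {
	if m, found := manifests[preferred]; found {
		return preferred, m, true
	}
	keys := make([]string, 0, len(manifests))
	for k := range manifests {
		keys = append(keys, k)
	}
	if len(keys) == 0 {
		return "", manifest{}, false
	}
	sort.Strings(keys)
	return keys[0], manifests[keys[0]], true
}

func main() {
	// Layer names are placeholders; the counts match the trimmed JSON above.
	manifests := map[string]manifest{
		"squashfs+true": {Layers: []string{"base", "minbase"}},
		"tar+false":     {Layers: []string{"base"}},
	}
	key, m, _ := pickManifest(manifests, "squashfs+true")
	fmt.Println(key, len(m.Layers))
}
```

Whether the preferred key should come from the requested `--layer-type` or from some other rule is a design decision this sketch does not try to settle.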

To reproduce

The attached recreate.sh will reproduce the bug.

It reads the following environment variables:

Changing the value of BUILD_ONLY to 'false' or LAYER_TYPES to 'squashfs,tar' (or 'tar,squashfs') will cause the issue to not reproduce.

The problem only occurs with stacker files that have 'build_only: true' and are built with '--layer-type=squashfs'.

Additional context

My bootkit project builds artifacts using stacker. It organizes these artifacts into a few layers that are to be published. It heavily uses 'build_only: true' and uses 'stacker publish' to publish the layers.

Due to this bug, the bootkit CI build sees transient failures.

My options to avoid the bug are:

Both of these options incur a lot of extra CPU and I/O, and the second one requires maintaining a list of what to publish somewhere other than stacker.yaml.

smoser commented 1 year ago

Just for ease of viewing, I'll describe what 'recreate.sh' does. It basically just loops over a build of the following stacker.yaml, defining NUMBER each time so that 'rootfs' is forced to be built.

It may be relevant that 'docker://busybox:latest' is initially a tar layer that gets converted to squashfs by stacker.

stacker.yaml:

minbase:
  build_only: true
  from:
    type: docker
    url: docker://busybox:latest
  run: |
    echo hello > /minbase.txt

rootfs:
  from:
    type: built
    tag: minbase
  run: |
    n=${{NUMBER}}
    [ -e /minbase.txt ] && echo "run $n good" ||
        { echo "run $n bad"; exit 1; }

And then:

n=0
while [ $n -lt 50 ] && n=$((n+1)); do
    stacker build --substitute=NUMBER=$n || exit
done

smoser commented 1 year ago

@hallyn, do you think this is fixed by #454? If so, can we validate that and close?

hallyn commented 1 year ago

It doesn't fix it. It fails after a random number of iterations - my last attempt hit

+ echo 'run 23 bad'
run 23 bad

hallyn commented 1 year ago

https://pastebin.com/PM24dtr4 shows the backtrace.

hallyn commented 1 year ago

Perhaps this failure is due to using fuse for atomfs without the mount-is-ready notification channel.

smoser commented 1 year ago

> Perhaps this failure is due to using fuse for atomfs without the mount-is-ready notification channel.

It's not. It is golang dict (map) ordering.
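
For readers unfamiliar with the behavior: the Go specification leaves map iteration order unspecified and does not guarantee it is the same from one iteration to the next. The small sketch below, which is not taken from stacker, counts which key a `range` loop visits first when it breaks after one element, mimicking "pick the first manifest found". The `firstKeys` helper is hypothetical, and the keys and values mirror the layer counts reported above.

```go
package main

import "fmt"

// firstKeys (hypothetical helper) ranges over m `rounds` times, stops at the
// first element each time, and counts how often each key was visited first.
func firstKeys(m map[string]int, rounds int) map[string]int {
	seen := make(map[string]int)
	for i := 0; i < rounds; i++ {
		for k := range m {
			seen[k]++
			break // mimic "pick the first manifest it finds"
		}
	}
	return seen
}

func main() {
	// Values are the layer counts from the overlay_metadata.json above.
	layers := map[string]int{"squashfs+true": 2, "tar+false": 1}
	for k, n := range firstKeys(layers, 1000) {
		fmt.Printf("%s was first %d times\n", k, n)
	}
}
```

With the standard Go toolchain both keys typically show up as "first" at least some of the time, which would be consistent with the build failing only after a random number of iterations.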