vmware-archive / dependency-labeler

Dependency Labeler adds metadata about a container image's dependencies as a label to the container image. Formerly maintained by the NavCon team, currently maintained by the Source Insight Tooling team
Other
10 stars 5 forks source link

Deplab is not idempotent #5

Closed XanderStrike closed 5 years ago

XanderStrike commented 5 years ago

When running deplab as part of a script, it would be nice to be able to re-run the script without having deplab change the digest.

I can't actually tell what deplab is changing about the image each time you run it, but it returns a different digest every time. Ideally deplab should be able to detect if it's already added metadata to a container, and complete instantly.

XanderStrike commented 5 years ago

Actually on further investigation it seems like the digest isn't even actually the digest for the image. This is pretty confusing since it looks like an image digest 😐

$ deplab -g /tmp/istio -i gcr.io/cf-routing/pilot:1.3.0 -t gcr.io/cf-routing/pilot:1.3.0
sha256:fd2f51805d46eaa37c1544e3a9cdaccffb245f141874763adc297562d1c4f964

$ docker push gcr.io/cf-routing/pilot:1.3.0
The push refers to repository [gcr.io/cf-routing/pilot]
aa1df9b87be4: Layer already exists
a1d0ac46357e: Layer already exists
f1842ad5b70f: Layer already exists
e79142719515: Layer already exists
aeda103e78c9: Layer already exists
2558e637fbff: Layer already exists
f749b9b0fb21: Layer already exists
1.3.0: digest: sha256:334dbe446972a71bea0fd9a3c9ad2e89e52e613d660d356ab703e3405e5a478b size: 1786
stonish commented 5 years ago

The value output by deplab is actually the Image ID for the newly created image as opposed to the repo digest. For example, we have Istio images in GCR without deplab metadata:

$ docker pull gcr.io/cf-navcon/istio/pilot:latest
$ docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
gcr.io/cf-navcon/istio/pilot   latest              85af27504b36        4 months ago        283MB
$ docker inspect gcr.io/cf-navcon/istio/pilot:latest | jq '.[0] | .Id, .RepoDigests'
"sha256:85af27504b36a4117d96e74a294b62d100e92fef9a3830b8c82a4137d92cbab2"
[
  "gcr.io/cf-navcon/istio/pilot@sha256:428959abad1e61f4adbfb3b35ec61d85c952c7c46696e15adc7b3863eb9719f5"
]

Then, running deplab against this image we get a new image without a repo digest:

$ deplab -g ~/workspace/stacks/ -i gcr.io/cf-navcon/istio/pilot:latest -t gcr.io/cf-navcon/istio/pilot:deplab
sha256:19cb06772618d7761e03c25657bb5bfc83dfdd0feecf1d0a7f372bac8e4b6851
$ docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
gcr.io/cf-navcon/istio/pilot   deplab              19cb06772618        5 minutes ago       283MB
gcr.io/cf-navcon/istio/pilot   latest              85af27504b36        4 months ago        283MB
$ docker inspect gcr.io/cf-navcon/istio/pilot:deplab | jq '.[0] | .Id, .RepoDigests'
"sha256:19cb06772618d7761e03c25657bb5bfc83dfdd0feecf1d0a7f372bac8e4b6851"
[]

When pushing this back to GCR, the ID does not change but we get a new RepoDigest:

$ docker push gcr.io/cf-navcon/istio/pilot:deplab
The push refers to repository [gcr.io/cf-navcon/istio/pilot]
f1a49cc17b70: Layer already exists
7892b678008a: Layer already exists
4542e6d725c6: Layer already exists
c668b1d642cb: Layer already exists
d908c11b1a82: Layer already exists
7ccfaa7554e3: Layer already exists
89ec57aea3bf: Layer already exists
a0c1e01578b7: Layer already exists
deplab: digest: sha256:29b33cd01e6e36bd461a316b7ca1c5c7da9de248018778e23363c2208b292664 size: 1995
$ docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
gcr.io/cf-navcon/istio/pilot   deplab              19cb06772618        5 minutes ago       283MB
gcr.io/cf-navcon/istio/pilot   latest              85af27504b36        4 months ago        283MB
$ docker inspect gcr.io/cf-navcon/istio/pilot:deplab | jq '.[0] | .Id, .RepoDigests'"sha256:19cb06772618d7761e03c25657bb5bfc83dfdd0feecf1d0a7f372bac8e4b6851"
[
  "gcr.io/cf-navcon/istio/pilot@sha256:29b33cd01e6e36bd461a316b7ca1c5c7da9de248018778e23363c2208b292664"
]

Would it reduce potential confusion if the output from deplab clearly stated that this was the image ID?

In terms of making deplab idempotent, we have put some thought into this but it is not entirely straightforward. We could simply inspect the image metadata when deplab was called, check for the presence of the io.pivotal.metadata label and exit if it already exists. However, suppose someone has already labelled an image using deplab, then used that image as the base for another image which adds a bunch of additional packages. If deplab were to exit upon finding that the io.pivotal.metadata label already exists, the metadata would be missing those additional packages. With our current implementation on the other hand, deplab will overwrite the io.pivotal.metadata label in this case with an accurate description of this new image and all of its packages.

An alternative approach would be for deplab to always extract the full metadata from an image, then compare this to any metadata labels already present on that image and only re-label the image if a discrepancy has been found. This would ensure that the ID of the image would only change if deplab needed to create a new layer with fresh metadata.

I'm curious about the use-case though. So far, we have been assuming that deplab would only be run for an image once. This would happen in some automated process essentially immediately after the image has been built and before that new image get's pushed to a registry. In that case, we imagined that if deplab was run on an image which already had the io.pivotal.metadata label, it would only be because something had changed and the image needed to be re-labelled. If this isn't true for Istio, I'd like to learn more about what the flow would actually be like. (I'm happy to chat over Zoom if it's easier.)

XanderStrike commented 5 years ago

Fabulous! I didn't realize it was a different SHA, my Docker knowledge shortcomings are showing!

The use case (if you can call it that) is I'm running deplab once right after building the image to label and tag it, and once again later to get the dpkg list, and I was surprised to see the SHAs change. I was mistakenly using the SHA returned by deplab to try to uniquely identify the image (i.e. pulling out an image by digest), but that needs to be the repo SHA. Switching to this solved my issue.

I still think idempotency is generally valuable on tools like this but I don't have a valid use case for it at the moment, so we can consider this issue resolved. Thanks!

stonish commented 5 years ago

Sounds good, thanks!

FWIW, we currently recommend using deplab once to label the image and generate the dpkg list at the same time. We could consider a feature in future which would allow you to generate the dpkg list while explicitly skipping the image labelling step. However, we're actually planning to remove the need for the dpkg list file from the OSL process soon and rely solely on the metadata on the image.

I agree in general that making deplab idempotent would be nice so I've added a story for this to our icebox. It may not be a high priority for us right now though. I've also added a story to make deplab's output easier to understand.