Open axsuul opened 2 years ago
Hi @axsuul! It looks like this error is bubbling up from coordinator.go#L215, which suggests that the image is getting pulled just fine, but then when we inspect the image that we've just pulled, that's when it's failing. I'll see if I can reproduce this and come back with a workaround if I can find one.
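Here's a minimal sketch of that pull-then-inspect sequence, assuming the fsouza/go-dockerclient API (the actual driver code differs; names and error handling are simplified for illustration):

```go
package main

import (
	"fmt"
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}

	// The jobspec's image field, carrying both a tag and a digest.
	image := "meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787"

	// Step 1: pull. The pull is resolved by digest, so this succeeds.
	err = client.PullImage(docker.PullImageOptions{
		Repository: "meisnate12/plex-meta-manager",
		Tag:        "sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787",
	}, docker.AuthConfiguration{})
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: inspect the reference from the jobspec to get its image ID.
	// In the failure case reported here, this is the call that comes back
	// with "no such image" even though the pull succeeded.
	img, err := client.InspectImage(image)
	if err != nil {
		log.Fatalf("failed getting image id: %v", err)
	}
	fmt.Println("image id:", img.ID)
}
```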
Thanks for looking into this @tgross !
Still seeing this in 1.2.5
@axsuul yeah we haven't had a chance to address it yet. We'll update here once we have more info.
Hi @axsuul! I chased this down and the tl;dr is that the sha you're passing doesn't match the tag, so it's failing as intended!
I reproduced as follows. If we pull the image via the Docker CLI, we get the expected checksum:
$ docker pull meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
docker.io/meisnate12/plex-meta-manager@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787: Pulling from meisnate12/plex-meta-manager
...
Digest: sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
Status: Downloaded newer image for meisnate12/plex-meta-manager@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
docker.io/meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
$ docker images --digests
REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE
meisnate12/plex-meta-manager <none> sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 15652afd5f63 4 weeks ago 487MB
But notice the tag develop is no longer pointing to this checksum!
Next let's remove that image with docker rmi 15652afd5f63 and run our job. The client logs look like the following, which matches what you saw:
2022-02-09T19:30:27.758Z [DEBUG] client.driver_mgr.docker: image pull progress: driver=docker image_name=meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 message="Pulled 9/9 (179.8MiB/179.8MiB) layers: 0 waiting/0 pulling"
2022-02-09T19:30:31.110Z [DEBUG] client.driver_mgr.docker: docker pull succeeded: driver=docker image_ref=meisnate12/plex-meta-manager:develop
2022-02-09T19:30:31.111Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 error="no such image"
2022-02-09T19:30:31.111Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=0714c356-4aed-5fc1-6310-fb23f895cb29 task=run error="no such image"
But an image was pulled with the develop tag; it just doesn't match the checksum:
$ docker images --digests
REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE
meisnate12/plex-meta-manager develop sha256:bbd6e585fd7a15fe04dd37ec110c763904b8455f09621337ce65a4c0cc44377f 1b2b1a929215 21 hours ago 488MB
When I went digging through the parseDockerImage function that reads the tag and sha from the image field, I discovered this is an intentional behavior of the Docker client library we're using. I'm inclined to agree with their choice rather than what the Docker CLI does, as I would expect the checksum to match the tag! See also image_test.go#L500-L535 if you want to know more about their rationale behind that, but they're unlikely to change it now because of backwards compatibility.
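To make the parsing concrete, here's a rough standard-library-only sketch of splitting a reference that carries both a tag and a digest. This is not the actual parseDockerImage code, just an illustration of the shape of the problem:

```go
package main

import (
	"fmt"
	"strings"
)

// splitImageRef is an illustrative helper (not Nomad's parseDockerImage):
// it splits "repo:tag@sha256:..." into its repo, tag, and digest parts.
func splitImageRef(ref string) (repo, tag, digest string) {
	if i := strings.Index(ref, "@"); i != -1 {
		ref, digest = ref[:i], ref[i+1:]
	}
	if i := strings.LastIndex(ref, ":"); i != -1 && !strings.Contains(ref[i+1:], "/") {
		ref, tag = ref[:i], ref[i+1:]
	}
	return ref, tag, digest
}

func main() {
	repo, tag, digest := splitImageRef(
		"meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787")
	fmt.Println(repo)   // meisnate12/plex-meta-manager
	fmt.Println(tag)    // develop
	fmt.Println(digest) // sha256:92b2e310...
	// When both a tag and a digest are present, the pull is resolved by the
	// digest and the tag is effectively ignored, which is why the tag you
	// see locally afterwards can differ from the one in the reference.
}
```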
So while I think it might be worth us updating the image field docs to make this behavior clear, this is Nomad doing the right thing and protecting you from using a tag that doesn't match the checksum you gave it.
Hey @tgross really appreciate you looking into this!
Your reply makes sense but I guess I'm now a bit confused about Docker digests and how they work. Note that how I am referencing these digests worked fine in Kubernetes, only Nomad has been giving me issues about this.
For example, taking the postgres image and using the latest tag with its current digest for amd64:
docker pull postgres:latest@sha256:47bc053e5d09cd5d2b3ff17a1e2142f7051a6d40f9f8028b411f88bf7f973265
docker images --digests | grep postgres
returns <none> for the tag:
postgres <none> sha256:47bc053e5d09cd5d2b3ff17a1e2142f7051a6d40f9f8028b411f88bf7f973265
Does that mean the tag latest is not pointing to this checksum? From Docker Hub it does seem that the checksum is attached to that tag, or am I misreading here?
Hmm okay, what makes this even more odd is that this image pulls fine with a digest:
job "librespeed" {
datacenters = ["dc1"]
reschedule {
delay_function = "constant"
}
group "librespeed" {
network {
mode = "bridge"
}
task "run" {
driver = "docker"
config {
image = "adolfintel/speedtest:latest@sha256:df648683e8e03223254b3f4955976b93a328a5773f3ca8531e84afb7a55c8a01"
}
}
}
}
So something seems to be off with the meisnate12/plex-meta-manager image in particular.
> Your reply makes sense but I guess I'm now a bit confused about Docker digests and how they work. Note that how I am referencing these digests worked fine in Kubernetes, only Nomad has been giving me issues about this.
The behavior is a little odd, and likely stems from (1) checksums being added late in the Docker API, and (2) tags being mutable vs checksums being immutable. If you take a look at the Registry API, the manifest request takes a reference parameter, which can be a tag or a checksum. So when either Docker or Nomad is given an image like repo:tag@sha256:checksum, it effectively throws away the tag and uses the checksum to get the manifest.
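As a rough illustration (the registry URL and image name here are just examples, and auth is omitted), the v2 manifest endpoint only has room for a single reference, which can be either a tag or a digest:

```go
package main

import "fmt"

func main() {
	// The Registry v2 manifest endpoint takes a single <reference> path
	// segment, which may be either a tag or a digest:
	//   GET /v2/<name>/manifests/<reference>
	const registry = "https://registry-1.docker.io"
	const name = "meisnate12/plex-meta-manager"

	byTag := fmt.Sprintf("%s/v2/%s/manifests/%s", registry, name, "develop")
	byDigest := fmt.Sprintf("%s/v2/%s/manifests/%s", registry, name,
		"sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787")

	// Given "repo:tag@sha256:...", a client can only send one reference
	// here, so it uses the digest and the tag never reaches the registry.
	fmt.Println(byTag)
	fmt.Println(byDigest)
}
```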
In Docker, they must be throwing away the tag entirely, which is why you're seeing <none> in the output. But in Nomad, when we InspectImage we're using the whole image ID that was passed, instead of the one that's parsed. And presumably K8s is doing what Docker does.
So while we're doing the "technically correct" thing here in assuming you really meant the tag + sha, it diverges noticeably from how Docker does it. I'm going to bring this back to the rest of the Nomad team and see what they think about whether we should treat this divergence as a bug.
We had a little chat internally and we're going to treat Nomad's behavior as a bug. We'd probably like to catch this at job validation time, but anything in the config block only gets checked on the client, so we wouldn't be able to surface that nicely to the user. And we can make the change without breaking anyone's existing jobspecs.
To close out this bug we'll use the parseImage logic to detect when we have both a tag and a sha, and use only the sha or the tag when making the subsequent InspectImage call. I've marked this for roadmapping.
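Roughly, in pseudo-Go (this is a sketch of the idea, not the actual patch), the fix looks like:

```go
package main

import "fmt"

// inspectRef is an illustrative sketch of the planned fix (not the actual
// Nomad change): after pulling, reduce the jobspec reference to a single
// identifier before calling InspectImage, preferring the digest when both
// a tag and a digest are present.
func inspectRef(repo, tag, digest string) string {
	if digest != "" {
		// Docker stores a by-digest pull locally as "repo@sha256:...",
		// so inspect by that rather than "repo:tag@sha256:...".
		return repo + "@" + digest
	}
	if tag != "" {
		return repo + ":" + tag
	}
	return repo + ":latest"
}

func main() {
	fmt.Println(inspectRef(
		"meisnate12/plex-meta-manager",
		"develop",
		"sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787",
	))
	// meisnate12/plex-meta-manager@sha256:92b2e310...
}
```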
Thanks, but note I'm still seeing this bug in 1.4.3.
still seeing this with 1.5.9:
2024-01-06T18:00:06.237Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=alpine:latest@sha256:eece025e432126ce23f223450a0326fbebde39cdf496a85d8c016293fc851978 error="no such image"
2024-01-06T18:00:06.237Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=8087424f-84e7-5497-bbb5-e90519a0a195 task=backup type="Driver Failure" msg="no such image" failed=false
2024-01-06T18:00:06.242Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=8087424f-84e7-5497-bbb5-e90519a0a195 task=backup error="no such image"
2024-01-06T18:00:23.322Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=alpine:latest@sha256:eece025e432126ce23f223450a0326fbebde39cdf496a85d8c016293fc851978 error="no such image"
2024-01-06T18:00:23.323Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=051ff74c-19c1-d4da-42d4-f4aca81a722b task=backup type="Driver Failure" msg="no such image" failed=false
2024-01-06T18:00:23.325Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=051ff74c-19c1-d4da-42d4-f4aca81a722b task=backup error="no such image"
Nomad version
Operating system and Environment details
Ubuntu 20.04
Issue
Nomad seems to be having trouble pulling images with digests still. I see that https://github.com/hashicorp/nomad/issues/4211 was resolved a while ago, but I'm still experiencing something off with this particular image.
If I try to manually pull the image on the client, it works.
Then running the job again works since the image is now on the client.
Reproduction steps
Attempt to run job, see that image cannot be pulled. Then pull Docker image and run job again and it runs successfully.
Expected Result
Image with digest can be pulled
Actual Result
Image with digest cannot be pulled
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)