hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Docker image with digest doesn't pull correctly #11821

Open axsuul opened 2 years ago

axsuul commented 2 years ago

Nomad version

Nomad v1.2.3 (a79efc8422082c4790046c3f5ad92c542592a54f)

Operating system and Environment details

Ubuntu 20.04

Issue

Nomad still seems to have trouble pulling images that include digests. I see that https://github.com/hashicorp/nomad/issues/4211 was resolved a while ago, but I'm still experiencing something off with this particular image.

If I manually pull the image on the client, however, it works:

docker pull meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787

Then running the job again works since the image is now on the client.

Reproduction steps

Attempt to run the job and see that the image cannot be pulled. Then pull the Docker image manually and run the job again; it now runs successfully.

Expected Result

Image with digest can be pulled

Actual Result

Image with digest cannot be pulled

Job file (if appropriate)

job "plex-meta-manager" {
  datacenters = ["dc1"]

  reschedule {
    delay_function = "constant"
  }

  group "plex-meta-manager" {
    network {
      mode = "bridge"
    }

    task "run" {
      driver = "docker"

      config {
        image = "meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787"
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Jan 11 12:59:29 foobar nomad[862]:     2022-01-11T12:59:29.317-0800 [DEBUG] client.driver_mgr.docker: docker pull succeeded: driver=docker image_ref=meisnate12/plex-meta-manager:develop
Jan 11 12:59:29 foobar nomad[862]:     2022-01-11T12:59:29.317-0800 [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 error="no such image"
Jan 11 12:59:29 foobar nomad[862]:     2022-01-11T12:59:29.318-0800 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=d7d9f6c3-2d9b-6529-f544-1cf6e3466f61 task=run error="no such image"
Jan 11 12:59:29 foobar nomad[862]:     2022-01-11T12:59:29.318-0800 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=d7d9f6c3-2d9b-6529-f544-1cf6e3466f61 task=run reason="Error was unrecoverable"
Jan 11 12:59:29 foobar nomad[862]:     2022-01-11T12:59:29.318-0800 [DEBUG] client.alloc_runner: task failure, destroying all tasks: alloc_id=d7d9f6c3-2d9b-6529-f544-1cf6e3466f61 failed_task=run
tgross commented 2 years ago

Hi @axsuul! It looks like this error is bubbling up from coordinator.go#L215, which suggests that the image is getting pulled just fine but then when we inspect the image that we've just pulled, that's when it's failing. I'll see if I can reproduce this and come back with a workaround if I can find one.
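
If it helps anyone reproduce this outside of Nomad, here's a rough sketch of that pull-then-inspect sequence using the fsouza/go-dockerclient library (an illustration only, not the actual driver code; the image reference is the one from this issue):

package main

import (
	"fmt"
	"os"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	// The full reference from the jobspec: repo:tag@sha256:digest
	ref := "meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787"

	client, err := docker.NewClientFromEnv()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// See how the client library splits the reference into repository and tag.
	repo, tag := docker.ParseRepositoryTag(ref)
	fmt.Printf("repository=%q tag=%q\n", repo, tag)

	// Step 1: the pull, which succeeds according to the client logs above.
	err = client.PullImage(docker.PullImageOptions{
		Repository:   repo,
		Tag:          tag,
		OutputStream: os.Stdout,
	}, docker.AuthConfiguration{})
	if err != nil {
		fmt.Fprintln(os.Stderr, "pull failed:", err)
		os.Exit(1)
	}

	// Step 2: inspecting by the original repo:tag@digest string, which is
	// roughly where the "no such image" error in the logs comes from.
	if _, err := client.InspectImage(ref); err != nil {
		fmt.Println("inspect failed:", err)
	}
}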

axsuul commented 2 years ago

Thanks for looking into this @tgross !

axsuul commented 2 years ago

Still seeing this in 1.2.5

tgross commented 2 years ago

@axsuul yeah we haven't had a chance to address it yet. We'll update here once we have more info.

tgross commented 2 years ago

Hi @axsuul! I chased this down and the tl;dr is that the sha you're passing doesn't match the tag, so it's failing as intended!

I reproduced as follows. If we pull the image via the Docker CLI, we get the expected checksum:

$ docker pull meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
docker.io/meisnate12/plex-meta-manager@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787: Pulling from meisnate12/plex-meta-manager
...
Digest: sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
Status: Downloaded newer image for meisnate12/plex-meta-manager@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
docker.io/meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787

$ docker images --digests
REPOSITORY                             TAG       DIGEST                                                                    IMAGE ID       CREATED       SIZE
meisnate12/plex-meta-manager           <none>    sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787   15652afd5f63   4 weeks ago   487MB

But notice the tag develop is no longer pointing to this checksum!

Next, let's remove that image with docker rmi 15652afd5f63 and run our job. The client logs look like the following, which matches what you saw:

2022-02-09T19:30:27.758Z [DEBUG] client.driver_mgr.docker: image pull progress: driver=docker image_name=meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 message="Pulled 9/9 (179.8MiB/179.8MiB) layers: 0 waiting/0 pulling"
2022-02-09T19:30:31.110Z [DEBUG] client.driver_mgr.docker: docker pull succeeded: driver=docker image_ref=meisnate12/plex-meta-manager:develop
2022-02-09T19:30:31.111Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787 error="no such image"
2022-02-09T19:30:31.111Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=0714c356-4aed-5fc1-6310-fb23f895cb29 task=run error="no such image"

But an image was pulled with the develop tag; it just doesn't match the checksum:

$ docker images --digests
REPOSITORY                             TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
meisnate12/plex-meta-manager           develop   sha256:bbd6e585fd7a15fe04dd37ec110c763904b8455f09621337ce65a4c0cc44377f   1b2b1a929215   21 hours ago   488MB

When I went digging through the parseDockerImage function that reads the tag and sha from the image field, I discovered this is intentional behavior of the Docker client library we're using. I'm inclined to agree with their choice rather than what the Docker CLI does, as I would expect the checksum to match the tag!

See also image_test.go#L500-L535 if you want to know more about their rationale behind that, but they're unlikely to change it now because of backwards compatibility.
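
To make the tag-plus-sha situation concrete, here's a small sketch of how such a reference decomposes, using the docker/distribution reference package (an illustration only, not necessarily the code path Nomad takes):

package main

import (
	"fmt"

	"github.com/docker/distribution/reference"
)

func main() {
	img := "meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787"

	named, err := reference.ParseNormalizedNamed(img)
	if err != nil {
		panic(err)
	}

	// A repo:tag@digest reference carries both pieces of information...
	if tagged, ok := named.(reference.Tagged); ok {
		fmt.Println("tag:   ", tagged.Tag()) // develop
	}
	if canonical, ok := named.(reference.Canonical); ok {
		fmt.Println("digest:", canonical.Digest()) // sha256:92b2e3...
	}
	// ...but a registry manifest can only be fetched by one of them (a tag
	// or a digest), so a client has to decide which one wins.
}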

So while I think it might be worth updating the image field docs to make this behavior clear, this is Nomad doing the right thing and protecting you from using a tag that doesn't match the checksum you gave it.

axsuul commented 2 years ago

Hey @tgross really appreciate you looking into this!

Your reply makes sense, but I guess I'm now a bit confused about Docker digests and how they work. Note that referencing digests this way worked fine in Kubernetes; only Nomad has been giving me issues with it.

For example, take the postgres image and use the latest tag with its current digest for amd64:

docker pull postgres:latest@sha256:47bc053e5d09cd5d2b3ff17a1e2142f7051a6d40f9f8028b411f88bf7f973265
docker images --digests | grep postgres

returns the tag as <none>

postgres     <none>     sha256:47bc053e5d09cd5d2b3ff17a1e2142f7051a6d40f9f8028b411f88bf7f973265

Does that mean the tag latest is not pointing to this checksum? From Docker Hub it does seem that the checksum is attached to that tag, or am I misreading here?

[Screenshot: Docker Hub listing for postgres showing the digest under the latest tag]

axsuul commented 2 years ago

Hmm, okay, what makes this even more odd is that this image pulls fine with a digest:

job "librespeed" {
  datacenters = ["dc1"]

  reschedule {
    delay_function = "constant"
  }

  group "librespeed" {
    network {
      mode = "bridge"
    }

    task "run" {
      driver = "docker"

      config {
        image = "adolfintel/speedtest:latest@sha256:df648683e8e03223254b3f4955976b93a328a5773f3ca8531e84afb7a55c8a01"
      }
    }
  }
}

So something seems to be off with the meisnate12/plex-meta-manager image in particular

tgross commented 2 years ago

Your reply makes sense, but I guess I'm now a bit confused about Docker digests and how they work. Note that referencing digests this way worked fine in Kubernetes; only Nomad has been giving me issues with it.

The behavior is a little odd, and likely stems from (1) checksums being added late in the Docker API, and (2) tags being mutable vs checksums being immutable. If you take a look at the Registry API, the manifest request takes a reference parameter, which can be either a tag or a checksum. So when either Docker or Nomad is given an image like repo:tag@sha256:checksum, it effectively throws away the tag and uses the checksum to get the manifest.

In Docker, they must be throwing away the tag entirely, which is why you're seeing <none> in the output. But in Nomad, when we call InspectImage we're using the whole image reference that was passed in, instead of the parsed one. And presumably K8s is doing what Docker does.

So while we're doing the "technically correct" thing here in assuming you really meant the tag + sha, it diverges noticeably from how Docker does it. I'm going to bring this back to the rest of the Nomad team and see what they think about whether we should treat this divergence as a bug.

tgross commented 2 years ago

We had a little chat internally and we're going to treat Nomad's behavior as a bug. We'd prefer to catch this at job validation time, but anything in the config block only gets checked on the client, so we wouldn't be able to surface that nicely to the user. And we can make the change without breaking anyone's existing jobspecs.

To close out this bug we'll use the parseDockerImage logic to detect when we have both a tag and a sha, and use only one of them when making the subsequent InspectImage call. I've marked this for roadmapping.
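
Roughly, the shape of that fix might look like the sketch below, with a hypothetical chooseInspectKey helper (an illustration only, not the final implementation):

package main

import (
	"fmt"

	"github.com/docker/distribution/reference"
)

// chooseInspectKey sketches the proposed fix: when an image reference
// carries both a tag and a digest, inspect by repository@digest only,
// which matches what Docker actually stores after a pull by digest.
func chooseInspectKey(image string) string {
	named, err := reference.ParseNormalizedNamed(image)
	if err != nil {
		return image // unparseable; fall back to the raw string
	}
	if canonical, ok := named.(reference.Canonical); ok {
		return reference.FamiliarName(named) + "@" + canonical.Digest().String()
	}
	return image // no digest present; keep repo:tag as-is
}

func main() {
	fmt.Println(chooseInspectKey(
		"meisnate12/plex-meta-manager:develop@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787"))
	// meisnate12/plex-meta-manager@sha256:92b2e310db7d7fdfaf441574031d06361e97de64166a374517b5ece5b4eec787
}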

axsuul commented 1 year ago

Thanks. Note I'm still seeing this bug in 1.4.3.

aucampia commented 8 months ago

Still seeing this with 1.5.9:

2024-01-06T18:00:06.237Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=alpine:latest@sha256:eece025e432126ce23f223450a0326fbebde39cdf496a85d8c016293fc851978 error="no such image"
2024-01-06T18:00:06.237Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=8087424f-84e7-5497-bbb5-e90519a0a195 task=backup type="Driver Failure" msg="no such image" failed=false
2024-01-06T18:00:06.242Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=8087424f-84e7-5497-bbb5-e90519a0a195 task=backup error="no such image"
2024-01-06T18:00:23.322Z [ERROR] client.driver_mgr.docker: failed getting image id: driver=docker image_name=alpine:latest@sha256:eece025e432126ce23f223450a0326fbebde39cdf496a85d8c016293fc851978 error="no such image"
2024-01-06T18:00:23.323Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=051ff74c-19c1-d4da-42d4-f4aca81a722b task=backup type="Driver Failure" msg="no such image" failed=false
2024-01-06T18:00:23.325Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=051ff74c-19c1-d4da-42d4-f4aca81a722b task=backup error="no such image"