Deployment went through even though image wasn't built/pushed properly

elderapo commented 6 months ago

What happened?

I recently upgraded @pulumi/docker from v3 to v4. Everything seemed okay until a couple of hours ago.

It appears that pulumi docker either failed to build the image (got stuck at some step? I make extensive use of Docker multi-stage builds) or failed to push it to the Docker registry, yet it still went through preview successfully (buildOnPreview was set to true) and then deployed the changes to the k8s cluster. The result was a broken deployment where some pods (not all) had their images set to <none>@<none>.

2024-02-29T23:24:02.0491260Z  ~  kubernetes:apps/v1:Deployment microservice-***** updating (73s) warning: [Pod *****--staging/microservice-*****-6c9485c5c6-hg84z]: containers with unready status: [microservice-*****] -- [InvalidImageName] Failed to apply default image tag "<none>@<none>": couldn't parse image reference "<none>@<none>": invalid reference format
...
2024-02-29T23:32:56.1344667Z  ~  kubernetes:apps/v1:Deployment microservice-***** updating (603s) error: 3 errors occurred:
2024-02-29T23:32:56.1346837Z  ~  kubernetes:apps/v1:Deployment microservice-***** **updating failed** error: 3 errors occurred:
2024-02-29T23:32:56.1348312Z @ Updating.......
2024-02-29T23:32:56.1349667Z     pulumi:pulumi:Stack *****-*****--staging running error: update failed
2024-02-29T23:32:56.1351425Z     pulumi:pulumi:Stack *****-*****--staging **failed** 1 error
2024-02-29T23:32:56.1352644Z Diagnostics:
2024-02-29T23:32:56.1353688Z   kubernetes:apps/v1:Deployment (microservice-*****):
2024-02-29T23:32:56.1354806Z     error: 3 errors occurred:
2024-02-29T23:32:56.1357230Z        * the Kubernetes API server reported that "*****--staging/microservice-*****" failed to fully initialize or become live: 'microservice-*****' timed out waiting to be Ready
2024-02-29T23:32:56.1360619Z        * Minimum number of Pods to consider the application live was not attained
2024-02-29T23:32:56.1364341Z        * [Pod *****--staging/microservice-*****-6c9485c5c6-hg84z]: containers with unready status: [microservice-*****] -- [InvalidImageName] Failed to apply default image tag "<none>@<none>": couldn't parse image reference "<none>@<none>": invalid reference format

Example

Unfortunately, I am unable to provide an example or reproduce the bug; however, here are snippets showing how the DockerProvider/Image instances were constructed:

const imageName = new docker.Image(
    options.name,
    {
      build: {
        context: MONOREPO_ROOT_DIRECTORY,
        dockerfile: join(MONOREPO_ROOT_DIRECTORY, "docker", `Dockerfile.production-${options.name}`),
        target: `${options.name}-release`,
        args: options.args,
        platform: "linux/amd64",
        builderVersion: "BuilderBuildKit",
      },
      imageName: `${dockerRegistryConfig.host}/${dockerRegistryConfig.username}/${options.name}`,
      buildOnPreview: true,
      registry: {
        server: dockerRegistryConfig.host,
        username: dockerRegistryConfig.username,
        password: dockerRegistryConfig.password,
      },
    },
    {
      provider: options.appEnvironment.dockerProvider,
    },
  ).repoDigest;

  const dockerProvider = new DockerProvider("docker-provider", {
    host: env.get("DOCKER_HOST").asString(),
    // prettier-ignore
    sshOpts: [
      "-o", "StrictHostKeyChecking=no",
      "-o", "UserKnownHostsFile=/dev/null",

      "-o", "ControlMaster=auto",
      "-o", "ControlPath=~/.ssh/control-%C",
      "-o", "ControlPersist=yes",
    ],
    registryAuth: [
      {
        address: dockerRegistryConfig.host,
        username: dockerRegistryConfig.username,
        password: dockerRegistryConfig.password,
      },
    ],
  });

Output of `pulumi about`

I am unable to get the output of pulumi about because the issue in question occurred in a GitHub Action (and I no longer have access to the VM instance).

Pulumi version was 3.107.0, installed through pulumi/actions@v5 (SHA:76683de37aa44910871ba6cef36557780f2e41d1). OS: Ubuntu 22.04.

Additional context

I suspect the issue might have been caused by temporary network issues between the "dedicated docker image builder server" and the GitHub Actions VM (where Pulumi is executed).

mjeffryes commented 6 months ago

Thanks for the bug report @elderapo. We'll keep an eye out for this to see if we can track it down. In the meantime, if you do find a consistent reproduction, please let us know!

alfred-stokespace commented 4 months ago

This is hitting me as well, in the GHE Workflow Action actions/pulumi-actions@v4.

    "dependencies": {
        "@pulumi/aws": "^6.0.0",
        "@pulumi/awsx": "^2.0.2",
        "@pulumi/pulumi": "^3.113.0",
        "typescript": "^5.0.0"
    }

For me, the v4 tag points to the Jun 5, 2023 commit/4204b4e8a7e703da96ba5dd4c3a667adeee35812, which looks to be v4.4.

In my case I have two new docker.Image(...) instances in the same stack, each building a different Dockerfile.

I need the .repoDigest from both so I can do a follow-up deploy.

But this fails intermittently because the .repoDigest is <none>@<none>, which is not an acceptable input to my FargateTaskDefinition.

What's particularly odd is that one of the new docker.Image(...) instances produces its repoDigest correctly.

I had this happen earlier this week too, but that time it was the other of the two images that came out as <none>@<none>.

The first time it happened, I resorted to commenting out the declaration of the offending image, building, uncommenting it, and then building again.

Just now, I tried...

A local pulumi up:
```
   Resources:
         4 unchanged
```
So that didn't help; pulumi stack output still has one of the two images as <none>@<none>.
Next, I'll try removing the assignment of the repoDigest to the exported output:
```
   Outputs:
      - ghApiExporterImage           : "<none>@<none>"
```
Yes to that.

Now, if I add it back, will I get the goods? Nope:

   Outputs:
   + ghApiExporterImage           : "<none>@<none>"

So now I guess the issue is not with the output but with the resource? I guess I will do what I did last time and delete the object (which requires temporarily refactoring my code to allow return types to be undefined through a layer of contract/interface code, i.e. it's a hassle).
OK, now up deleted the resource and the output.
Now I revert all the interface changes and revert the delete-of-the-docker-resource.
I'm going to try rerunning this in Actions now rather than locally... see what that does...
It worked... I now have two proper digest URLs and Fargate is happy again.

A couple of thoughts... Obviously it'd be great if this just didn't break, but when it does break, having to change the code, run pulumi, revert the code, and run pulumi again makes for a real headache.

I wonder, would a pinpoint state delete command, followed by an up, be the way to go here (until you fix it, that is)?
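Whatever the targeted fix ends up being, it helps to know which exported outputs are affected before touching any state. A minimal sketch, assuming the JSON printed by `pulumi stack output --json` has already been parsed into a plain object; the helper name `findPlaceholderOutputs`, the second output name, and the registry host in the example are hypothetical and not taken from my stack:

```typescript
// Placeholder value that showed up in the stack outputs above.
const PLACEHOLDER_OUTPUT = "<none>@<none>";

/**
 * Hypothetical helper: returns the names of stack outputs whose value is
 * exactly the placeholder digest, sorted alphabetically.
 */
function findPlaceholderOutputs(outputs: Record<string, unknown>): string[] {
  return Object.entries(outputs)
    .filter(([, value]) => value === PLACEHOLDER_OUTPUT)
    .map(([name]) => name)
    .sort();
}

// Example input: one broken output and one made-up healthy output.
const exampleOutputs: Record<string, unknown> = {
  ghApiExporterImage: "<none>@<none>",
  otherImage: "registry.example.com/team/other@sha256:" + "0".repeat(64),
};

// Contains only "ghApiExporterImage" for the example above.
const brokenOutputs = findPlaceholderOutputs(exampleOutputs);
```

This only reports which outputs are broken; it does not change any state, so it is safe to run before deciding how to repair the resource.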

elderapo commented 4 months ago

@alfred-stokespace If you can reproduce it on demand, please create a simple repro repo. It should help the Pulumi team to get to the bottom of this issue. It only happened to me a couple of times around the time I opened this issue.

Meanwhile, I am using this trick to prevent accidental deployments when images don't build successfully:

import * as docker from "@pulumi/docker";
import { ResourceError } from "@pulumi/pulumi";

const image = new docker.Image(...);

const validatedImage = image.repoDigest.apply(digest => {
  if (digest === "<none>@<none>") {
    throw new ResourceError(
      `Digest(${digest}) is "<none>@<none>"! Image either failed to build or push...`,
      image,
      true,
    );
  }

  /**
   * Possibilities:
   * sha256:xxx
   * docker.io/user/repo@sha256:xxx
   */
  if (!digest.includes("sha256:")) {
    throw new ResourceError(
      `Digest(${digest}) does not include sha256 prefix! Image either failed to build or push...`,
      image,
      true,
    );
  }

  return digest;
});
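If you have several images (as in the two-image case above), the same check can be pulled into a plain function so it can be reused and unit-tested without a Pulumi program. This is only a sketch: the name `assertUsableDigest` and the `imageLabel` parameter are hypothetical, it throws a plain Error instead of ResourceError to stay dependency-free, and the regular expression is stricter than the substring check above because it assumes a digest is always `sha256:` followed by 64 lowercase hex characters.

```typescript
// Placeholder that repoDigest carried in the failures reported in this thread.
const PLACEHOLDER_DIGEST = "<none>@<none>";

// Accepts "sha256:<64 hex>" or "<repository>@sha256:<64 hex>".
// Assumption: digests are lowercase hex, 64 characters long.
const DIGEST_PATTERN = /^(?:[^@\s]+@)?sha256:[a-f0-9]{64}$/;

/**
 * Hypothetical helper: returns the digest unchanged when it looks usable,
 * otherwise throws with a message naming the image.
 */
function assertUsableDigest(digest: string, imageLabel: string): string {
  if (digest === PLACEHOLDER_DIGEST) {
    throw new Error(
      `${imageLabel}: repoDigest is "${PLACEHOLDER_DIGEST}"; the image probably failed to build or push`,
    );
  }
  if (!DIGEST_PATTERN.test(digest)) {
    throw new Error(`${imageLabel}: repoDigest "${digest}" is not a sha256 digest reference`);
  }
  return digest;
}

// Possible wiring inside a Pulumi program (not executed here, needs @pulumi/docker):
//   const digest = image.repoDigest.apply(d => assertUsableDigest(d, "my-image"));
```

Throwing inside apply is what makes the update fail before the Deployment or task definition consumes the bad value; swapping Error for ResourceError, as in the snippet above, attaches the failure to the image resource in the diagnostics.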
