Closed: calebschoepp closed this 8 months ago
Ran into this today in testing again. This should probably be high priority, though signs currently point to upstream culprit(s) (e.g. runwasi).
Some potential leads/related issues (mentioning these because the app pod being stuck in 'Terminating' is reliably seen when encountering the caching behavior): https://github.com/containerd/runwasi/issues/418 https://github.com/deislabs/containerd-wasm-shims/issues/207 (links to 418 above)
Put it as a must have in the Initial Release project. CC @macolso
Filed https://github.com/spinkube/containerd-shim-spin/issues/22 because this is a shim issue rather than an operator one.
Same here. Building with:

spin registry login -u $AZURE_CONTAINER_REGISTRY_NAME -p $AZURE_CONTAINER_REGISTRY_PASSWORD $AZURE_CONTAINER_REGISTRY_ENDPOINT
spin registry push --build $AZURE_CONTAINER_REGISTRY_ENDPOINT/spin-dapr-ts:$REVISION

Deploying with the new image tag and then checking the SpinApp:
Name:         distributor
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  core.spinoperator.dev/v1
Kind:         SpinApp
Metadata:
  Creation Timestamp:  2024-03-03T10:56:34Z
  Generation:          2
  Resource Version:    19132
  UID:                 7f8a69ec-fceb-4527-a5e1-bf4b1fa9f533
Spec:
  Checks:
  Deployment Annotations:
    scheduler.alpha.kubernetes.io/node-selector:  agentpool=wasm
  Enable Autoscaling:  true
  Executor:            containerd-shim-spin
  Image:               kw971987ecacr.azurecr.io/spin-dapr-ts:1709464104
This shows that the latest/desired image is referenced, but logging indicates that the changes are not active.
I was also able to reproduce this behavior using the latest bits. IMO, the key to reproducing the issue is the following:

1) Deploy hello-spin, here running app:0.0.1.
2) Update the image to app:0.0.2.
3) Delete hello-spin and re-deploy the app to the cluster with app:0.0.2.
4) New instances spawned will respond with whatever is defined in app:0.0.1.
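A rough sketch of those reproduction steps as commands (all image names, tags, and manifest file names are placeholders, not from the original report):

```shell
# 1) push and deploy the first version
spin registry push ttl.sh/hello-spin:0.0.1
kubectl apply -f spinapp.yaml            # SpinApp.spec.image -> hello-spin:0.0.1
# 2) push the changed app and point the SpinApp at the new tag
spin registry push ttl.sh/hello-spin:0.0.2
kubectl apply -f spinapp.yaml            # SpinApp.spec.image -> hello-spin:0.0.2
# 3) even after deleting and re-applying the SpinApp, new pods still
#    serve the 0.0.1 content
kubectl delete spinapp hello-spin && kubectl apply -f spinapp.yaml
```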
My 2 cents here: there may be two different issues in play here. The second one relates to spec.PublishNotReadyAddresses: leaving it unset will ensure traffic (via the k8s service) is not sent to the terminating pods. @endocrimes, what do you think about the second point above? Thanks.
Setting PublishNotReadyAddresses=true would be a bug - we'd send traffic to pods that aren't ready to receive it (i.e. still potentially compiling the app).
I was looking more into the caching issue, and it seems like it is not due to publishNotReadyAddresses (and we correctly don't configure it in spin-operator). What I did notice is that the image in the DeploymentSpec and PodSpec refers to the new image, but the containerStatuses field in the pod status still refers to the old image. I verified it was a new pod.
e.g. look at the pod spec from my test cluster: https://gist.github.com/rajatjindal/52ee4e796dc37d33f9b41f27dbc7baaf
I am still trying to understand why that might be happening.
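As a sketch of how to spot that mismatch: the trimmed pod JSON below is hypothetical (in a real cluster you would extract the same two fields from kubectl get pod <name> -o json):

```shell
# Hypothetical excerpt of an affected pod: the spec references the new tag,
# while containerStatuses still reports the old image.
cat > /tmp/pod.json <<'EOF'
{
  "spec":   { "containers": [ { "image": "example.azurecr.io/spin-dapr-ts:1709464104" } ] },
  "status": { "containerStatuses": [ { "image": "example.azurecr.io/spin-dapr-ts:1709400000" } ] }
}
EOF
# Pull out the two image fields (spec first, status second) and compare them.
spec_image=$(grep -o '"image": "[^"]*"' /tmp/pod.json | sed -n 1p)
status_image=$(grep -o '"image": "[^"]*"' /tmp/pod.json | sed -n 2p)
[ "$spec_image" != "$status_image" ] && echo "mismatch detected"
```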
We have managed to reproduce this consistently: once fetched by containerd, all Spin applications would have the same image ID, even if the SHA256 of the image was different (and the applications had different content):
# these images have very different Spin applications, yet they have the same image ID:
$ crictl images list
IMAGE                                     TAG          IMAGE ID        SIZE
docker.io/rajatjindal/http-rust-cooler   v1           aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-again    willitwork   aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  f            aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  w            aabedf0b09494   2.34MB
@rajatjindal tracked down this behavior in containerd, specifically where containerd resolves the local image based on the image ID: https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L163-L191. That lookup would always return all Spin applications present on the node (since they all had the same image ID), and the first of those would get picked up (https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L199). This resulted in the behavior we are seeing here: pods with a mismatch between the image in the pod spec and the actual image reported in the status field (and run by containerd).
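As a toy illustration (not containerd's actual Go code) of what that resolution does when every Spin image on the node shares one image ID; the image names below mirror the crictl output above:

```shell
# Simulated local image store: "name-and-tag image-id" pairs, all sharing one ID.
images='docker.io/rajatjindal/http-rust-cooler:v1 aabedf0b09494
docker.io/rajatjindal/spin-rust-again:willitwork aabedf0b09494
docker.io/rajatjindal/spin-rust-example:f aabedf0b09494'

# Resolve an image by ID: every entry matches, so the first match wins,
# no matter which application the pod spec actually asked for.
resolve_by_id() {
  printf '%s\n' "$images" | awk -v id="$1" '$2 == id { print $1; exit }'
}

resolve_by_id aabedf0b09494   # prints docker.io/rajatjindal/http-rust-cooler:v1
```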
The issue boils down to how spin registry push constructs the OCI reference for an application: https://github.com/fermyon/spin/blob/5b3aa0a0426d3c626bccdb3a1bfdd3c657952955/crates/oci/src/client.rs#L179-L187

Every Spin application ever pushed would generate the same OCI image config, because Spin would only set the OS and architecture. This resulted in all Spin applications having the same image ID reported by containerd: sha256:aabedf0b094944b344cecbffcfa0a9172e905a5d156bdf67c4c6cbb6d27d1d92.
This was never an issue when running spin up -f <image-reference>, because the Spin internals do not rely on the OCI image config, only on the locked application file. However, containerd has no knowledge of this, and since the OCI image ID is the content digest of the config object, containerd reported the same image ID for all Spin applications present on a node.
The fix is to ensure the uniqueness of the OCI image config object for unique content, by adding a reference to the digest of the locked application file to the OCI config object: https://github.com/fermyon/spin/pull/2322.
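A small sketch of why the IDs collided (the config JSON below is hypothetical, not Spin's exact output): the OCI image ID is the SHA-256 digest of the image config, so byte-identical configs always produce the same ID, regardless of layer content:

```shell
# Two different apps that emit the same minimal config (hypothetical values):
cfg_a='{"architecture":"wasm32","os":"wasip1"}'
cfg_b='{"architecture":"wasm32","os":"wasip1"}'
id_a=$(printf '%s' "$cfg_a" | sha256sum | cut -d' ' -f1)
id_b=$(printf '%s' "$cfg_b" | sha256sum | cut -d' ' -f1)
# Identical configs -> identical image IDs:
[ "$id_a" = "$id_b" ] && echo "collision"

# The fix adds per-application data (the locked application file's digest,
# shown here as a placeholder field) to the config, making the digest unique:
cfg_fixed='{"architecture":"wasm32","os":"wasip1","lockedAppDigest":"sha256:placeholder"}'
id_fixed=$(printf '%s' "$cfg_fixed" | sha256sum | cut -d' ' -f1)
[ "$id_a" != "$id_fixed" ] && echo "unique"
```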
With this fix in place, new Spin applications have different image IDs reported by containerd:
/ # crictl images list
IMAGE                                     TAG          IMAGE ID        SIZE
docker.io/rajatjindal/http-rust-cooler   v1           aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-again    willitwork   aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  f            aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  w            aabedf0b09494   2.34MB
# new images have new image IDs
ttl.sh/we-really-fixed-it                 12h          0e91dc3974ef9   2.21MB
ttl.sh/we-really-fixed-it                 24h          19740bab7ca70   2.21MB
The full fix requires a patch release of Spin (based on the Spin PR linked above), and no changes in the operator or the shim.
Huge thanks to @rajatjindal, @michelleN, @endocrimes, and @jpflueger for helping track this down.
Closing this because it's fixed in https://github.com/fermyon/spin/pull/2322 and has landed on Spin main.
Changing the image of a SpinApp CR does not always work. For an unknown reason, sometimes when you change the image it does not update. Even if you delete and re-apply the SpinApp, it still does not use the new image.
@vdice and @ThorstenHans have observed this issue. I'll leave it to them to share more details in a comment below. My understanding is that @ThorstenHans has a reproduction of this issue. Also please feel free to edit this issue to be more accurate.