Closed: calebschoepp closed this 8 months ago
Ran into this today in testing again. This should probably be high priority, though signs currently point to upstream culprit(s) (e.g. runwasi).
Some potential leads/related issues (mentioning these because the app pod being stuck in 'Terminating' is reliably seen when encountering the caching behavior): https://github.com/containerd/runwasi/issues/418 https://github.com/deislabs/containerd-wasm-shims/issues/207 (links to 418 above)
Put it as a must have in the Initial Release project. CC @macolso
Filed https://github.com/spinkube/containerd-shim-spin/issues/22 because this is a shim issue rather than an operator one.
Same here. Building with:

spin registry login -u $AZURE_CONTAINER_REGISTRY_NAME -p $AZURE_CONTAINER_REGISTRY_PASSWORD $AZURE_CONTAINER_REGISTRY_ENDPOINT
spin registry push --build $AZURE_CONTAINER_REGISTRY_ENDPOINT/spin-dapr-ts:$REVISION

Deploying with the new image tag and then checking the SpinApp:
Name:         distributor
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  core.spinoperator.dev/v1
Kind:         SpinApp
Metadata:
  Creation Timestamp:  2024-03-03T10:56:34Z
  Generation:          2
  Resource Version:    19132
  UID:                 7f8a69ec-fceb-4527-a5e1-bf4b1fa9f533
Spec:
  Checks:
  Deployment Annotations:
    scheduler.alpha.kubernetes.io/node-selector:  agentpool=wasm
  Enable Autoscaling:  true
  Executor:            containerd-shim-spin
  Image:               kw971987ecacr.azurecr.io/spin-dapr-ts:1709464104
This shows that the latest/desired image is referenced, but logging indicates that the changes are not active.
I was also able to reproduce this behavior using the latest bits. IMO, the key to reproducing the issue is the following:

1) Deploy hello-spin, here running app:0.0.1.
2) Update the image to app:0.0.2.
3) Delete hello-spin and re-deploy the app to the cluster with app:0.0.2.
4) New instances spawned will respond with whatever is defined in app:0.0.1.
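A rough sketch of those reproduction steps as commands (all image names, tags, and manifest file names are placeholders, not from the original report):

```shell
# 1) push and deploy the first version
spin registry push ttl.sh/hello-spin:0.0.1
kubectl apply -f spinapp.yaml            # SpinApp.spec.image -> hello-spin:0.0.1
# 2) push the changed app and point the SpinApp at the new tag
spin registry push ttl.sh/hello-spin:0.0.2
kubectl apply -f spinapp.yaml            # SpinApp.spec.image -> hello-spin:0.0.2
# 3) even after deleting and re-applying the SpinApp, new pods still
#    serve the 0.0.1 content
kubectl delete spinapp hello-spin && kubectl apply -f spinapp.yaml
```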
My 2 cents here: there may be two different issues in play here. The second one relates to spec.PublishNotReadyAddresses: leaving it unset will ensure traffic (via the k8s service) is not sent to the terminating pods. @endocrimes, what do you think about the second point above? Thanks.
Setting PublishNotReadyAddresses=true would be a bug - we'd send traffic to pods that aren't ready to receive it (i.e. still potentially compiling the app).
I was looking more into the caching issue, and it seems like it is not due to publishNotReadyAddresses (and we correctly don't configure it in spin-operator). What I did notice is that the image in the DeploymentSpec and PodSpec refers to the new image, but the containerStatuses field in the pod status still refers to the old image. I verified it was a new pod.
e.g. look at the pod spec from my test cluster: https://gist.github.com/rajatjindal/52ee4e796dc37d33f9b41f27dbc7baaf
I am still trying to understand why that might be happening.
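As a sketch of how to spot that mismatch: the trimmed pod JSON below is hypothetical (in a real cluster you would extract the same two fields from kubectl get pod <name> -o json):

```shell
# Hypothetical excerpt of an affected pod: the spec references the new tag,
# while containerStatuses still reports the old image.
cat > /tmp/pod.json <<'EOF'
{
  "spec":   { "containers": [ { "image": "example.azurecr.io/spin-dapr-ts:1709464104" } ] },
  "status": { "containerStatuses": [ { "image": "example.azurecr.io/spin-dapr-ts:1709400000" } ] }
}
EOF
# Pull out the two image fields (spec first, status second) and compare them.
spec_image=$(grep -o '"image": "[^"]*"' /tmp/pod.json | sed -n 1p)
status_image=$(grep -o '"image": "[^"]*"' /tmp/pod.json | sed -n 2p)
[ "$spec_image" != "$status_image" ] && echo "mismatch detected"
```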
We have managed to reproduce this consistently: once fetched by containerd, all Spin applications would have the same image ID, even if the SHA256 of the image was different (and the applications had different content):
# these images have very different Spin applications, yet they have the same image ID:
$ crictl images list
IMAGE                                     TAG          IMAGE ID        SIZE
docker.io/rajatjindal/http-rust-cooler   v1           aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-again    willitwork   aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  f            aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  w            aabedf0b09494   2.34MB
@rajatjindal tracked down this behavior in containerd, specifically where containerd resolves the local image based on the image ID: https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L163-L191. That lookup would always return all Spin applications present on the node (since they all had the same image ID), and the first of those would get picked up (https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L199). This resulted in the behavior we are seeing here: pods with a mismatch between the image in the pod spec and the actual image reported in the status field (and run by containerd).
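As a toy illustration (not containerd's actual Go code) of what that resolution does when every Spin image on the node shares one image ID; the image names below mirror the crictl output above:

```shell
# Simulated local image store: "name-and-tag image-id" pairs, all sharing one ID.
images='docker.io/rajatjindal/http-rust-cooler:v1 aabedf0b09494
docker.io/rajatjindal/spin-rust-again:willitwork aabedf0b09494
docker.io/rajatjindal/spin-rust-example:f aabedf0b09494'

# Resolve an image by ID: every entry matches, so the first match wins,
# no matter which application the pod spec actually asked for.
resolve_by_id() {
  printf '%s\n' "$images" | awk -v id="$1" '$2 == id { print $1; exit }'
}

resolve_by_id aabedf0b09494   # prints docker.io/rajatjindal/http-rust-cooler:v1
```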
The issue boils down to how spin registry push constructs the OCI reference for an application: https://github.com/fermyon/spin/blob/5b3aa0a0426d3c626bccdb3a1bfdd3c657952955/crates/oci/src/client.rs#L179-L187

Every Spin application ever pushed would generate the same OCI image config, because Spin would only set the OS and architecture. This resulted in all Spin applications having the same image ID reported by containerd: sha256:aabedf0b094944b344cecbffcfa0a9172e905a5d156bdf67c4c6cbb6d27d1d92.
This was never an issue when running spin up -f <image-reference>, because the Spin internals do not rely on the OCI image config, only on the locked application file. However, containerd has no knowledge of this, and since the OCI image ID is the content digest of the config object, containerd reported the same image ID for all Spin applications present on a node.
The fix is to ensure the uniqueness of the OCI image config object for unique content, by adding a reference to the digest of the locked application file to the OCI config object: https://github.com/fermyon/spin/pull/2322.
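A small sketch of why the IDs collided (the config JSON below is hypothetical, not Spin's exact output): the OCI image ID is the SHA-256 digest of the image config, so byte-identical configs always produce the same ID, regardless of layer content:

```shell
# Two different apps that emit the same minimal config (hypothetical values):
cfg_a='{"architecture":"wasm32","os":"wasip1"}'
cfg_b='{"architecture":"wasm32","os":"wasip1"}'
id_a=$(printf '%s' "$cfg_a" | sha256sum | cut -d' ' -f1)
id_b=$(printf '%s' "$cfg_b" | sha256sum | cut -d' ' -f1)
# Identical configs -> identical image IDs:
[ "$id_a" = "$id_b" ] && echo "collision"

# The fix adds per-application data (the locked application file's digest,
# shown here as a placeholder field) to the config, making the digest unique:
cfg_fixed='{"architecture":"wasm32","os":"wasip1","lockedAppDigest":"sha256:placeholder"}'
id_fixed=$(printf '%s' "$cfg_fixed" | sha256sum | cut -d' ' -f1)
[ "$id_a" != "$id_fixed" ] && echo "unique"
```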
With this fix in place, new Spin applications have different image IDs reported by containerd:
/ # crictl images list
IMAGE                                     TAG          IMAGE ID        SIZE
docker.io/rajatjindal/http-rust-cooler   v1           aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-again    willitwork   aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  f            aabedf0b09494   2.34MB
docker.io/rajatjindal/spin-rust-example  w            aabedf0b09494   2.34MB
# new images have new image IDs
ttl.sh/we-really-fixed-it                 12h          0e91dc3974ef9   2.21MB
ttl.sh/we-really-fixed-it                 24h          19740bab7ca70   2.21MB
The full fix requires a patch release of Spin (based on the Spin PR linked above), and no changes in the operator or the shim.
Huge thanks to @rajatjindal, @michelleN, @endocrimes, and @jpflueger for helping track this down.
Closing this because it's fixed in https://github.com/fermyon/spin/pull/2322 and has landed on Spin main.
Changing the image of a SpinApp CR does not always work. For an unknown reason, sometimes when you change the image it does not update. Even if you delete and re-apply the SpinApp, it still does not use the new image.
@vdice and @ThorstenHans have observed this issue. I'll leave it to them to share more details in a comment below. My understanding is that @ThorstenHans has a reproduction of this issue. Also please feel free to edit this issue to be more accurate.