spinkube / azure

All things Azure related in the SpinKube project
Apache License 2.0

Restart Kubernetes from Azure Portal, then SpinApp couldn't run anymore #24

Open thangchung opened 1 month ago

thangchung commented 1 month ago

I followed the guidance in the README file. It worked very well.

However, I ran into one issue: if I stop the AKS cluster and start it again, the SpinApp (deployment) stays in Pending status forever.

The logs:

104s Normal Scheduled pod/simple-spinapp-84c9b4885b-bf682 Successfully assigned default/simple-spinapp-84c9b4885b-bf682 to aks-nodepool1-18815957-vmss000001
12s Warning FailedCreatePodSandBox pod/simple-spinapp-84c9b4885b-bf682 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "spin" is configured

I tried deleting the SpinApp with:

kubectl delete -f https://raw.githubusercontent.com/spinkube/spin-operator/main/config/samples/simple.yaml

and then re-applying it with:

kubectl apply -f https://raw.githubusercontent.com/spinkube/spin-operator/main/config/samples/simple.yaml

It still did not work.

The only way I found to make it work again was to run helm delete spinkube and re-install it on the AKS cluster.

Mossaka commented 1 month ago

Interesting. Were you able to SSH into the cluster node and check whether the spin shim binary still exists in PATH, and whether containerd's config.toml still has the CRI config for the spin shim?
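
For anyone wanting to run that check, a minimal sketch follows. It assumes you already have a shell on the node, that containerd's config lives at /etc/containerd/config.toml, and that the shim binary is named containerd-shim-spin-v2. None of those details are confirmed in this thread, so adjust them for your node image.

```shell
#!/usr/bin/env bash
# Sketch only: run on the node itself. Both defaults below are assumptions.
CONFIG="${CONFIG:-/etc/containerd/config.toml}"   # assumed config location
SHIM="${SHIM:-containerd-shim-spin-v2}"            # assumed shim binary name

# Succeeds if the given config file mentions a runtime entry named "spin".
has_spin_runtime() {
  grep -q 'runtimes\.spin' "$1"
}

# Reports on both things asked about: the shim binary and the CRI config.
check_node() {
  if command -v "$SHIM" >/dev/null 2>&1; then
    echo "shim: found in PATH"
  else
    echo "shim: not in PATH"
  fi
  if has_spin_runtime "$CONFIG"; then
    echo "config: spin runtime present"
  else
    echo "config: spin runtime missing"
  fi
}
```

Calling check_node on an affected node should show which of the two pieces went missing. A shim installed outside PATH would be reported as "not in PATH" even if containerd references it by absolute path.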

vdice commented 1 month ago

I'm seeing the same behavior. Indeed, when the (new?) node(s) come back up after the AKS stop/restart, they are missing the spin shim CRI config -- thus the SpinApp pods are stuck in ContainerCreating with failed to get sandbox runtime: no runtime for "spin" is configured.

The current quick fix is to re-annotate the node(s), e.g. via kubectl annotate node --all kwasm.sh/kwasm-node=true. (You should not need to delete spinkube and re-install.) But the best resolution would be for AKS to preserve the containerd configuration through the stop/restart cycle.
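
As a rough sketch, the quick fix can be wrapped together with a look at the sandbox events from the original report. The function names are made up for illustration; the annotate command is the one quoted above, and the event filter assumes the failures are recorded with the reason FailedCreatePodSandBox, as in the logs earlier in this issue.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, the commands are plain kubectl.
ANNOTATION="kwasm.sh/kwasm-node=true"

# Re-apply the kwasm annotation to every node (same command as in the comment).
reannotate_nodes() {
  kubectl annotate node --all "$ANNOTATION"
}

# List events matching the failure reason seen in the original report.
show_sandbox_failures() {
  kubectl get events --field-selector reason=FailedCreatePodSandBox
}
```

After sourcing the script, run reannotate_nodes, then show_sandbox_failures a little later to see whether new failures are still being recorded.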

Mossaka commented 1 month ago

I will reach out to the AKS team to look into the configuration issue.

thangchung commented 1 month ago

Thanks, @Mossaka and @vdice, for acting on it. I'm waiting for https://github.com/spinkube/azure/issues/25.

ThorstenHans commented 2 weeks ago

This issue also affects Kubernetes clusters outside of Azure that have capabilities like horizontal cluster auto-scaling or scheduled node upgrades.

As an intermediate solution, I created a small DaemonSet that starts a Job to annotate the current Kubernetes node.

Although the solution isn't ideal, it ensures that new nodes are annotated with kwasm.sh/kwasm-node=true.

@Mossaka I'm happy to polish my workaround and publish it on GitHub so that others will have a solution for this.
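
The workaround itself has not been posted here, so the following is only a guess at what the per-node annotation step could look like. It assumes the pod receives its node name in a NODE_NAME environment variable (for example via the downward API) and that its service account is allowed to annotate nodes. Both are assumptions, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: not the actual workaround described above.

# Annotate the node this pod is running on.
annotate_current_node() {
  # NODE_NAME is assumed to be injected into the pod; abort if it is missing.
  local node="${NODE_NAME:?NODE_NAME must be set}"
  # --overwrite lets the call succeed even if the annotation already exists.
  kubectl annotate node "$node" kwasm.sh/kwasm-node=true --overwrite
}
```

Run from a pod scheduled on every node, a call to annotate_current_node would cover nodes added later by auto-scaling or upgrades.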