spinkube / spin-operator

Spin Operator is a Kubernetes operator that empowers platform engineers to deploy Spin applications as custom resources to their Kubernetes clusters
https://www.spinkube.dev/docs/overview/

SpinApps don't work with images from ghcr.io #22

Closed macolso closed 7 months ago

macolso commented 8 months ago

Apply the following SpinApp to the cluster

apiVersion: spinoperator.dev/v1
kind: SpinApp
metadata:
  name: simple-spinapp
spec:
  image: "ghcr.io/fermyon/spin-operator/hello-world:latest"
  replicas: 1

Note: You'll have to make the image public in our packages repo, or publish your own image to your own account, e.g.:

spin build
spin registry push ghcr.io/username/app-name:latest

Observe that the SpinApp fails with ErrImagePull. The events in the pod:

Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  4m13s                  default-scheduler  Successfully assigned default/simple-spinapp-7b7d69df56-c2xnn to k3d-wasm-cluster-server-0
Normal   Pulling    2m41s (x4 over 4m14s)  kubelet            Pulling image "ghcr.io/fermyon/spin-operator/hello-world:latest"
Warning  Failed     2m41s (x4 over 4m13s)  kubelet            Failed to pull image "ghcr.io/fermyon/spin-operator/hello-world:latest": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/fermyon/spin-operator/hello-world:latest": failed to unpack image on snapshotter overlayfs: mismatched image rootfs and manifest layers
Warning  Failed     2m41s (x4 over 4m13s)  kubelet            Error: ErrImagePull
Warning  Failed     2m28s (x6 over 4m12s)  kubelet            Error: ImagePullBackOff
Normal   BackOff    2m15s (x7 over 4m12s)  kubelet            Back-off pulling image "ghcr.io/fermyon/spin-operator/hello-world:latest"

macolso commented 8 months ago

Previous Discussion

@lann

Possibly related (same words!): https://github.com/deislabs/containerd-wasm-shims/issues/191

@endocrimes

spin registry push being incompatible with older containerd releases does appear to be the main culprit - I guess for now we should adopt an internal policy of building release artifacts primarily with docker? - and switch to spin registry push when newer containerd is more prevalent.

@lann

We depend pretty heavily on some docker-incompatible (I assume?) features at this point. @vdice

@vdice

There seem to be multiple things going on...

  1. spin registry push packages a Spin app into its locked app config, 1 or more wasm layers* (per wasm module/component) and 0 or more data layers** for any static assets... so the agent pulling/loading the app would need to minimally handle wasm layers to run the simplest Spin apps. But I haven't tried directly pushing w/ Docker and running in containerd -- maybe it does just work in some capacity?

  2. I didn't think deploying straight from a spin registry push'd Spin app even worked with the usual k3d image we use (ghcr.io/deislabs/containerd-wasm-shims/examples/k3d:v0.10.0) as that uses a k3s base img w/ a containerd version less than what James mentions in https://github.com/deislabs/containerd-wasm-shims/issues/191.*** But then @calebschoepp mentioned he had success -- albeit with the ttl.sh registry? (I'd be perplexed if chosen registry somehow had a part in this...)

  3. Last I checked, the current containerd shim engine doesn't yet handle the data or archive layers that we (Spin's OCI client) use, or inlined data. I think we'd need to add support there if we want all types of Spin apps to run with the shim. Or maybe I am mistaken -- have we deployed more complex apps yet w/ the Spin Operator+shim combo? E.g. a static site or some such?

* currently application/vnd.wasm.content.layer.v1+wasm; waiting for upstream to define canonical value

** or it may push archive layers if the total # of layers would exceed a max (500) and/or it may also write small content in-line into the config layer. These are both special cases that would need support in runtime engines.

*** we've since bumped the k3s image in the shim, but that hasn't been included in a release yet... and I must admit I still haven't figured out how to build/produce the image locally 😂

I'll do some testing so I'm more equipped to compare notes...
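A quick way to tell whether an app falls into the "wasm layers only" case described above is to count the layers with any other media type. This is a minimal sketch, not something the shim or Spin provides: the function name `count_non_wasm_layers` is hypothetical, and the only media type taken from this thread is the wasm one. It assumes the layer media types have already been extracted from the image manifest, one per line.

```shell
# Hypothetical helper: read layer media types (one per line) on stdin and
# print how many are NOT the wasm layer type quoted in this thread.
WASM_LAYER='application/vnd.wasm.content.layer.v1+wasm'

count_non_wasm_layers() {
  # -F: fixed string, -x: match the whole line, -v: invert, -c: count.
  # grep exits non-zero when the count is 0, so mask that status.
  grep -F -x -v -c "$WASM_LAYER" || true
}
```

If the manifest JSON is at hand (fetched with whichever registry tool you prefer), something like `jq -r '.layers[].mediaType' manifest.json | count_non_wasm_layers` would feed it. A result above 0 suggests the app carries layers that, per the discussion above, the shim may not load yet.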

@vdice

Some findings from testing today:

ttl.sh works?

I did reproduce getting a simple hello world app running when using the ttl.sh registry. I am still confounded on why this works... and you'll note that it hits the same error but then says 'image already present' (which is weird as it wasn't or shouldn't have been). Though it would only work sometimes.

Working:

Normal   Scheduled  6s    default-scheduler  Successfully assigned default/simple-spinapp-86457f7d84-2vzbh to k3d-wasm-cluster-agent-0
Normal   Pulling    6s    kubelet            Pulling image "ttl.sh/hello:10m"
Warning  Failed     6s    kubelet            Failed to pull image "ttl.sh/hello:10m": rpc error: code = Unknown desc = failed to pull and unpack image "ttl.sh/hello:10m": failed to unpack image on snapshotter overlayfs: mismatched image rootfs and manifest layers
Warning  Failed     6s    kubelet            Error: ErrImagePull
Normal   Pulled     5s    kubelet            Container image "ttl.sh/hello:10m" already present on machine
Normal   Created    5s    kubelet            Created container simple-spinapp
Normal   Started    5s    kubelet            Started container simple-spinapp

Not working:

Normal   Scheduled  18s                default-scheduler  Successfully assigned default/simple-spinapp-86457f7d84-7x6xg to k3d-wasm-cluster-agent-1
Normal   Pulling    18s                kubelet            Pulling image "ttl.sh/hello:10m"
Warning  Failed     15s                kubelet            Failed to pull image "ttl.sh/hello:10m": rpc error: code = Unknown desc = failed to pull and unpack image "ttl.sh/hello:10m": failed to unpack image on snapshotter overlayfs: mismatched image rootfs and manifest layers
Warning  Failed     15s                kubelet            Error: ErrImagePull
Normal   Pulled     13s (x2 over 14s)  kubelet            Container image "ttl.sh/hello:10m" already present on machine
Normal   Created    13s (x2 over 14s)  kubelet            Created container simple-spinapp
Normal   Started    13s (x2 over 14s)  kubelet            Started container simple-spinapp
Warning  BackOff    12s                kubelet            Back-off restarting failed container simple-spinapp in pod simple-spinapp-86457f7d84-7x6xg_default(f5cc12ff-55a3-4e33-a824-4c201dd70454)

Attempting to use images from ghcr.io or docker.io leads to the same "failed to unpack image on snapshotter overlayfs: mismatched image rootfs and manifest layers" error and never worked, which, as stated above, I believe is due to a containerd version < 1.7.7.

need containerd 1.7.7+

I saw behavior similar to the above with all of: k3d:v0.10.0, minikube and kind clusters (default/latest). I tried the hack/provision-minikube.sh script but as far as I can tell that doesn't bump the containerd version. The latest kind image uses 1.7.5: https://github.com/kubernetes-sigs/kind/blob/main/images/base/Dockerfile#L121 and thus I couldn't get spin registry push'd apps running there either. Following the instructions to build a custom kind image, I built one w/ containerd rev'd to 1.7.12 and brought up a cluster with this image (you can too, the img is public): kind create cluster --image vdice/kind:latest. Tada! Running the hello world sample app from a ghcr.io ref works just fine:

Normal  Scheduled  7s    default-scheduler  Successfully assigned default/simple-spinapp-5f96f88d74-rnz2s to kind-control-plane
Normal  Pulling    7s    kubelet            Pulling image "ghcr.io/vdice/hello:latest"
Normal  Pulled     6s    kubelet            Successfully pulled image "ghcr.io/vdice/hello:latest" in 844ms (844ms including waiting)
Normal  Created    6s    kubelet            Created container simple-spinapp
Normal  Started    6s    kubelet            Started container simple-spinapp
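
To compare clusters like the ones above, it helps to see which containerd version each node actually reports. A minimal sketch, assuming `kubectl` is already pointed at the cluster; the function name `node_runtimes` is hypothetical, while `.status.nodeInfo.containerRuntimeVersion` is the standard field on the Node object.

```shell
# Hypothetical helper: print "<node name><TAB><runtime version>" per node,
# e.g. a value of the form "containerd://<version>" for containerd nodes.
node_runtimes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}
```

Calling `node_runtimes` before and after switching the node image should show whether the containerd bump took effect.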

still can't run apps w/ add'l non-wasm layers

As mentioned, the shim doesn't support the other layer types that may be included in a Spin app, for instance Finicky Whiskers, with its many static assets. The image is pulled/loaded fine but the app crash loops, presumably because of the missing/unloaded data layers:

Normal   Scheduled  15m                default-scheduler  Successfully assigned default/simple-spinapp-865755598c-g72mv to kind-control-plane
Normal   Pulled     15m                kubelet            Successfully pulled image "vdice/finicky-whiskers:latest" in 9.307s (9.307s including waiting)
Normal   Pulled     15m                kubelet            Successfully pulled image "vdice/finicky-whiskers:latest" in 1.263s (1.263s including waiting)
Normal   Pulled     14m                kubelet            Successfully pulled image "vdice/finicky-whiskers:latest" in 1.526s (1.527s including waiting)
Normal   Created    14m (x4 over 15m)  kubelet            Created container simple-spinapp
Normal   Started    14m (x4 over 15m)  kubelet            Started container simple-spinapp
Normal   Pulled     14m                kubelet            Successfully pulled image "vdice/finicky-whiskers:latest" in 1.424s (1.424s including waiting)
Normal   Pulling    13m (x5 over 15m)  kubelet            Pulling image "vdice/finicky-whiskers:latest"
Warning  BackOff    3s (x68 over 15m)  kubelet            Back-off restarting failed container simple-spinapp in pod simple-spinapp-865755598c-g72mv_default(f773b310-071e-4898-a32a-0518f48c53a6)

Would definitely be curious to know if others have experiences different from mine.

contingency plans?

For Spin apps loaded directly from their spin registry push'd OCI references and using the shim:

  • A sufficient (1.7.7+ or 1.6.25+) containerd version is needed on the k8s cluster. I haven't even attempted to survey the cloud offerings but even the latest local distros (k3d, minikube, kind) don't appear to ship with sufficient versions. So how do we ensure success for users/customers with their pre-existing k8s distros?
  • For full support of all Spin apps, it appears that we need to add logic to the shim to handle add'l types of layers that Spin may include in an app's OCI reference. We'd then need a new shim release and ensure that is the version being installed on user/customer k8s clusters.
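
The version requirement in the first bullet can be written down as a small check. This is a minimal sketch, not part of the operator or the shim: the function name `containerd_ok` is hypothetical, the thresholds (1.7.7+ or 1.6.25+) come from this thread, and treating any release line newer than 1.7 as sufficient is an assumption.

```shell
# Hypothetical helper: succeed (exit 0) when a containerd version meets the
# minimum discussed in this thread (1.7.7+ or 1.6.25+), fail (exit 1) otherwise.
# Accepts plain versions ("1.7.12") or the form Kubernetes reports
# ("containerd://1.7.12-k3s1").
containerd_ok() {
  v="${1#containerd://}"          # drop the runtime prefix, if present
  major="${v%%.*}"
  rest="${v#*.}"
  minor="${rest%%.*}"
  case "$rest" in
    *.*) patch="${rest#*.}" ;;
    *)   patch=0 ;;               # "1.7" is treated as "1.7.0"
  esac
  patch="${patch%%[!0-9]*}"       # drop suffixes such as "-k3s1"
  patch="${patch:-0}"

  # Assumption: anything newer than the 1.7 line already includes the fix.
  [ "$major" -gt 1 ] && return 0
  [ "$major" -eq 1 ] || return 1
  [ "$minor" -gt 7 ] && return 0
  [ "$minor" -eq 7 ] && [ "$patch" -ge 7 ] && return 0
  [ "$minor" -eq 6 ] && [ "$patch" -ge 25 ] && return 0
  return 1
}
```

For example, `containerd_ok 1.7.5` fails (the version quoted for the latest kind image), while `containerd_ok 1.7.12` succeeds (the version in the custom kind image).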

@radu-matei

AKS on Ubuntu runs containerd 1.7.5 — https://github.com/Azure/AKS/blob/master/vhd-notes/aks-ubuntu/AKSUbuntu-2204/202401.03.0.txt#L4

@vdice

For those who would like to test a k3d image with containerd bumped to the min. required version to handle wasm layers, try ghcr.io/vdice/containerd-wasm-shims/examples/k3d:v0.10.1. (Basically just a snapshot of main of the project as of writing, including the https://github.com/deislabs/containerd-wasm-shims/pull/195).

@radu

After a quick search:

  • AKS runs 1.7.5, and will support 1.7.7+ relatively soon (https://github.com/Azure/AKS/blob/master/vhd-notes/aks-ubuntu/AKSUbuntu-2204/202401.03.0.txt#L4)
  • EKS currently runs 1.7.2 (https://github.com/awslabs/amazon-eks-ami/blob/main/CHANGELOG.md#L1157), but there is no progress on this or response from the EKS team (https://github.com/awslabs/amazon-eks-ami/issues/1526)
  • For GKE I could not find the containerd version without creating a cluster (https://cloud.google.com/kubernetes-engine/docs/concepts/using-containerd)

We need to chat with the EKS and GKE people to understand the timelines for a supported containerd version. Also, containerd 2 is coming up soon, and the feature we need is 7 patch versions ago (what we need is in 1.7.7, 1.7.13 was recently released).

The shim should work on GKE out of the box. -- https://cloud.google.com/container-optimized-os/docs/release-notes/m109#cos-109-17800-66-54_ (containerd 1.7.10)

vdice commented 8 months ago

This should be resolved by https://github.com/spinkube/spin-operator/pull/48. Perhaps @calebschoepp (as original issue creator) can confirm and close?

endocrimes commented 8 months ago

Less fixed and more "our samples probably work if you use k3d" - we need to document the containerd version reqs and include an example of what to do if you're using older containerd.

bacongobbler commented 8 months ago

I'm going to throw this under "must have" as this seems like a fairly critical piece of documentation, and it should be pretty easy to add to our prerequisites page (if it isn't there already).

calebschoepp commented 7 months ago

@vdice using the new K3d version I was able to run an app with image pointing to ghcr.io. Agreed that this is a documentation issue at this point.

I suggest that we mark this issue as closed and file a new issue to track the work of documenting the workarounds. If I don't hear any pushback over the next day or two I'll go ahead and do that.

calebschoepp commented 7 months ago

Work is now tracked in #105