kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Unable to create cluster using amazon-2 ami #4434

Open MaxFedotov opened 1 year ago

MaxFedotov commented 1 year ago

/kind bug

What steps did you take and what happened: Create a cluster using the capa-ami-amazon-2-v1.25.12 image. The control-plane node fails to start, and the kubelet logs on the control-plane node show the following errors:

Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104552    4449 remote_runtime.go:222] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104604    4449 kuberuntime_sandbox.go:71] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104631    4449 kuberuntime_manager.go:772] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal"
Aug 01 14:16:50 ip-10-189-0-251.eu-central-1.compute.internal kubelet[4449]: E0801 14:16:50.104708    4449 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"etcd-ip-10-189-0-251.eu-central-1.compute.internal_kube-system(a812507b09a2bba6c5690db77f322d9f)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"etcd-ip-10-189-0-251.eu-central-1.compute.internal_kube-system(a812507b09a2bba6c5690db77f322d9f)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/381d9c64be09430dd45c3ea4d33c6d7473d0704881ec3fc293d7e69fec81ac57/log.json: no such file or directory): runc did not terminate successfully: exit status 127: unknown\"" pod="kube-system/etcd-ip-10-189-0-251.eu-central-1.compute.internal" podUID=a812507b09a2bba6c5690db77f322d9f

Running the runc binary directly returns the following error:

[root@ip-10-189-0-251 ~]# /usr/local/sbin/runc --help
/usr/local/sbin/runc: symbol lookup error: /usr/local/sbin/runc: undefined symbol: seccomp_notify_respond

This happens because CAPA uses the cri-containerd-*.tar.gz archive to install containerd and runc. According to the containerd release notes (https://github.com/containerd/containerd/blob/40f26543bdc27cbe8b058ac082e91c5832bb1c41/releases/v1.6.0.toml#L64-L76), the runc binary included in the containerd distribution is built with dynamic linking to libseccomp.
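A quick way to confirm the dynamic dependency on a node is to inspect the binary with ldd. The following is a minimal sketch, not part of CAPA or image-builder; the helper name links_libseccomp is hypothetical, and it assumes ldd and grep are available on the node.

```shell
# links_libseccomp BINARY
# Succeeds (exit 0) when ldd reports libseccomp among the shared libraries
# that BINARY depends on. Fails for statically linked binaries, for files
# that do not exist, and when ldd itself is unavailable.
links_libseccomp() {
  ldd "$1" 2>/dev/null | grep -q 'libseccomp'
}
```

For example, `links_libseccomp /usr/local/sbin/runc && echo "runc depends on the system libseccomp"` prints the message only when the dependency is present. Note that ldd shows only that the library is required; the missing seccomp_notify_respond symbol surfaces at run time, as in the error above.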

CAPA is using the following version of containerd:

[root@ip-10-189-0-251 ~]# /usr/local/bin/containerd --version
containerd github.com/containerd/containerd v1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8

which, according to its release notes, includes runc v1.1.7.

runc v1.1.7 is linked against libseccomp-2.5.4, but the installed version is:

[root@ip-10-189-0-251 ~]# yum list installed | grep libsec
libseccomp.x86_64                     2.4.1-1.amzn2                  installed

which is the newest libseccomp version available in the epel7 repo.
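The mismatch described above can be checked with a small version comparison. This is a sketch under stated assumptions: the helper names version_ge and check_libseccomp are hypothetical, the required version 2.5.4 is taken from this report, and `sort -V` (GNU coreutils) is assumed to be available.

```shell
# version_ge A B
# Succeeds when version string A is greater than or equal to B.
# Relies on `sort -V` for version ordering: if B sorts first (or the two
# are equal), then A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_libseccomp REQUIRED
# Queries the installed libseccomp package via rpm and compares it with
# REQUIRED (for example 2.5.4, the version named in this issue).
check_libseccomp() {
  installed="$(rpm -q --queryformat '%{VERSION}' libseccomp)" || return 1
  if version_ge "$installed" "$1"; then
    echo "libseccomp $installed satisfies >= $1"
  else
    echo "libseccomp $installed is older than $1"
    return 1
  fi
}
```

Calling `check_libseccomp 2.5.4` on an affected node is expected to take the second branch, since the installed package is 2.4.1.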

What did you expect to happen: Users should be able to create a cluster using Amazon Linux 2 images.

Anything else you would like to add: I was able to fix this issue in my image-builder fork by adding Ansible steps that download a statically linked runc from https://github.com/opencontainers/runc/releases and use it to replace the runc installed by the cri-containerd-*.tar.gz archive. I can open a pull request in the image-builder repo with the fix if you are OK with this approach.
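The workaround can be sketched in shell as follows. This is an illustration only, not the Ansible steps from the fork mentioned above. The function names, the default version v1.1.7 (chosen to match the runc version named in this issue), and the asset naming runc.<arch> on the runc release page are all assumptions; curl and install are assumed to be present.

```shell
# Defaults are assumptions for illustration; override via the environment.
RUNC_VERSION="${RUNC_VERSION:-v1.1.7}"
RUNC_ARCH="${RUNC_ARCH:-amd64}"
RUNC_DEST="${RUNC_DEST:-/usr/local/sbin/runc}"

# runc_release_url VERSION ARCH
# Builds the download URL for a runc release binary, assuming assets are
# published as runc.<arch> under the given release tag.
runc_release_url() {
  printf 'https://github.com/opencontainers/runc/releases/download/%s/runc.%s\n' "$1" "$2"
}

# replace_runc
# Downloads the release binary to a temporary file, then installs it over
# the runc that came from the cri-containerd archive. Needs root.
replace_runc() {
  tmp="$(mktemp)" || return 1
  if ! curl -fsSL -o "$tmp" "$(runc_release_url "$RUNC_VERSION" "$RUNC_ARCH")"; then
    rm -f "$tmp"
    return 1
  fi
  install -m 0755 "$tmp" "$RUNC_DEST"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Running `replace_runc` followed by `/usr/local/sbin/runc --version` would show whether the replacement binary starts. In an image build it is advisable to also verify the download against the checksums published with the release.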

Environment:

Skarlso commented 1 year ago

/triage accepted

Ankitasw commented 1 year ago

@MaxFedotov were you able to open the PR to fix this in image builder?

MaxFedotov commented 1 year ago

@Ankitasw yes, I will do it next week.

Ankitasw commented 1 year ago

Thank you @MaxFedotov 🙂

k8s-triage-robot commented 2 weeks ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted