k3s-io / helm-controller

Pods in CrashLoopBackOff on K3s Rootless Installation #250

Closed nishantmunjal7 closed 2 months ago

nishantmunjal7 commented 2 months ago

We are attempting a K3s rootless installation on an airgapped system. While some pods are running as expected, others are encountering issues.

Running Pods:

Pods in CrashLoopBackOff:

Image:

rancher/klipper-helm:v0.8.4-build20240523

[Screenshot, 2024-08-29 at 2:53:38 PM]

Here are the logs:


I0828 16:24:06.157192      58 job_controller.go:566] "enqueueing job" logger="job-controller" key="kube-system/helm-install-traefik"
I0828 16:24:06.175560      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/metrics-server-557ff575fb" duration="35.309µs"
I0828 16:24:06.176535      58 controller.go:615] quota admission added evaluator for: endpoints
I0828 16:24:06.176804      58 controller.go:615] quota admission added evaluator for: endpointslices.discovery.k8s.io
I0828 16:24:06.184527      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/coredns-576bfc4dc7" duration="17.70222ms"
I0828 16:24:06.185031      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/coredns-576bfc4dc7" duration="36.802µs"
I0828 16:24:06.187788      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/local-path-provisioner-6795b5f9d8" duration="4.458572ms"
I0828 16:24:06.188088      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/local-path-provisioner-6795b5f9d8" duration="39.908µs"
I0828 16:24:06.227649      58 desired_state_of_world_populator.go:157] "Finished populating initial desired state of world"
E0828 16:24:06.449816      58 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"17184b3887fe5b2a8e412857f3de2a2e01b10cb2cc5e80cefd23efc01dfac9ae\": not found" podSandboxID="17184b3887fe5b2a8e412857f3de2a2e01b10cb2cc5e80cefd23efc01dfac9ae"
E0828 16:24:06.449880      58 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"2755fc9f4e385184a94d6f94e18f4ff5d13bc340f7579d54740187bf3a3f4961\": not found" podSandboxID="2755fc9f4e385184a94d6f94e18f4ff5d13bc340f7579d54740187bf3a3f4961"
E0828 16:24:06.780751      58 remote_runtime.go:343] "StartContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"entry\": executable file not found in $PATH: unknown" containerID="5cf7f71da9bba1dcac1bf65cc52e8324bf682bacebfb379a1ff52c6d3b3429a7"
E0828 16:24:06.780934      58 kuberuntime_manager.go:1256] container &Container{Name:helm,Image:rancher/klipper-helm:v0.8.4-build20240523,Command:[],Args:[install],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NAME,Value:traefik-crd,ValueFrom:nil,},EnvVar{Name:VERSION,Value:,ValueFrom:nil,},EnvVar{Name:REPO,Value:,ValueFrom:nil,},EnvVar{Name:HELM_DRIVER,Value:secret,ValueFrom:nil,},EnvVar{Name:CHART_NAMESPACE,Value:kube-system,ValueFrom:nil,},EnvVar{Name:CHART,Value:https://%{KUBERNETES_API}%/static/charts/traefik-crd-25.0.3+up25.0.0.tgz,ValueFrom:nil,},EnvVar{Name:HELM_VERSION,Value:,ValueFrom:nil,},EnvVar{Name:TARGET_NAMESPACE,Value:kube-system,ValueFrom:nil,},EnvVar{Name:AUTH_PASS_CREDENTIALS,Value:false,ValueFrom:nil,},EnvVar{Name:NO_PROXY,Value:.svc,.cluster.local,10.42.0.0/16,10.43.0.0/16,ValueFrom:nil,},EnvVar{Name:FAILURE_POLICY,Value:reinstall,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:klipper-helm,ReadOnly:false,MountPath:/home/klipper-helm/.helm,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:klipper-cache,ReadOnly:false,MountPath:/home/klipper-helm/.cache,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:klipper-config,ReadOnly:false,MountPath:/home/klipper-helm/.config,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:tmp,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:values,ReadOnly:false,MountPath:/config,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:content,ReadOnly:false,MountPath:/chart,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-n6h7q,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,Recurs
iveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[ALL],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod helm-install-traefik-crd-lmxt7_kube-system(5d6b2a42-6792-4587-a5b9-5195a7b1fd07): RunContainerError: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "entry": executable file not found in $PATH: unknown
E0828 16:24:06.780964      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \\\"entry\\\": executable file not found in $PATH: unknown\"" pod="kube-system/helm-install-traefik-crd-lmxt7" podUID="5d6b2a42-6792-4587-a5b9-5195a7b1fd07"
E0828 16:24:06.802402      58 remote_runtime.go:343] "StartContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"entry\": executable file not found in $PATH: unknown" containerID="c2531d0f615b9e2e7f7103f9593f5489465dc2f3e23ad89b5a21e7c733fc54e1"
E0828 16:24:06.802582      58 kuberuntime_manager.go:1256] container &Container{Name:helm,Image:rancher/klipper-helm:v0.8.4-build20240523,Command:[],Args:[install --set-string global.systemDefaultRegistry=],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NAME,Value:traefik,ValueFrom:nil,},EnvVar{Name:VERSION,Value:,ValueFrom:nil,},EnvVar{Name:REPO,Value:,ValueFrom:nil,},EnvVar{Name:HELM_DRIVER,Value:secret,ValueFrom:nil,},EnvVar{Name:CHART_NAMESPACE,Value:kube-system,ValueFrom:nil,},EnvVar{Name:CHART,Value:https://%{KUBERNETES_API}%/static/charts/traefik-25.0.3+up25.0.0.tgz,ValueFrom:nil,},EnvVar{Name:HELM_VERSION,Value:,ValueFrom:nil,},EnvVar{Name:TARGET_NAMESPACE,Value:kube-system,ValueFrom:nil,},EnvVar{Name:AUTH_PASS_CREDENTIALS,Value:false,ValueFrom:nil,},EnvVar{Name:NO_PROXY,Value:.svc,.cluster.local,10.42.0.0/16,10.43.0.0/16,ValueFrom:nil,},EnvVar{Name:FAILURE_POLICY,Value:reinstall,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:klipper-helm,ReadOnly:false,MountPath:/home/klipper-helm/.helm,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:klipper-cache,ReadOnly:false,MountPath:/home/klipper-helm/.cache,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:klipper-config,ReadOnly:false,MountPath:/home/klipper-helm/.config,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:tmp,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:values,ReadOnly:false,MountPath:/config,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:content,ReadOnly:false,MountPath:/chart,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-kvhg6,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,Mount
Propagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[ALL],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod helm-install-traefik-6ts7p_kube-system(71e1c08a-de9f-43fc-8473-494278eccecc): RunContainerError: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "entry": executable file not found in $PATH: unknown
E0828 16:24:06.802623      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with RunContainerError: \"failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \\\"entry\\\": executable file not found in $PATH: unknown\"" pod="kube-system/helm-install-traefik-6ts7p" podUID="71e1c08a-de9f-43fc-8473-494278eccecc"
I0828 16:24:07.168306      58 scope.go:117] "RemoveContainer" containerID="40ee059bb7de2d2d0cc8eaaeb8c9e55c30245b1944d504f36ff4dad03ac8cf25"
I0828 16:24:07.168622      58 scope.go:117] "RemoveContainer" containerID="5cf7f71da9bba1dcac1bf65cc52e8324bf682bacebfb379a1ff52c6d3b3429a7"
E0828 16:24:07.168878      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with CrashLoopBackOff: \"back-off 10s restarting failed container=helm pod=helm-install-traefik-crd-lmxt7_kube-system(5d6b2a42-6792-4587-a5b9-5195a7b1fd07)\"" pod="kube-system/helm-install-traefik-crd-lmxt7" podUID="5d6b2a42-6792-4587-a5b9-5195a7b1fd07"
I0828 16:24:07.177818      58 scope.go:117] "RemoveContainer" containerID="d6a9244a0b34ecd2b6e8d6214410621bab97f2d28cdcd641d3ae2ada465fc51a"
I0828 16:24:07.177980      58 job_controller.go:566] "enqueueing job" logger="job-controller" key="kube-system/helm-install-traefik-crd"
I0828 16:24:07.178136      58 scope.go:117] "RemoveContainer" containerID="c2531d0f615b9e2e7f7103f9593f5489465dc2f3e23ad89b5a21e7c733fc54e1"
E0828 16:24:07.178339      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with CrashLoopBackOff: \"back-off 10s restarting failed container=helm pod=helm-install-traefik-6ts7p_kube-system(71e1c08a-de9f-43fc-8473-494278eccecc)\"" pod="kube-system/helm-install-traefik-6ts7p" podUID="71e1c08a-de9f-43fc-8473-494278eccecc"
I0828 16:24:07.187378      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/coredns-576bfc4dc7" duration="41.669µs"
I0828 16:24:07.201065      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/metrics-server-557ff575fb" duration="71.663µs"
I0828 16:24:07.213497      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/local-path-provisioner-6795b5f9d8" duration="5.882066ms"
I0828 16:24:07.213802      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/local-path-provisioner-6795b5f9d8" duration="41.36µs"
I0828 16:24:07.218971      58 job_controller.go:566] "enqueueing job" logger="job-controller" key="kube-system/helm-install-traefik"
I0828 16:24:08.183446      58 scope.go:117] "RemoveContainer" containerID="47347725d9b4e2e866b57193690296184389000ad1c69c25742b5c7be535c787"
I0828 16:24:08.183704      58 scope.go:117] "RemoveContainer" containerID="d3dd8dd932273351f0c42cd554949e9559b3c58b60299bf3bf95945482bc2365"
E0828 16:24:08.183978      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"metrics-server\" with CrashLoopBackOff: \"back-off 10s restarting failed container=metrics-server pod=metrics-server-557ff575fb-hnbsx_kube-system(55435c35-a141-41e2-96a4-7bf8f23c4ff1)\"" pod="kube-system/metrics-server-557ff575fb-hnbsx" podUID="55435c35-a141-41e2-96a4-7bf8f23c4ff1"
I0828 16:24:08.195546      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/metrics-server-557ff575fb" duration="40.601µs"
I0828 16:24:08.351143      58 prober_manager.go:312] "Failed to trigger a manual run" probe="Readiness"
I0828 16:24:08.383832      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/coredns-576bfc4dc7" duration="6.593681ms"
I0828 16:24:08.384083      58 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="kube-system/coredns-576bfc4dc7" duration="63.174µs"
I0828 16:24:09.191758      58 scope.go:117] "RemoveContainer" containerID="d3dd8dd932273351f0c42cd554949e9559b3c58b60299bf3bf95945482bc2365"
E0828 16:24:09.192020      58 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"metrics-server\" with CrashLoopBackOff: \"back-off 10s restarting failed container=metrics-server pod=metrics-server-557ff575fb-hnbsx_kube-system(55435c35-a141-41e2-96a4-7bf8f23c4ff1)\"" pod="kube-system/metrics-server-557ff575fb-hnbsx" podUID="55435c35-a141-41e2-96a4-7bf8f23c4ff1"
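
A log this noisy is easier to read once it is reduced to one line per failing pod. The following is a minimal sketch, not part of K3s, that counts the "Error syncing pod" entries by pod and failure reason. The function name `summarize` and the regular expressions are assumptions written against the line format shown above.

```python
import re
from collections import Counter

# Matches kubelet "Error syncing pod" lines like the ones in the log above.
# Quotes inside err="..." are backslash-escaped, so the value is matched as
# a run of ordinary characters or backslash-escaped characters.
SYNC_ERROR = re.compile(
    r'"Error syncing pod, skipping" err="(?P<err>(?:[^"\\]|\\.)*)" pod="(?P<pod>[^"]+)"'
)

# The failure reason is the word right after "with", e.g. RunContainerError.
REASON = re.compile(r"with (?P<reason>\w+):")


def summarize(log_text: str) -> Counter:
    """Count (pod, reason) pairs across all 'Error syncing pod' lines."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = SYNC_ERROR.search(line)
        if match is None:
            continue
        reason = REASON.search(match.group("err"))
        key = (match.group("pod"), reason.group("reason") if reason else "unknown")
        counts[key] += 1
    return counts
```

Run against the log above, this should separate the two helm-install pods, which fail first with RunContainerError and then with CrashLoopBackOff, from metrics-server, which only shows CrashLoopBackOff in this excerpt.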
brandond commented 2 months ago

Are you sure you mirrored the images properly? What are you using as the image snapshotter? The message from containerd indicates that content (specifically the entry point executable) is missing from the images.
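
To make the "executable file not found in $PATH" message concrete: the runtime looks for the command (here `entry`) in each PATH directory inside the container's root filesystem, and fails if none of them holds an executable file with that name. The sketch below only imitates that lookup so it can be tried against an unpacked image directory. It is not runc's code, and `find_in_rootfs`, the `rootfs` argument, and the PATH value passed in are all hypothetical.

```python
import os
from typing import Optional


def find_in_rootfs(rootfs: str, command: str, path_env: str) -> Optional[str]:
    """Return the first executable file named `command` found under `rootfs`
    in one of the colon-separated PATH directories, or None if there is none.

    Simplification: symlinks are resolved by the host, so an absolute symlink
    inside the image would point outside `rootfs`.
    """
    for directory in path_env.split(":"):
        if not directory:
            continue
        # Re-root the absolute PATH entry under the unpacked filesystem.
        candidate = os.path.join(rootfs, directory.lstrip("/"), command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

A `None` result for `entry` on the affected host, where the same check succeeds on a working host, would be consistent with content missing from the unpacked image.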

nishantmunjal7 commented 2 months ago

Yes, as per the logs below. We are putting the air-gapped images tarball inside ~/.rancher/k3s/agent/images/:

I0828 16:21:51.115381      54 garbagecollector.go:157] "All resource monitors have synced. Proceeding to collect garbage" logger="garbage-collector-controller"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/klipper-helm:v0.8.4-build20240523"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/klipper-lb:v0.4.9"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/local-path-provisioner:v0.0.28"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/mirrored-coredns-coredns:1.10.1"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/mirrored-library-busybox:1.36.1"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/mirrored-library-traefik:2.10.7"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/mirrored-metrics-server:v0.7.0"
time="2024-08-28T16:21:52Z" level=info msg="Imported docker.io/rancher/mirrored-pause:3.6"
time="2024-08-28T16:21:52Z" level=info msg="Imported images from /home/atlanedit/.rancher/k3s/agent/images/k3s-airgap-images-amd64.tar.zst in 12.731775886s"
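
One quick cross-check is whether the image named in the failing container spec appears among the "Imported" lines. A rough sketch follows, assuming the log format shown above. The helper names `imported_images` and `normalize` are made up for illustration, and the reference normalization is a simplified version of the usual Docker Hub shorthand rules.

```python
import re
from typing import Set

# Matches lines like: msg="Imported docker.io/rancher/klipper-helm:v0.8.4-build20240523"
# The closing quote right after the reference excludes the final
# "Imported images from ... in 12.7s" summary line.
IMPORTED = re.compile(r'msg="Imported (?P<ref>[^\s"]+)"')


def imported_images(log_text: str) -> Set[str]:
    """Collect every image reference reported as imported."""
    return {match.group("ref") for match in IMPORTED.finditer(log_text)}


def normalize(ref: str) -> str:
    """Expand a short reference such as rancher/klipper-helm:tag to the
    docker.io/... form used in the import log (simplified rules)."""
    first = ref.split("/", 1)[0]
    has_registry = "/" in ref and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        return ref
    if "/" not in ref:
        ref = "library/" + ref
    return "docker.io/" + ref
```

For the logs in this thread, `normalize("rancher/klipper-helm:v0.8.4-build20240523")` is in the imported set, so the import step reported success. That says nothing about whether the unpacked layers are complete.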

And it's using fuse-overlayfs as the snapshotter:

k3s crictl info

"config": {
    "containerd": {
      "snapshotter": "fuse-overlayfs",
      "defaultRuntimeName": "runc",
      "defaultRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": null,
        "ContainerAnnotations": null,
        "runtimeRoot": "",
        "options": null,
        "privileged_without_host_devices": false,
        "privileged_without_host_devices_all_devices_allowed": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0,
        "snapshotter": "",
        "sandboxMode": ""
      }

Also, to add more context: this entire setup ran very well on an EC2 machine with a non-root user in an air-gapped system, but we are facing this issue with one of our deployments on a more restrictive VM environment.

brandond commented 2 months ago

OK, so what is the difference between the EC2 environment and this one? I suspect that something is going on with the host. I am not aware of anything on the K3s side that would cause content to be lost from images.

nishantmunjal7 commented 2 months ago

The restricted environment has IPv6 blocked and dnsmasq enabled, but I don't think either of these is causing the issue.

I'm trying to diagnose the issue and understand it better. The logs show: Imported docker.io/rancher/klipper-helm:v0.8.4-build20240523. Could the K3s setup be pointing to this image (klipper-helm:v0.8.4-build20240523), which might be missing content, specifically the entry point executable? If so, do you mean that something on the host could be modifying this image?

Additionally, are there other steps or checks that could help with debugging this issue?

brandond commented 2 months ago

Did you figure out what the problem was?