k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Upgrading from 1.21 to 1.22 with a multi-server cluster causes crash on node that is upgrading #3994

Closed · rancher-max closed 3 years ago

rancher-max commented 3 years ago

I had a cluster using an external SQL DB with 2 servers and 1 agent, all on v1.21.1+k3s1. I performed a manual (curl) upgrade to master commit 90960ebf4e7a08df075526f411e3afe06731b01e on one server. This put k3s into a crash loop, with the logs showing:

Sep 10 19:40:44 ip-172-31-9-91 k3s[8526]: I0910 19:40:44.285820    8526 shared_informer.go:240] Waiting for caches to sync for stateful set
Sep 10 19:40:44 ip-172-31-9-91 k3s[8526]: W0910 19:40:44.359932    8526 garbagecollector.go:703] failed to discover some groups: map[admissionregistration.k8s.io/v1beta1:the server could not find the requested resource apiextensions.k8s.io/v1beta1:the server could not find the requested resource authentication.k8s.io/v1beta1:the server could not find the requested resource authorization.k8s.io/v1beta1:the server could not find the requested resource certificates.k8s.io/v1beta1:the server could not find the requested resource coordination.k8s.io/v1beta1:the server could not find the requested resource extensions/v1beta1:the server could not find the requested resource networking.k8s.io/v1beta1:the server could not find the requested resource rbac.authorization.k8s.io/v1beta1:the server could not find the requested resource scheduling.k8s.io/v1beta1:the server could not find the requested resource]
Sep 10 19:40:44 ip-172-31-9-91 k3s[8526]: I0910 19:40:44.386319    8526 node_ipam_controller.go:91] Sending events to api server.
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: I0910 19:40:45.255834    8526 leaderelection.go:283] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: F0910 19:40:45.255874    8526 controllermanager.go:234] leaderelection lost
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: goroutine 13459 [running]:
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.stacks(0xc000122001, 0xc0098cb8c0, 0x4c, 0xa4)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).output(0x80c4260, 0xc000000003, 0x0, 0x0, 0xc0013a2af0, 0x0, 0x696c4e0, 0x14, 0xea, 0x0)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).printf(0x80c4260, 0x3, 0x0, 0x0, 0x0, 0x0, 0x4ee47b5, 0x13, 0x0, 0x0, ...)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.Fatalf(...)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:1514
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app.Run.func3()
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app/controllermanager.go:234 +0x8f
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc00f522d80)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:203 +0x29
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00f522d80, 0x58d7450, 0xc0170fe540)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x167
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x58d7450, 0xc00011e018, 0x58f6ff0, 0xc01231e780, 0x37e11d600, 0x2540be400, 0x77359400, 0xc0141f0280, 0x526d0a8, 0x0, ...)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x9f
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app.leaderElectAndRun(0xc00cbf42e0, 0xc003850280, 0x33, 0xc01400dd70, 0x4eadde9, 0x6, 0x4efd546, 0x18, 0xc0141f0280, 0x526d0a8, ...)
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app/controllermanager.go:465 +0x345
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: created by github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app.Run
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]:         /go/src/github.com/rancher/k3s/vendor/k8s.io/cloud-provider/app/controllermanager.go:217 +0x8cb
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: goroutine 1 [chan receive]:
Sep 10 19:40:45 ip-172-31-9-91 k3s[8526]: github.com/rancher/k3s/pkg/agent.run(0x58d7418, 0xc0012c2d80, 0xc009249a40, 0x4e, 0x0, 0x0, 0x0, 0x0, 0xc0077afba8, 0x16, ...)

After upgrading the second server, the crash went away and both servers were running successfully. https://github.com/k3s-io/k3s/pull/3993
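For reference, the failed group discovery in the log above lists the beta API groups that Kubernetes 1.22 no longer serves. An illustrative way to confirm which beta API versions the upgraded server still serves:

$ kubectl api-versions | grep v1beta1
# On a 1.22 server, most of the v1beta1 groups named in the log (apiextensions,
# admissionregistration, rbac.authorization, and so on) no longer appear.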

rancher-max commented 3 years ago

In this case I no longer see panics in the k3s logs, but I do see CrashLoopBackOffs on the traefik helm-install pods:

NAMESPACE     NAME                                         READY   STATUS             RESTARTS       AGE     IP          NODE               NOMINATED NODE   READINESS GATES
kube-system   pod/svclb-traefik-xls49                      2/2     Running            0              21m     10.42.0.8   ip-172-31-13-221   <none>           <none>
kube-system   pod/svclb-traefik-c97vg                      2/2     Running            0              20m     10.42.1.2   ip-172-31-1-73     <none>           <none>
kube-system   pod/svclb-traefik-4plfr                      2/2     Running            0              19m     10.42.2.2   ip-172-31-9-139    <none>           <none>
kube-system   pod/local-path-provisioner-64ffb68fd-z8pl6   1/1     Running            0              9m25s   10.42.2.3   ip-172-31-9-139    <none>           <none>
kube-system   pod/metrics-server-9cf544f65-fbs5v           1/1     Running            0              9m23s   10.42.2.4   ip-172-31-9-139    <none>           <none>
kube-system   pod/coredns-85cb69466-7cx2s                  1/1     Running            0              9m25s   10.42.1.3   ip-172-31-1-73     <none>           <none>
kube-system   pod/helm-install-traefik-crd-pz67q           0/1     CrashLoopBackOff   2 (9m1s ago)   9m22s   10.42.1.4   ip-172-31-1-73     <none>           <none>
kube-system   pod/traefik-97b44b794-wqqtg                  1/1     Running            0              21m     10.42.0.7   ip-172-31-13-221   <none>           <none>
kube-system   pod/helm-install-traefik-sptc5               0/1     CrashLoopBackOff   9 (23s ago)    9m22s   10.42.2.5   ip-172-31-9-139    <none>           <none>

Logs for those pods show (second one can be ignored since it's just waiting for the first one to complete):

$ k logs -n kube-system pod/helm-install-traefik-crd-pz67q
...
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

$ k logs -n kube-system pod/helm-install-traefik-sptc5
...
+ helm_v3 upgrade --set global.systemDefaultRegistry=0 traefik https://10.43.0.1:443/static/charts/traefik-10.3.0.tgz --values /config/values-01_HelmChart.yaml
Error: UPGRADE FAILED: execution error at (traefik/templates/validate-install-crd.yaml:19:7): Required CRDs are missing. Please install the traefik-crd chart before installing this chart.

I can't figure out a way to work around this. Restarting the k3s services does not solve it.

brandond commented 3 years ago

Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

That's a fun one... I might have to phone a friend on how to handle that. I am guessing we probably SHOULD have upgraded the helm chart to the new non-beta CRDs much earlier, so that we're not stuck trying to upgrade from an old API version that's no longer served.

brandond commented 3 years ago

I reached out to @mattfarina and he suggested we look at https://github.com/helm/helm-mapkubeapis, which sounds like it's meant to deal with exactly the situation we're in here:

The Helm documentation describes the problem that occurs when Helm releases have already been deployed with APIs that are no longer supported. If the Kubernetes cluster (containing such releases) is updated to a version where those APIs are removed, then Helm becomes unable to manage such releases anymore. It does not matter whether the chart being passed in the upgrade contains the supported API versions or not.

This is what the mapkubeapis plugin resolves. It fixes the issue by mapping releases which contain deprecated or removed Kubernetes APIs to supported APIs. This is performed inline in the release metadata where the existing release is superseded and a new release (metadata only) is added. The deployed Kubernetes resources are updated automatically by Kubernetes during upgrade of its version. Once this operation is completed, you can then upgrade using the chart with supported APIs.
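For a release stuck like this, the plugin can also be run by hand. A minimal sketch, assuming the plugin is installed from its repository (klipper-helm later bundles it, as shown in the next comment) and reusing the chart URL from the failing job:

$ helm plugin install https://github.com/helm/helm-mapkubeapis
# Rewrite the stored release metadata from removed API versions to supported ones.
$ helm mapkubeapis traefik-crd --namespace kube-system
# With the metadata fixed, the original upgrade should go through.
$ helm upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz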

brandond commented 3 years ago

It looks like the mapkubeapis plugin handles this, see https://github.com/k3s-io/klipper-helm/pull/33

+ helm_update install
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^traefik-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=,deployed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ '' =~ ^(|null)$ ]]
+ [[ deployed =~ ^(|null)$ ]]
+ [[ deployed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ deployed == \d\e\p\l\o\y\e\d ]]
+ echo 'Already installed traefik-crd'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
Already installed traefik-crd
+ helm_v3 mapkubeapis traefik-crd --namespace kube-system
2021/09/14 21:51:19 Release 'traefik-crd' will be checked for deprecated or removed Kubernetes APIs and will be updated if necessary to supported API versions.
2021/09/14 21:51:19 Get release 'traefik-crd' latest version.
2021/09/14 21:51:19 Check release 'traefik-crd' for deprecated or removed APIs...
2021/09/14 21:51:19 Found deprecated or removed Kubernetes API:
"apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition"
Supported API equivalent:
"apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition"
2021/09/14 21:51:19 Finished checking release 'traefik-crd' for deprecated or removed APIs.
2021/09/14 21:51:19 Deprecated or removed APIs exist, updating release: traefik-crd.
2021/09/14 21:51:19 Set status of release version 'traefik-crd.v1' to 'superseded'.
2021/09/14 21:51:19 Release version 'traefik-crd.v1' updated successfully.
2021/09/14 21:51:19 Add release version 'traefik-crd.v2' with updated supported APIs.
2021/09/14 21:51:19 Release version 'traefik-crd.v2' added successfully.
2021/09/14 21:51:19 Release 'traefik-crd' with deprecated or removed APIs updated successfully to new version.
2021/09/14 21:51:19 Map of release 'traefik-crd' deprecated or removed APIs to supported versions, completed successfully.
Upgrading traefik-crd
+ echo 'Upgrading traefik-crd'
+ shift 1
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Release "traefik-crd" has been upgraded. Happy Helming!
NAME: traefik-crd
LAST DEPLOYED: Tue Sep 14 21:51:19 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 3
TEST SUITE: None
+ exit

rancher-max commented 3 years ago

This is still failing on master branch commit eda65b19d9893a033c681ab5b0045d7a0e29d6cb with the same error: Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1". Confirmed it is using the bumped image rancher/klipper-helm:v0.6.5-build20210915.

brandond commented 3 years ago

@rancher-max can you provide the full pod log so we can see whether or not it's running the plugin to migrate the old API versions?

rancher-max commented 3 years ago

$ k logs -n kube-system pod/helm-install-traefik-crd-67p74
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' '' '!=' true ']'
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
+ tiller --listen=127.0.0.1:44134 --storage=secret
[main] 2021/09/16 19:12:49 Starting Tiller v2.17.0 (tls=false)
[main] 2021/09/16 19:12:49 GRPC listening on 127.0.0.1:44134
[main] 2021/09/16 19:12:49 Probes listening on :44135
[main] 2021/09/16 19:12:49 Storage driver is Secret
[main] 2021/09/16 19:12:49 Max history per release is 0
Creating /root/.helm 
Creating /root/.helm/repository 
Creating /root/.helm/repository/cache 
Creating /root/.helm/repository/local 
Creating /root/.helm/plugins 
Creating /root/.helm/starters 
Creating /root/.helm/cache/archive 
Creating /root/.helm/repository/repositories.yaml 
Adding stable repo with URL: https://charts.helm.sh/stable/ 
Adding local repo with URL: http://127.0.0.1:8879/charts 
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ jq -r '.Releases | length'
++ helm_v2 ls --all '^traefik-crd$' --output json
[storage] 2021/09/16 19:12:49 listing all releases with filter
+ EXIST=
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/traefik-crd.tgz.base64
+ CHART_PATH=/traefik-crd.tgz
+ '[' '!' -f /chart/traefik-crd.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
+ echo 'chart path is a url, skipping repo update'
chart path is a url, skipping repo update
+ helm_v3 repo remove stable
Error: no repositories configured
+ true
+ return
+ helm_update install
+ '[' helm_v3 == helm_v3 ']'
++ helm_v3 ls -f '^traefik-crd$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=,deployed
++ echo ,deployed
++ cut -f1 -d,
+ INSTALLED_VERSION=
++ echo ,deployed
++ cut -f2 -d,
+ STATUS=deployed
+ VALUES=
+ '[' install = delete ']'
+ '[' -z '' ']'
+ '[' -z deployed ']'
+ '[' deployed = deployed ']'
+ echo Already installed traefik-crd, upgrading
+ shift 1
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Already installed traefik-crd, upgrading
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

brandond commented 3 years ago

OK, so that is running with the old image that doesn't have the plugin to fix the API versions. I am guessing that the old server is currently the leader for the helm-controller, so it's creating jobs with the unfixed image. This should resolve itself after both nodes are upgraded. If it does not, then we have a problem.
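An illustrative way to check which node currently holds the controller leases (the exact lease name used by the embedded helm-controller may vary by version):

$ kubectl get lease -n kube-system \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
# The HOLDER column shows which node each controller is currently running on.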

rancher-max commented 3 years ago

Yeah I've upgraded all nodes and it's still having the problem:

$ kubectl get nodes,pods -A -o wide
NAME                    STATUS   ROLES                  AGE   VERSION                INTERNAL-IP     EXTERNAL-IP      OS-IMAGE           KERNEL-VERSION   CONTAINER-RUNTIME
node/ip-172-31-2-200    Ready    control-plane,master   90m   v1.22.1+k3s-eda65b19   172.31.2.200    <redacted>       Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.5.5-k3s1
node/ip-172-31-15-146   Ready    <none>                 86m   v1.22.1+k3s-eda65b19   172.31.15.146   <redacted>       Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.5.5-k3s1
node/ip-172-31-11-75    Ready    control-plane,master   87m   v1.22.1+k3s-eda65b19   172.31.11.75    <redacted>       Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.5.5-k3s1

NAMESPACE     NAME                                         READY   STATUS             RESTARTS       AGE   IP          NODE               NOMINATED NODE   READINESS GATES
kube-system   pod/svclb-traefik-m2qj9                      2/2     Running            0              89m   10.42.0.8   ip-172-31-2-200    <none>           <none>
kube-system   pod/svclb-traefik-jkckx                      2/2     Running            0              87m   10.42.1.2   ip-172-31-11-75    <none>           <none>
kube-system   pod/svclb-traefik-qvhw8                      2/2     Running            0              86m   10.42.2.2   ip-172-31-15-146   <none>           <none>
kube-system   pod/local-path-provisioner-64ffb68fd-cfw6q   1/1     Running            0              36m   10.42.2.3   ip-172-31-15-146   <none>           <none>
kube-system   pod/metrics-server-9cf544f65-mrf8q           1/1     Running            0              36m   10.42.1.4   ip-172-31-11-75    <none>           <none>
kube-system   pod/coredns-85cb69466-sdw4v                  1/1     Running            0              36m   10.42.1.3   ip-172-31-11-75    <none>           <none>
kube-system   pod/traefik-97b44b794-8t4q2                  1/1     Running            0              89m   10.42.0.7   ip-172-31-2-200    <none>           <none>
kube-system   pod/helm-install-traefik-krx6t               0/1     CrashLoopBackOff   16 (34s ago)   36m   10.42.2.5   ip-172-31-15-146   <none>           <none>
kube-system   pod/helm-install-traefik-crd-67p74           0/1     CrashLoopBackOff   16 (28s ago)   36m   10.42.2.4   ip-172-31-15-146   <none>           <none>

rancher-max commented 3 years ago

It looks like it has the new image from the job though?

NAME                                 COMPLETIONS   DURATION   AGE   CONTAINERS   IMAGES                                      SELECTOR
job.batch/helm-install-traefik-crd   0/1                      36m   helm         rancher/klipper-helm:v0.6.5-build20210915   controller-uid=03005947-3975-40b3-8c56-2188fb059562
job.batch/helm-install-traefik       0/1                      36m   helm         rancher/klipper-helm:v0.6.5-build20210915   controller-uid=432886c0-16a4-4422-b484-4ab4e3e09598

brandond commented 3 years ago

It does look like that, but the message you're getting no longer exists in that version of the image: + echo Already installed traefik-crd, upgrading

https://github.com/k3s-io/klipper-helm/commit/94187599274003492af29c4d6f3fde19d90c900e

Can you inspect the Pods, not the Job? I wonder if there's something going on with the job controller where it's not picking up the image change.
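For example (illustrative commands), comparing the image in the Job's pod template with the image on the existing Pod should show the mismatch:

# Image in the Job's pod template (expected to be the bumped klipper-helm tag).
$ kubectl get job -n kube-system helm-install-traefik-crd \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
# Image on the Pod that is actually crash-looping.
$ kubectl get pod -n kube-system helm-install-traefik-crd-67p74 \
    -o jsonpath='{.spec.containers[0].image}{"\n"}'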

rancher-max commented 3 years ago

Interesting, yeah, it looks like it's stuck with its old image:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    helmcharts.helm.cattle.io/configHash: SHA256=E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
  creationTimestamp: "2021-09-16T19:07:27Z"
  generateName: helm-install-traefik-crd-
  labels:
    controller-uid: 277897c9-6ee3-4a6e-ad92-6f9b8b9daa7b
    helmcharts.helm.cattle.io/chart: traefik-crd
    job-name: helm-install-traefik-crd
  name: helm-install-traefik-crd-67p74
  namespace: kube-system
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: helm-install-traefik-crd
    uid: 277897c9-6ee3-4a6e-ad92-6f9b8b9daa7b
  resourceVersion: "22042"
  uid: 8c1e0f4d-c48f-4869-a040-24c072c8f484
spec:
  containers:
  - args:
    - install
    env:
    - name: NAME
      value: traefik-crd
    - name: VERSION
    - name: REPO
    - name: HELM_DRIVER
      value: secret
    - name: CHART_NAMESPACE
      value: kube-system
    - name: CHART
      value: https://%{KUBERNETES_API}%/static/charts/traefik-crd-10.3.0.tgz
    - name: HELM_VERSION
    - name: TARGET_NAMESPACE
      value: kube-system
    - name: NO_PROXY
      value: .svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
    image: rancher/klipper-helm:v0.5.0-build20210505
    imagePullPolicy: IfNotPresent
    name: helm
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /config
      name: values
    - mountPath: /chart
      name: content
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b7c7v
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-172-31-15-146
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: helm-traefik-crd
  serviceAccountName: helm-traefik-crd
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: chart-values-traefik-crd
    name: values
  - configMap:
      defaultMode: 420
      name: chart-content-traefik-crd
    name: content
  - name: kube-api-access-b7c7v
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-16T19:07:27Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-16T20:19:43Z"
    message: 'containers with unready status: [helm]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-16T20:19:43Z"
    message: 'containers with unready status: [helm]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-16T19:07:27Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://07e716aaeaa141994fcc982bcac4f105558e8545db1042588d8edc9859decf7a
    image: docker.io/rancher/klipper-helm:v0.5.0-build20210505
    imageID: docker.io/rancher/klipper-helm@sha256:ce86a3b3e258992779856d98ba3f6b7235cde99fcd28d32e328a5549ce29c702
    lastState:
      terminated:
        containerID: containerd://07e716aaeaa141994fcc982bcac4f105558e8545db1042588d8edc9859decf7a
        exitCode: 1
        finishedAt: "2021-09-16T20:19:43Z"
        reason: Error
        startedAt: "2021-09-16T20:19:42Z"
    name: helm
    ready: false
    restartCount: 23
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=helm pod=helm-install-traefik-crd-67p74_kube-system(8c1e0f4d-c48f-4869-a040-24c072c8f484)
        reason: CrashLoopBackOff
  hostIP: 172.31.15.146
  phase: Running
  podIP: 10.42.2.4
  podIPs:
  - ip: 10.42.2.4
  qosClass: BestEffort
  startTime: "2021-09-16T19:07:27Z"

brandond commented 3 years ago

There's definitely something going on with the job: the controller-uid label on the pod doesn't match that of the job.
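An illustrative way to see the mismatch:

# controller-uid in the Job's selector vs. the label on the stale Pod; when
# these differ, the Job controller neither adopts nor replaces the Pod.
$ kubectl get job -n kube-system helm-install-traefik-crd -o yaml | grep controller-uid
$ kubectl get pod -n kube-system helm-install-traefik-crd-67p74 --show-labels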

brandond commented 3 years ago

Something bad is happening with the Job controller on 1.22; even after deleting the CrashLoopBackOff pods it isn't creating new ones to replace them. This is probably an upstream issue.

There is a potential workaround, though; let me try something.
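A plausible workaround of that kind (an assumption here, not necessarily what was tried) is to delete the stale Jobs so the helm-controller recreates them with the new image:

# Assumes the embedded helm-controller recreates the Jobs on its next reconcile;
# adjust the names if the chart jobs are named differently in your cluster.
$ kubectl delete job -n kube-system helm-install-traefik-crd helm-install-traefik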

rancher-max commented 3 years ago

This issue was specifically for the problems described in the comments above. This is resolved on v1.22.2-rc1+k3s1, and any additional issues that turn up will be tracked as new issues.