rancher-max closed this issue 3 years ago
In this case I no longer see panics in the k3s logs, but the traefik helm-install pods are in CrashLoopBackOff:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/svclb-traefik-xls49 2/2 Running 0 21m 10.42.0.8 ip-172-31-13-221 <none> <none>
kube-system pod/svclb-traefik-c97vg 2/2 Running 0 20m 10.42.1.2 ip-172-31-1-73 <none> <none>
kube-system pod/svclb-traefik-4plfr 2/2 Running 0 19m 10.42.2.2 ip-172-31-9-139 <none> <none>
kube-system pod/local-path-provisioner-64ffb68fd-z8pl6 1/1 Running 0 9m25s 10.42.2.3 ip-172-31-9-139 <none> <none>
kube-system pod/metrics-server-9cf544f65-fbs5v 1/1 Running 0 9m23s 10.42.2.4 ip-172-31-9-139 <none> <none>
kube-system pod/coredns-85cb69466-7cx2s 1/1 Running 0 9m25s 10.42.1.3 ip-172-31-1-73 <none> <none>
kube-system pod/helm-install-traefik-crd-pz67q 0/1 CrashLoopBackOff 2 (9m1s ago) 9m22s 10.42.1.4 ip-172-31-1-73 <none> <none>
kube-system pod/traefik-97b44b794-wqqtg 1/1 Running 0 21m 10.42.0.7 ip-172-31-13-221 <none> <none>
kube-system pod/helm-install-traefik-sptc5 0/1 CrashLoopBackOff 9 (23s ago) 9m22s 10.42.2.5 ip-172-31-9-139 <none> <none>
Logs for those pods show the following (the second can be ignored, since it is just waiting for the first to complete):
$ k logs -n kube-system pod/helm-install-traefik-crd-pz67q
...
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
$ k logs -n kube-system pod/helm-install-traefik-sptc5
...
+ helm_v3 upgrade --set global.systemDefaultRegistry=0 traefik https://10.43.0.1:443/static/charts/traefik-10.3.0.tgz --values /config/values-01_HelmChart.yaml
Error: UPGRADE FAILED: execution error at (traefik/templates/validate-install-crd.yaml:19:7): Required CRDs are missing. Please install the traefik-crd chart before installing this chart.
I can't figure out a way to work around this. Restarting the k3s services does not solve it.
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
That's a fun one... I might have to phone a friend on how to handle that. I am guessing that we probably SHOULD have upgraded the helm chart to the new non-beta CRDs much earlier so that we're not stuck trying to upgrade from an old api version that's no longer served.
I reached out to @mattfarina and he suggested we look at https://github.com/helm/helm-mapkubeapis which sounds like it's meant to deal with exactly the situation we're in here:
The Helm documentation describes the problem that occurs when Helm releases are deployed with APIs that are no longer supported. If the Kubernetes cluster (containing such releases) is updated to a version where the APIs are removed, then Helm becomes unable to manage those releases anymore. It does not matter whether the chart being passed in the upgrade contains the supported API versions or not.
This is what the mapkubeapis plugin resolves. It fixes the issue by mapping releases which contain deprecated or removed Kubernetes APIs to supported APIs. This is performed inline in the release metadata where the existing release is superseded and a new release (metadata only) is added. The deployed Kubernetes resources are updated automatically by Kubernetes during upgrade of its version. Once this operation is completed, you can then upgrade using the chart with supported APIs.
It looks like the mapkubeapis plugin handles this, see https://github.com/k3s-io/klipper-helm/pull/33
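For anyone hitting this on a cluster where the fixed klipper-helm image hasn't landed yet, the plugin can also be run by hand. This is a sketch, not the method used in the PR above; it assumes helm v3 and cluster access are available locally, with the release and namespace names taken from the logs in this thread:

```shell
# Install the mapkubeapis plugin (one-time).
helm plugin install https://github.com/helm/helm-mapkubeapis

# Dry-run first to see which deprecated/removed APIs would be rewritten.
helm mapkubeapis traefik-crd --namespace kube-system --dry-run

# Rewrite the stored release manifest to supported API versions
# (e.g. apiextensions.k8s.io/v1beta1 -> apiextensions.k8s.io/v1).
helm mapkubeapis traefik-crd --namespace kube-system
```

After this the stored release metadata references only served API versions, so a subsequent `helm upgrade` with the same chart should succeed.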
+ helm_update install
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^traefik-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=,deployed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ '' =~ ^(|null)$ ]]
+ [[ deployed =~ ^(|null)$ ]]
+ [[ deployed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ deployed == \d\e\p\l\o\y\e\d ]]
+ echo 'Already installed traefik-crd'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
Already installed traefik-crd
+ helm_v3 mapkubeapis traefik-crd --namespace kube-system
2021/09/14 21:51:19 Release 'traefik-crd' will be checked for deprecated or removed Kubernetes APIs and will be updated if necessary to supported API versions.
2021/09/14 21:51:19 Get release 'traefik-crd' latest version.
2021/09/14 21:51:19 Check release 'traefik-crd' for deprecated or removed APIs...
2021/09/14 21:51:19 Found deprecated or removed Kubernetes API:
"apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition"
Supported API equivalent:
"apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition"
2021/09/14 21:51:19 Finished checking release 'traefik-crd' for deprecated or removed APIs.
2021/09/14 21:51:19 Deprecated or removed APIs exist, updating release: traefik-crd.
2021/09/14 21:51:19 Set status of release version 'traefik-crd.v1' to 'superseded'.
2021/09/14 21:51:19 Release version 'traefik-crd.v1' updated successfully.
2021/09/14 21:51:19 Add release version 'traefik-crd.v2' with updated supported APIs.
2021/09/14 21:51:19 Release version 'traefik-crd.v2' added successfully.
2021/09/14 21:51:19 Release 'traefik-crd' with deprecated or removed APIs updated successfully to new version.
2021/09/14 21:51:19 Map of release 'traefik-crd' deprecated or removed APIs to supported versions, completed successfully.
Upgrading traefik-crd
+ echo 'Upgrading traefik-crd'
+ shift 1
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Release "traefik-crd" has been upgraded. Happy Helming!
NAME: traefik-crd
LAST DEPLOYED: Tue Sep 14 21:51:19 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 3
TEST SUITE: None
+ exit
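The `LINE=,deployed` step in the trace above comes from parsing the `helm ls` JSON; a minimal standalone reproduction of just that parsing, with the live `helm ls | jq | tr` pipeline replaced by a literal since it needs a running cluster:

```shell
# The jq step in the real script emits "<app_version>,<status>";
# app_version is empty for this release, hence the leading comma.
LINE=",deployed"

# Split on the comma exactly as the klipper-helm entry script does.
IFS=, read -r INSTALLED_VERSION STATUS _ <<< "$LINE"

echo "installed='${INSTALLED_VERSION}' status='${STATUS}'"
# → installed='' status='deployed'
```

An empty `INSTALLED_VERSION` is harmless here; the script only branches on `STATUS` being `deployed`.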
This is still failing on master branch commit eda65b19d9893a033c681ab5b0045d7a0e29d6cb with the same error:
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
Confirmed it is using the bumped image rancher/klipper-helm:v0.6.5-build20210915.
@rancher-max can you provide the full pod log so we can see whether or not it's running the plugin to migrate the old apiversions?
$ k logs -n kube-system pod/helm-install-traefik-crd-67p74
CHART=$(sed -e "s/%{KUBERNETES_API}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/g" <<< "${CHART}")
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ '[' '' '!=' true ']'
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
+ tiller --listen=127.0.0.1:44134 --storage=secret
[main] 2021/09/16 19:12:49 Starting Tiller v2.17.0 (tls=false)
[main] 2021/09/16 19:12:49 GRPC listening on 127.0.0.1:44134
[main] 2021/09/16 19:12:49 Probes listening on :44135
[main] 2021/09/16 19:12:49 Storage driver is Secret
[main] 2021/09/16 19:12:49 Max history per release is 0
Creating /root/.helm
Creating /root/.helm/repository
Creating /root/.helm/repository/cache
Creating /root/.helm/repository/local
Creating /root/.helm/plugins
Creating /root/.helm/starters
Creating /root/.helm/cache/archive
Creating /root/.helm/repository/repositories.yaml
Adding stable repo with URL: https://charts.helm.sh/stable/
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ jq -r '.Releases | length'
++ helm_v2 ls --all '^traefik-crd$' --output json
[storage] 2021/09/16 19:12:49 listing all releases with filter
+ EXIST=
+ '[' '' == 1 ']'
+ '[' '' == v2 ']'
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/traefik-crd.tgz.base64
+ CHART_PATH=/traefik-crd.tgz
+ '[' '!' -f /chart/traefik-crd.tgz.base64 ']'
+ return
+ '[' install '!=' delete ']'
+ helm_repo_init
+ grep -q -e 'https\?://'
+ echo 'chart path is a url, skipping repo update'
chart path is a url, skipping repo update
+ helm_v3 repo remove stable
Error: no repositories configured
+ true
+ return
+ helm_update install
+ '[' helm_v3 == helm_v3 ']'
++ helm_v3 ls -f '^traefik-crd$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=,deployed
++ echo ,deployed
++ cut -f1 -d,
+ INSTALLED_VERSION=
++ echo ,deployed
++ cut -f2 -d,
+ STATUS=deployed
+ VALUES=
+ '[' install = delete ']'
+ '[' -z '' ']'
+ '[' -z deployed ']'
+ '[' deployed = deployed ']'
+ echo Already installed traefik-crd, upgrading
+ shift 1
+ helm_v3 upgrade traefik-crd https://10.43.0.1:443/static/charts/traefik-crd-10.3.0.tgz
Already installed traefik-crd, upgrading
Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
OK so that is running with the old image that doesn't have the plugin to fix the api versions. I am guessing that the old server is currently the leader for helm-controller so it's creating jobs with the unfixed image. This should resolve itself after both nodes are upgraded. If it does not, then we have a problem.
Yeah I've upgraded all nodes and it's still having the problem:
$ kubectl get nodes,pods -A -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/ip-172-31-2-200 Ready control-plane,master 90m v1.22.1+k3s-eda65b19 172.31.2.200 <redacted> Ubuntu 20.04 LTS 5.4.0-1009-aws containerd://1.5.5-k3s1
node/ip-172-31-15-146 Ready <none> 86m v1.22.1+k3s-eda65b19 172.31.15.146 <redacted> Ubuntu 20.04 LTS 5.4.0-1009-aws containerd://1.5.5-k3s1
node/ip-172-31-11-75 Ready control-plane,master 87m v1.22.1+k3s-eda65b19 172.31.11.75 <redacted> Ubuntu 20.04 LTS 5.4.0-1009-aws containerd://1.5.5-k3s1
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/svclb-traefik-m2qj9 2/2 Running 0 89m 10.42.0.8 ip-172-31-2-200 <none> <none>
kube-system pod/svclb-traefik-jkckx 2/2 Running 0 87m 10.42.1.2 ip-172-31-11-75 <none> <none>
kube-system pod/svclb-traefik-qvhw8 2/2 Running 0 86m 10.42.2.2 ip-172-31-15-146 <none> <none>
kube-system pod/local-path-provisioner-64ffb68fd-cfw6q 1/1 Running 0 36m 10.42.2.3 ip-172-31-15-146 <none> <none>
kube-system pod/metrics-server-9cf544f65-mrf8q 1/1 Running 0 36m 10.42.1.4 ip-172-31-11-75 <none> <none>
kube-system pod/coredns-85cb69466-sdw4v 1/1 Running 0 36m 10.42.1.3 ip-172-31-11-75 <none> <none>
kube-system pod/traefik-97b44b794-8t4q2 1/1 Running 0 89m 10.42.0.7 ip-172-31-2-200 <none> <none>
kube-system pod/helm-install-traefik-krx6t 0/1 CrashLoopBackOff 16 (34s ago) 36m 10.42.2.5 ip-172-31-15-146 <none> <none>
kube-system pod/helm-install-traefik-crd-67p74 0/1 CrashLoopBackOff 16 (28s ago) 36m 10.42.2.4 ip-172-31-15-146 <none> <none>
It looks like it has the new image from the job though?
NAME COMPLETIONS DURATION AGE CONTAINERS IMAGES SELECTOR
job.batch/helm-install-traefik-crd 0/1 36m helm rancher/klipper-helm:v0.6.5-build20210915 controller-uid=03005947-3975-40b3-8c56-2188fb059562
job.batch/helm-install-traefik 0/1 36m helm rancher/klipper-helm:v0.6.5-build20210915 controller-uid=432886c0-16a4-4422-b484-4ab4e3e09598
It does look like that, but the message you're getting no longer exists in that version of the image:
+ echo Already installed traefik-crd, upgrading
https://github.com/k3s-io/klipper-helm/commit/94187599274003492af29c4d6f3fde19d90c900e
Can you inspect the Pods, not the Job? I wonder if there's something going on with the job controller where it's not picking up the image change.
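One way to compare what the Job template specifies against what the Pod is actually running; a sketch assuming kubectl access, with object names taken from the output above:

```shell
# Image the Job template asks for:
kubectl get job -n kube-system helm-install-traefik-crd \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Image the existing Pod is actually running:
kubectl get pod -n kube-system helm-install-traefik-crd-67p74 \
  -o jsonpath='{.spec.containers[0].image}'

# Full pod spec for closer inspection:
kubectl get pod -n kube-system helm-install-traefik-crd-67p74 -o yaml
```

If the two images differ, the Job controller has updated the template but never replaced the old pod.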
Interesting, yeah, it looks like it's stuck with its old image:
apiVersion: v1
kind: Pod
metadata:
annotations:
helmcharts.helm.cattle.io/configHash: SHA256=E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
creationTimestamp: "2021-09-16T19:07:27Z"
generateName: helm-install-traefik-crd-
labels:
controller-uid: 277897c9-6ee3-4a6e-ad92-6f9b8b9daa7b
helmcharts.helm.cattle.io/chart: traefik-crd
job-name: helm-install-traefik-crd
name: helm-install-traefik-crd-67p74
namespace: kube-system
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: helm-install-traefik-crd
uid: 277897c9-6ee3-4a6e-ad92-6f9b8b9daa7b
resourceVersion: "22042"
uid: 8c1e0f4d-c48f-4869-a040-24c072c8f484
spec:
containers:
- args:
- install
env:
- name: NAME
value: traefik-crd
- name: VERSION
- name: REPO
- name: HELM_DRIVER
value: secret
- name: CHART_NAMESPACE
value: kube-system
- name: CHART
value: https://%{KUBERNETES_API}%/static/charts/traefik-crd-10.3.0.tgz
- name: HELM_VERSION
- name: TARGET_NAMESPACE
value: kube-system
- name: NO_PROXY
value: .svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
image: rancher/klipper-helm:v0.5.0-build20210505
imagePullPolicy: IfNotPresent
name: helm
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /config
name: values
- mountPath: /chart
name: content
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-b7c7v
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-172-31-15-146
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext: {}
serviceAccount: helm-traefik-crd
serviceAccountName: helm-traefik-crd
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- configMap:
defaultMode: 420
name: chart-values-traefik-crd
name: values
- configMap:
defaultMode: 420
name: chart-content-traefik-crd
name: content
- name: kube-api-access-b7c7v
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-09-16T19:07:27Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-09-16T20:19:43Z"
message: 'containers with unready status: [helm]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-09-16T20:19:43Z"
message: 'containers with unready status: [helm]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-09-16T19:07:27Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://07e716aaeaa141994fcc982bcac4f105558e8545db1042588d8edc9859decf7a
image: docker.io/rancher/klipper-helm:v0.5.0-build20210505
imageID: docker.io/rancher/klipper-helm@sha256:ce86a3b3e258992779856d98ba3f6b7235cde99fcd28d32e328a5549ce29c702
lastState:
terminated:
containerID: containerd://07e716aaeaa141994fcc982bcac4f105558e8545db1042588d8edc9859decf7a
exitCode: 1
finishedAt: "2021-09-16T20:19:43Z"
reason: Error
startedAt: "2021-09-16T20:19:42Z"
name: helm
ready: false
restartCount: 23
started: false
state:
waiting:
message: back-off 5m0s restarting failed container=helm pod=helm-install-traefik-crd-67p74_kube-system(8c1e0f4d-c48f-4869-a040-24c072c8f484)
reason: CrashLoopBackOff
hostIP: 172.31.15.146
phase: Running
podIP: 10.42.2.4
podIPs:
- ip: 10.42.2.4
qosClass: BestEffort
startTime: "2021-09-16T19:07:27Z"
There's definitely something going on with the job; the controller-uid label on the pods doesn't match that of the job.
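The mismatch can be confirmed directly; a sketch assuming kubectl access:

```shell
# The Job's own UID, which its pods' controller-uid label should match:
kubectl get job -n kube-system helm-install-traefik-crd -o jsonpath='{.metadata.uid}'

# List the job's pods with their controller-uid label shown as a column:
kubectl get pods -n kube-system -l job-name=helm-install-traefik-crd -L controller-uid
```

In the output pasted above, the Job listing shows controller-uid=03005947-… while the pod carries controller-uid 277897c9-…, so the pod belongs to a previous incarnation of the Job.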
Something bad is happening with the Job controller on 1.22; even after deleting the CrashLoopBackOff pods, it isn't creating new ones to replace them. This is probably an upstream issue.
There is a potential workaround though, let me try something.
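The thread doesn't spell out what was tried; one plausible approach (an assumption, not necessarily the actual workaround) is to delete the stale Jobs so the k3s helm-controller recreates them from the HelmChart resources with the current image:

```shell
# Deleting the jobs removes their pods too; the helm-controller
# should then recreate them from the HelmChart spec with the new image.
kubectl delete job -n kube-system helm-install-traefik-crd helm-install-traefik
```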
This issue covered only the problems described in the comments above. It is resolved on v1.22.2-rc1+k3s1; any additional issues that turn up will be tracked in new issues.
I had a cluster using an external SQL DB with 2 servers and 1 agent, all on v1.21.1+k3s1. I performed a manual (curl) upgrade to master commit 90960ebf4e7a08df075526f411e3afe06731b01e on one server. This caused a crash loop in the k3s logs; after upgrading the second server, the crash went away and both servers were running successfully. https://github.com/k3s-io/k3s/pull/3993