GonzoHsu opened 1 year ago
Yep, we are also experiencing the same behavior.
I'm unable to replicate this. What version of k3s or helm-controller are you on? The code here should delete any objects owned by the HelmChart: https://github.com/k3s-io/helm-controller/blob/57fde4639aa8e1b01533006c02bcfce9367b7c3e/pkg/controllers/chart/chart.go#L253-L265
The k3s version I used is: k3s version v1.23.16+k3s1 (64b0feeb)
This condition does not happen every time; if you try to replicate it within a short time window, you may not be able to reproduce it.
I also checked the code and I think the following part causes it, but I am not sure what timing triggers it: https://github.com/k3s-io/helm-controller/blob/57fde4639aa8e1b01533006c02bcfce9367b7c3e/pkg/controllers/chart/chart.go#L215-L220
When the condition happens, not only the delete job but also some of the HelmChart's configmaps are created again.
I've run the end to end tests quite a few times and not been able to reproduce it; do you have any circumstances or specific steps that seem to contribute to it? Deleting the HelmChart too quickly after creating it, deleting the namespace before the HelmChart, or so on?
My setup is a Rocky Linux VM on a VMware ESXi server with 2 vCPUs and 8 GB RAM, with k3s installed on the VM.
I didn't follow any specific steps to create/delete the HelmChart, just used the command "kubectl apply/delete -f".
How long did you wait between applying and deleting it? Did the install succeed? Was it still mid-install when deleted?
The install succeeded; I can see my pod running without problems, not in a mid-install state. The time between applying and deleting ranged from a few minutes to a few days; I didn't find any correlation there.
I suspect we encountered a similar issue in Harvester.
When the same HelmChart (deployment only, no CRD) is created/deleted frequently, there is a bug: the delete job left over from the previous round is picked up by the re-installed HelmChart the next time it is deleted.
harv41:/home/rancher # kk get jobs -A && kk get jobs -n cattle-monitoring-system helm-delete-rancher-monitoring -oyaml
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-monitoring-system helm-delete-rancher-monitoring 1/1 4s 3m41s
kube-system helm-install-rke2-canal 1/1 12s 7d2h
status:
  completionTime: "2023-06-29T10:31:52Z"
  conditions:
  - lastProbeTime: "2023-06-29T10:31:52Z"
    lastTransitionTime: "2023-06-29T10:31:52Z"
    status: "True"
    type: Complete
  ready: 0
  startTime: "2023-06-29T10:31:48Z"
  succeeded: 1
  uncountedTerminatedPods: {}
This bug causes the HelmChart's related downstream deployments to be left behind.
The Harvester addon is built on top of HelmChart, and we are adding a workaround: each time we deploy or delete the HelmChart, we delete the potentially leftover job first.
https://github.com/harvester/harvester/issues/4108#issuecomment-1601679054
@brandond
The job name of a chart is fixed, not generated. If a previous job is still present (not outdated), it may be picked up again the next round and assumed to have finished quickly. From my tests in Harvester, this is very likely the cause, for both install and delete.
harv41:/home/rancher # kk get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-monitoring-system helm-install-rancher-monitoring 1/1 52s 3m13s
Deleting one HelmChart seems to trigger the delete job twice; the HelmChart is gone, but one job is left behind, and it affects the next round of deletion.
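To illustrate the fixed-name hazard described above, here is a minimal toy sketch (the types and helper names are ours, not the controller's actual code): a completed delete job left over from an earlier round makes the next uninstall look finished immediately.

```go
package main

import "fmt"

// job is a minimal stand-in for a batch/v1 Job's status; this is a
// hypothetical model, not the controller's real types.
type job struct {
	name      string
	succeeded int
}

// deleteJobName mirrors the fixed-name scheme: the uninstall job for a
// chart is always "helm-delete-<chart>", never a generated unique name.
func deleteJobName(chart string) string {
	return "helm-delete-" + chart
}

// uninstallDone sketches the hazard: the controller looks up the job by
// its fixed name, so a completed job left over from a *previous* round
// makes the current uninstall look finished before it even ran.
func uninstallDone(existing map[string]job, chart string) bool {
	j, ok := existing[deleteJobName(chart)]
	return ok && j.succeeded > 0
}

func main() {
	// A stale completed job from the previous delete round is still present.
	jobs := map[string]job{
		"helm-delete-rancher-monitoring": {name: "helm-delete-rancher-monitoring", succeeded: 1},
	}
	// The new round's uninstall is reported "done" immediately.
	fmt.Println(uninstallDone(jobs, "rancher-monitoring")) // true
}
```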
I ran the following test; here are the events and jobs:
First round:
1.1 create HelmChart
cattle-monitoring-system 13m Normal SuccessfulCreate job/helm-install-rancher-monitoring Created pod: helm-install-rancher-monitoring-wnrm6
cattle-monitoring-system 13m Normal Completed job/helm-install-rancher-monitoring Job completed
1.2 delete HelmChart
cattle-monitoring-system 12m Normal SuccessfulCreate job/helm-delete-rancher-monitoring Created pod: helm-delete-rancher-monitoring-d2n62
cattle-monitoring-system 12m Normal Completed job/helm-delete-rancher-monitoring Job completed
cattle-monitoring-system 12m Normal SuccessfulCreate job/helm-delete-rancher-monitoring Created pod: helm-delete-rancher-monitoring-7lkj9
cattle-monitoring-system 12m Normal Completed job/helm-delete-rancher-monitoring Job completed
1.3 2 jobs are left:
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-monitoring-system helm-delete-rancher-monitoring 1/1 4s 9m48s
cattle-monitoring-system helm-install-rancher-monitoring 1/1 39s 47s
Second round:
2.1 create HelmChart
cattle-monitoring-system 3m10s Normal SuccessfulCreate job/helm-install-rancher-monitoring Created pod: helm-install-rancher-monitoring-b2kcz
cattle-monitoring-system 2m31s Normal Completed job/helm-install-rancher-monitoring Job completed
2.2 delete HelmChart
cattle-monitoring-system 117s Normal SuccessfulCreate job/helm-delete-rancher-monitoring Created pod: helm-delete-rancher-monitoring-5dvlw
cattle-monitoring-system 107s Normal Completed job/helm-delete-rancher-monitoring Job completed
cattle-monitoring-system 105s Normal SuccessfulCreate job/helm-delete-rancher-monitoring Created pod: helm-delete-rancher-monitoring-8dfh9
cattle-monitoring-system 101s Normal Completed job/helm-delete-rancher-monitoring Job completed
2.3 1 job is left
NAMESPACE NAME COMPLETIONS DURATION AGE
cattle-monitoring-system helm-delete-rancher-monitoring 1/1 4s 2m57s
With a workaround in Harvester that forcibly deletes the job before triggering the HelmChart action, we can avoid this issue now.
https://github.com/harvester/harvester/pull/4127
time="2023-06-29T14:56:52Z" level=info msg="OnChange: user disable addon, move from AddonDeploySuccessful to new disable status AddonDisabling"
///// before deleting the HelmChart, delete those 2 leftover jobs
time="2023-06-29T14:56:52Z" level=info msg="previous job cattle-monitoring-system/helm-delete-rancher-monitoring is to be deleted, wait"
W0629 14:56:52.802777 7 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning
time="2023-06-29T14:56:52Z" level=info msg="previous job cattle-monitoring-system/helm-install-rancher-monitoring is to be deleted, wait"
W0629 14:56:52.813165 7 warnings.go:70] child pods are preserved by default when jobs are deleted; set propagationPolicy=Background to remove them or set propagationPolicy=Orphan to suppress this warning
time="2023-06-29T14:56:57Z" level=info msg="delete the helm chart cattle-monitoring-system/rancher-monitoring"
...
time="2023-06-29T14:57:10Z" level=info msg="addon rancher-monitoring: helm chart is gone, or owned false, addon is in AddonDisabling status, move to init state"
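The ordering of the workaround can be sketched with a simplified in-memory model (the names are hypothetical, not Harvester's actual code): both fixed-name jobs are removed before the HelmChart object is touched, so the controller cannot pick up a stale completed job.

```go
package main

import "fmt"

// jobStore is an in-memory stand-in for the Jobs API, used only to
// sketch the ordering of the workaround; all names are ours.
type jobStore map[string]bool

func (s jobStore) deleteIfExists(name string) bool {
	if s[name] {
		delete(s, name)
		return true
	}
	return false
}

// prepareChartAction sketches the workaround shown in the logs above:
// before creating or deleting the HelmChart, remove both fixed-name
// jobs that a previous round may have left behind.
func prepareChartAction(jobs jobStore, chart string) []string {
	var deleted []string
	for _, name := range []string{"helm-install-" + chart, "helm-delete-" + chart} {
		if jobs.deleteIfExists(name) {
			deleted = append(deleted, name)
		}
	}
	return deleted
}

func main() {
	jobs := jobStore{
		"helm-install-rancher-monitoring": true,
		"helm-delete-rancher-monitoring":  true,
	}
	fmt.Println(prepareChartAction(jobs, "rancher-monitoring"))
	// only now is it safe to apply/delete the HelmChart object
}
```

In the real cluster the equivalent deletion also needs background propagation (as the `propagationPolicy=Background` warnings in the logs indicate), so that the jobs' pods are removed along with the jobs.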
@w13915984028 I am curious why you don't see any of the events from the helm controller itself when you delete the chart. Are you filtering these out? What events and log messages do you see?
Rather than adding code to Harvester to manually clean up after helm-controller, would you mind making an attempt to fix the unwanted behavior here? It should be somewhere in the OnRemove function at https://github.com/k3s-io/helm-controller/blob/master/pkg/controllers/chart/chart.go#L201.
@brandond We met the same problem, and it's really weird that deleting a HelmChart seems to trigger the delete job twice. Here is our environment version: k3s version v1.28.4+k3s2 (6ba6c1b6), go version go1.20.11. I ran the following test; the commands, event logs, and jobs are below.
1. First round: apply the test HelmChart.
[root@localhost k8s_env]# kubectl apply -f run-test.yaml
helmchart.helm.cattle.io/test created
Check the helm-install job, event logs, and pod resources:
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
kube-system helm-install-test 0/1 9s 9s
[root@localhost k8s_env]# kubectl get event -A
kube-system 4s Normal SuccessfulCreate job/helm-install-test Created pod: helm-install-test-5v2kp
kube-system 4s Normal Scheduled pod/helm-install-test-5v2kp Successfully assigned kube-system/helm-install-test-5v2kp to localhost
kube-system 4s Normal ApplyJob helmchart/test Applying HelmChart using Job kube-system/helm-install-test
kube-system 4s Normal AddedInterface pod/helm-install-test-5v2kp Add eth0 [10.42.2.123/16] from cnibr
kube-system 4s Normal Pulled pod/helm-install-test-5v2kp Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 4s Normal Created pod/helm-install-test-5v2kp Created container helm
kube-system 4s Normal Started pod/helm-install-test-5v2kp Started container helm
default 3s Normal ScalingReplicaSet deployment/test Scaled up replica set test-9f64fdb7f to 1
default 3s Normal SuccessfulCreate replicaset/test-9f64fdb7f Created pod: test-9f64fdb7f-smf22
default 3s Normal Scheduled pod/test-9f64fdb7f-smf22 Successfully assigned default/test-9f64fdb7f-smf22 to localhost
default 3s Normal AddedInterface pod/test-9f64fdb7f-smf22 Add eth0 [10.42.2.124/16] from cnibr
default 3s Normal Pulled pod/test-9f64fdb7f-smf22 Container image "test" already present on machine
default 3s Normal Created pod/test-9f64fdb7f-smf22 Created container test
default 3s Normal Started pod/test-9f64fdb7f-smf22 Started container test
kube-system 1s Normal Completed job/helm-install-test Job completed
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-smf22 1/1 Running 0 3m13s
Delete the HelmChart after the pod is running, then check the event logs and jobs.
[root@localhost k8s_env]# kubectl delete -f run-test.yaml
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-smf22 1/1 Terminating 0 4m50s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
kube-system helm-delete-test 1/1 3s 15s
[root@localhost k8s_env]# kubectl get event -A
kube-system 61s Normal SuccessfulCreate job/helm-delete-test Created pod: helm-delete-test-49284
kube-system 60s Normal Scheduled pod/helm-delete-test-49284 Successfully assigned kube-system/helm-delete-test-49284 to localhost
kube-system 61s Normal AddedInterface pod/helm-delete-test-49284 Add eth0 [10.42.2.126/16] from cnibr
kube-system 60s Normal Pulled pod/helm-delete-test-49284 Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 60s Normal Created pod/helm-delete-test-49284 Created container helm
kube-system 60s Normal Started pod/helm-delete-test-49284 Started container helm
default 60s Normal Killing pod/test-9f64fdb7f-smf22 Stopping container test
kube-system 58s Normal Completed job/helm-delete-test Job completed
kube-system 55s Normal RemoveJob helmchart/test Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
kube-system 55s Normal SuccessfulCreate job/helm-delete-test Created pod: helm-delete-test-vj4b2
kube-system 54s Normal Scheduled pod/helm-delete-test-vj4b2 Successfully assigned kube-system/helm-delete-test-vj4b2 to localhost
kube-system 55s Normal AddedInterface pod/helm-delete-test-vj4b2 Add eth0 [10.42.2.127/16] from cnibr
kube-system 54s Normal Pulled pod/helm-delete-test-vj4b2 Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 54s Normal Created pod/helm-delete-test-vj4b2 Created container helm
kube-system 54s Normal Started pod/helm-delete-test-vj4b2 Started container helm
kube-system 52s Normal Completed job/helm-delete-test Job completed
[root@localhost k8s_env]# kubectl get pod
No resources found in default namespace.
From the events and jobs, we find that after the first delete job completed and was removed, k3s started a second delete job that was never removed, so a delete job is left behind.
2. Second round:
[root@localhost k8s_env]# kubectl apply -f run-test.yaml
helmchart.helm.cattle.io/test created
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-wbhrd 1/1 Running 0 5s
[root@localhost k8s_env]# kubectl get jobs.batch -A
NAMESPACE NAME COMPLETIONS DURATION AGE
kube-system helm-delete-test 1/1 3s 8m20s
kube-system helm-install-test 1/1 3s 13s
[root@localhost k8s_env]# kubectl get event -A
kube-system 48s Normal SuccessfulCreate job/helm-install-test Created pod: helm-install-test-9jkgj
kube-system 47s Normal Scheduled pod/helm-install-test-9jkgj Successfully assigned kube-system/helm-install-test-9jkgj to localhost
kube-system 48s Normal ApplyJob helmchart/test Applying HelmChart using Job kube-system/helm-install-test
kube-system 47s Normal AddedInterface pod/helm-install-test-9jkgj Add eth0 [10.42.2.128/16] from cnibr
kube-system 47s Normal Pulled pod/helm-install-test-9jkgj Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 47s Normal Created pod/helm-install-test-9jkgj Created container helm
kube-system 47s Normal Started pod/helm-install-test-9jkgj Started container helm
default 47s Normal ScalingReplicaSet deployment/test Scaled up replica set test-9f64fdb7f to 1
default 47s Normal SuccessfulCreate replicaset/test-9f64fdb7f Created pod: test-9f64fdb7f-wbhrd
default 46s Normal Scheduled pod/test-9f64fdb7f-wbhrd Successfully assigned default/test-9f64fdb7f-wbhrd to localhost
default 46s Normal AddedInterface pod/test-9f64fdb7f-wbhrd Add eth0 [10.42.2.129/16] from cnibr
default 46s Normal Pulled pod/test-9f64fdb7f-wbhrd Container image "test" already present on machine
default 46s Normal Created pod/test-9f64fdb7f-wbhrd Created container test
default 46s Normal Started pod/test-9f64fdb7f-wbhrd Started container test
kube-system 45s Normal Completed job/helm-install-test Job completed
[root@localhost k8s_env]# kubectl delete -f run-test.yaml
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-wbhrd 1/1 Running 0 3m6s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
[root@localhost k8s_env]# kubectl get event -A
kube-system 46s Normal RemoveJob helmchart/test Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
In the second round, the helm-delete process just removed the job left over from the first round, but the pod is still running.
3. Third round: we wanted to check more helm-controller debug logs, so we set the disable-helm-controller arg and downloaded the helm-controller binary at version v0.15.4, the same version used in k3s v1.28.4. After running the program manually and doing the same test, we got a different result.
[root@localhost k8s_env]# vim /usr/lib/systemd/system/k3s.service
[root@localhost k8s_env]# systemctl daemon-reload
[root@localhost k8s_env]# systemctl restart k3s
[root@localhost helm-controller]# ./helm-controller-amd64 --kubeconfig /etc/rancher/k3s/k3s.yaml &
[root@localhost k8s_env]# kubectl apply -f run-test.yaml
helmchart.helm.cattle.io/test created
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-cjlxc 1/1 Running 0 3s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
kube-system helm-install-test 1/1 4s 11s
[root@localhost k8s_env]# kubectl get event -A
kube-system 49s Normal SuccessfulCreate job/helm-install-test Created pod: helm-install-test-tmmgm
kube-system 48s Normal Scheduled pod/helm-install-test-tmmgm Successfully assigned kube-system/helm-install-test-tmmgm to localhost
kube-system 49s Normal AddedInterface pod/helm-install-test-tmmgm Add eth0 [10.42.2.144/16] from cnibr
kube-system 48s Normal Pulled pod/helm-install-test-tmmgm Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 48s Normal Created pod/helm-install-test-tmmgm Created container helm
kube-system 48s Normal Started pod/helm-install-test-tmmgm Started container helm
default 48s Normal ScalingReplicaSet deployment/test Scaled up replica set test-9f64fdb7f to 1
default 48s Normal SuccessfulCreate replicaset/test-9f64fdb7f Created pod: test-9f64fdb7f-cjlxc
default 47s Normal Scheduled pod/test-9f64fdb7f-cjlxc Successfully assigned default/test-9f64fdb7f-cjlxc to localhost
default 47s Normal AddedInterface pod/test-9f64fdb7f-cjlxc Add eth0 [10.42.2.145/16] from cnibr
default 47s Normal Pulled pod/test-9f64fdb7f-cjlxc Container image "test" already present on machine
default 47s Normal Created pod/test-9f64fdb7f-cjlxc Created container test
default 47s Normal Started pod/test-9f64fdb7f-cjlxc Started container test
kube-system 45s Normal Completed job/helm-install-test Job completed
kube-system 45s Normal ApplyJob helmchart/test Applying HelmChart using Job kube-system/helm-install-test
[root@localhost k8s_env]# kubectl delete -f run-test.yaml
helmchart.helm.cattle.io "test" deleted
[root@localhost k8s_env]# kubectl get pod
NAME READY STATUS RESTARTS AGE
test-9f64fdb7f-cjlxc 1/1 Terminating 0 2m45s
[root@localhost k8s_env]# kubectl get job -A
NAMESPACE NAME COMPLETIONS DURATION AGE
[root@localhost k8s_env]# kubectl get event -A
kube-system 52s Normal SuccessfulCreate job/helm-delete-test Created pod: helm-delete-test-sjfpn
kube-system 51s Normal Scheduled pod/helm-delete-test-sjfpn Successfully assigned kube-system/helm-delete-test-sjfpn to localhost
kube-system 52s Normal AddedInterface pod/helm-delete-test-sjfpn Add eth0 [10.42.2.146/16] from cnibr
kube-system 51s Normal Pulled pod/helm-delete-test-sjfpn Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 51s Normal Created pod/helm-delete-test-sjfpn Created container helm
kube-system 51s Normal Started pod/helm-delete-test-sjfpn Started container helm
default 51s Normal Killing pod/test-9f64fdb7f-cjlxc Stopping container test
kube-system 49s Normal Completed job/helm-delete-test Job completed
kube-system 46s Normal SuccessfulCreate job/helm-delete-test Created pod: helm-delete-test-qq5zc
kube-system 45s Normal Scheduled pod/helm-delete-test-qq5zc Successfully assigned kube-system/helm-delete-test-qq5zc to localhost
kube-system 46s Normal AddedInterface pod/helm-delete-test-qq5zc Add eth0 [10.42.2.147/16] from cnibr
kube-system 45s Normal Pulled pod/helm-delete-test-qq5zc Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
kube-system 45s Normal Created pod/helm-delete-test-qq5zc Created container helm
kube-system 45s Normal Started pod/helm-delete-test-qq5zc Started container helm
kube-system 43s Normal Completed job/helm-delete-test Job completed
kube-system 40s Normal RemoveJob helmchart/test Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
Here you can see a different outcome for the same test: the helm-delete job has been cleaned up, and the event logs show that the RemoveJob event was emitted after both jobs completed.
Here is helm-controller logs:
[root@localhost helm-controller]# INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255465", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
ERRO[0863] error syncing 'kube-system/test': handler helm-controller-chart-registration: helmcharts.helm.cattle.io "test" not found, requeuing
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255465", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
INFO[0863] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
INFO[0866] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
INFO[0866] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255472", FieldPath:""}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-test
ERRO[1011] error syncing 'kube-system/test': handler on-helm-chart-remove: waiting for delete of helm chart for kube-system/test by helm-delete-test, requeuing
INFO[1014] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255707", FieldPath:""}): type: 'Normal' reason: 'RemoveJob' Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
ERRO[1017] error syncing 'kube-system/test': handler on-helm-chart-remove: waiting for delete of helm chart for kube-system/test by helm-delete-test, requeuing
INFO[1020] Event(v1.ObjectReference{Kind:"HelmChart", Namespace:"kube-system", Name:"test", UID:"cd455f26-15c0-4318-85e7-d34b06aa4f04", APIVersion:"helm.cattle.io/v1", ResourceVersion:"255739", FieldPath:""}): type: 'Normal' reason: 'RemoveJob' Uninstalled HelmChart using Job kube-system/helm-delete-test, removing resources
It's strange that the same test produced different outcomes, which leads to the following questions:
1. Why was the helm-delete job triggered twice?
2. According to the helm-controller logic and logs, RemoveJob is triggered once a job is complete. Why was RemoveJob triggered only once in the first- and second-round tests, where the helm-delete job was triggered twice?
Any resolution in 2024?
I reproduced it in my environment, with a high probability of reproduction. My questions:
1. Why is there still a helm-delete-nginx job after the first manifest deletion (reproduce step 2) completes successfully?
2. Why are the deployment and pods related to the manifest not deleted after the second manifest deletion (reproduce step 4)?
Here are my environment and the steps to reproduce.
Environment info:
Linux localhost.localdomain 4.18.0-513.5.1.el8_9.x86_64 #1 SMP Fri Nov 17 03:31:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/usr/lib/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s \
    server \
Reproduce Preparation
- manifest content (saved as mainfest.yaml in my env):
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nginx
  namespace: kube-system
spec:
  chart: https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz
  targetNamespace: default
- helm chart package:
I downloaded nginx-15.12.2.tgz from https://artifacthub.io/packages/helm/bitnami/nginx and then copied it into /var/lib/rancher/k3s/server/static/charts.
Reproduction steps:
### Step 1: Apply the manifest.
[root@localhost testing]# kubectl apply -f mainfest.yaml
helmchart.helm.cattle.io/nginx created
After the manifest is applied successfully, you will see the following:
[root@localhost testing]# kubectl get helmchart -n kube-system nginx
NAME    JOB                  CHART                                                         TARGETNAMESPACE   VERSION   REPO   HELMVERSION   BOOTSTRAP
nginx   helm-install-nginx   https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz   default
[root@localhost testing]# kubectl get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           3m34s
[root@localhost testing]# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-qlmtx   1/1     Running   0          3m11s
[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h9m
helm-install-traefik       1/1           37s        3h9m
helm-install-nginx         1/1           4s         4m2s
### Step 2: Delete the manifest.
[root@localhost testing]# kubectl delete -f mainfest.yaml
helmchart.helm.cattle.io "nginx" deleted
After the manifest is deleted successfully, you will see the following:
[root@localhost testing]# kubectl get helmchart -n kube-system nginx
Error from server (NotFound): helmcharts.helm.cattle.io "nginx" not found
[root@localhost testing]# kubectl get deployment
No resources found in default namespace.
[root@localhost testing]# kubectl get pod
No resources found in default namespace.
[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h21m
helm-install-traefik       1/1           37s        3h21m
helm-delete-nginx          1/1           4s         4m55s
### Step 3: Apply the same manifest again.
[root@localhost testing]# kubectl apply -f mainfest.yaml
helmchart.helm.cattle.io/nginx created
After the manifest is applied again successfully, you will see the following:
[root@localhost testing]# kubectl get helmchart -n kube-system
NAME JOB CHART TARGETNAMESPACE VERSION REPO HELMVERSION BOOTSTRAP
nginx helm-install-nginx https://%{KUBERNETES_API}%/static/charts/nginx-15.14.2.tgz default
[root@localhost testing]# kubectl get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           2m28s
[root@localhost testing]# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-lrhlc   1/1     Running   0          2m39s
[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h26m
helm-install-traefik       1/1           37s        3h26m
helm-delete-nginx          1/1           4s         9m23s
helm-install-nginx         1/1           6s         3m3s
### Step 4: Delete the same manifest again.
[root@localhost testing]# kubectl delete -f mainfest.yaml
helmchart.helm.cattle.io "nginx" deleted
After the manifest is deleted again, you will see the following:
[root@localhost testing]# kubectl get helmchart -n kube-system nginx
Error from server (NotFound): helmcharts.helm.cattle.io "nginx" not found
[root@localhost testing]# kubectl get deployment
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           9m32s
[root@localhost testing]# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5864d77f94-lrhlc   1/1     Running   0          9m46s
[root@localhost testing]# kubectl get job -n kube-system
NAME                       COMPLETIONS   DURATION   AGE
helm-install-traefik-crd   1/1           24s        3h33m
helm-install-traefik       1/1           37s        3h33m
@brandond This situation is the same as in my test. We are certain we can reproduce this issue 100% of the time. Could it possibly be caused by different package versions that k3s and helm-controller both depend on?
I think I have finally found the root cause.
The logs below confirm my guess; I will explain in the following steps.
Apr 14 20:25:40 harv2 rke2[1997]: time="2024-04-14T20:25:40Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 14 20:27:02 harv2 rke2[1997]: time="2024-04-14T20:27:02Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"
Apr 14 20:27:08 harv2 rke2[1997]: time="2024-04-14T20:27:08Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"
func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, error) {
...
expectedJob, objs, err := c.getJobAndRelatedResources(chart)
...
-> (A) create a job
// note: on the logic of running an apply here...
// if the uninstall job does not exist, it will create it
// if the job already exists and it is uninstalling, nothing will change since there's no need to patch
// if the job already exists but is tied to an install or upgrade, there will be a need to patch so
// the apply will execute the jobPatcher, which will delete the install/upgrade job and recreate a uninstall job
err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
AllowClusterScoped: true,
}).
WithOwner(chart).
WithSetID("helm-chart-registration").
ApplyObjects(append(objs, expectedJob)...)
...
// sleep for 3 seconds to give the job time to perform the helm install
// before emitting any errors
time.Sleep(3 * time.Second)
...
-> (B) check job finished or not, and update object
// the first time we call this, the job will definitely not be complete... however,
// throwing an error here will re-enqueue this controller, which will process the apply again
// and get the job object from the cache to check again
if job.Status.Succeeded <= 0 {
// temporarily recreate the chart, but keep the deletion timestamp
chartCopy := chart.DeepCopy()
chartCopy.Status.JobName = job.Name
newChart, err := c.helms.Update(chartCopy)
if err != nil {
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName)
}
return newChart, fmt.Errorf("waiting for delete of helm chart for %s by %s", key, job.Name)
}
-> (C) log event
// uninstall job has successfully finished!
c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name)
-> (D) clean object, including the delete job
// note: an empty apply removes all resources owned by this chart
...
return chart, nil
}
Kubernetes takes time to delete an object. In the code above, OnRemove may therefore be called several times (typically twice). On the first call, it creates a job, removes resources, waits, logs the event, and cleans up the job. If it is called again, it repeats those steps. If that second call hits step (B) and returns an error, a helm-delete job is left behind, and it affects the next helm delete. If it exits at the other steps, we are lucky.
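The sequence can be simulated with a toy model (our names, not the real controller code): OnRemove runs twice, and when the requeued pass re-creates the job after cleanup, the completed job is orphaned once the chart object disappears.

```go
package main

import "fmt"

// cluster is a toy model of the relevant state, used only to simulate
// the sequence described above; it is not the controller's actual code.
type cluster struct {
	chartExists bool // the HelmChart object (its deletion can lag)
	jobExists   bool
	jobDone     bool
}

// onRemove simulates one reconcile pass of OnRemove:
// (A) the apply (re)creates the delete job if it is missing,
// (B) if the job has not succeeded yet, return an error to requeue,
// (D) otherwise clean up the job (empty apply) and finish.
func (c *cluster) onRemove() error {
	if !c.jobExists { // (A)
		c.jobExists, c.jobDone = true, false
	}
	if !c.jobDone { // (B)
		return fmt.Errorf("waiting for delete job")
	}
	c.jobExists = false // (D)
	return nil
}

func main() {
	c := &cluster{chartExists: true}

	// First pass: creates the job, requeues while it runs.
	fmt.Println(c.onRemove()) // waiting for delete job
	c.jobDone = true          // the job pod finishes
	fmt.Println(c.onRemove()) // nil error: job cleaned up

	// A queued requeue fires after cleanup but before the HelmChart
	// object is fully gone: (A) recreates the job and (B) requeues...
	fmt.Println(c.onRemove()) // waiting for delete job
	c.jobDone = true
	// ...but the chart disappears now, so OnRemove is never called
	// again and the completed job is orphaned.
	c.chartExists = false
	fmt.Println("orphaned job left:", c.jobExists) // orphaned job left: true
}
```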
In my local environment (a Harvester cluster), I did see the helm-delete job created twice. The second helm-delete job was simply left orphaned: the HelmChart object was already gone, so the wrangler framework will never call OnRemove for that HelmChart object again.
Apr 14 20:27:08 harv2 rke2[1997]: time="2024-04-14T20:27:08Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"
This appears to be a sequencing issue.
In Harvester, we added a workaround, https://github.com/k3s-io/helm-controller/issues/177#issuecomment-1612865053, to delete the potentially leftover job explicitly.
The original helm-controller is a lightweight, handy tool for deploying Helm charts, and in practice a HelmChart is very rarely deleted. When there is a long interval between creating and deleting a HelmChart object, the leftover helm-delete job is removed automatically.
But when it has to handle frequent creation/removal of HelmCharts, some enhancements are required.
@w13915984028 Thank you for your reply.
In Harvester, we added a workaround, #177 (comment), to delete the potentially leftover job explicitly.
It is useful to me.
There is another approach to work around this issue. First, disable the Helm controller option in the k3s service. Then, run the Helm controller as a separate process.
The k3s service as follow:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/usr/lib/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s \
server \
'--disable-helm-controller' \
The command for running helm-controller is as follows. I downloaded the pre-compiled helm-controller binary and made sure it is compatible with the k3s version.
./bin/helm-controller --kubeconfig /etc/rancher/k3s/k3s.yaml
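If the standalone controller should survive reboots, it can itself be run under systemd. A minimal hypothetical unit, where the binary path is an assumption and the kubeconfig location is taken from the command above:

```ini
[Unit]
Description=Standalone helm-controller
After=k3s.service
Wants=k3s.service

[Service]
ExecStart=/usr/local/bin/helm-controller --kubeconfig /etc/rancher/k3s/k3s.yaml
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```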
@up-wei Changing from the built-in system service to a k8s pod or an independent process may change the program's timing and thereby dodge the bug, but I doubt it works in all cases.
I will add a PR that compares the creation timestamp of the job with the deletion timestamp of the helm-chart, to avoid picking up those old jobs.
I added some debug information and got the following logs:
(1) The debug code https://github.com/w13915984028/helm-controller/commit/c0844edd5b220b78ee40e7392ceb9f7ee698a19a
We ran the test several times; the helm-controller logs are:
(1) enable & disable rancher-logging (backed by helmchart); normal
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:23:44 harv41 rke2[29380]: time="2024-04-19T11:23:44Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:25:08 harv41 rke2[29380]: time="2024-04-19T11:25:08Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:25:08 harv41 rke2[29380]: time="2024-04-19T11:25:08Z" level=info msg="Get job 1, no job"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:25:11 harv41 rke2[29380]: time="2024-04-19T11:25:11Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"
Apr 19 11:25:14 harv41 rke2[29380]: time="2024-04-19T11:25:14Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:25:08 +0000 UTC"
(2) enable & disable again; a helm-delete job was re-created and left orphaned
Apr 19 11:26:52 harv41 rke2[29380]: time="2024-04-19T11:26:52Z" level=warning msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:56440: write: connection reset by peer"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:26:55 harv41 rke2[29380]: time="2024-04-19T11:26:55Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
// OnRemove
Apr 19 11:27:41 harv41 rke2[29380]: time="2024-04-19T11:27:41Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:27:41 harv41 rke2[29380]: time="2024-04-19T11:27:41Z" level=info msg="Get job 1, no job"
// helm-delete job is generated by generic.ConfigureApplyForObject
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"
// wait job to finish, till next round
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: waiting for delete of helm chart for cattle-logging-system/rancher-logging by helm-delete-rancher-logging, requeuing"
// OnRemove again
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
Apr 19 11:27:44 harv41 rke2[29380]: time="2024-04-19T11:27:44Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:41 +0000 UTC"
...
// job is done, and log event ....
// due to the slow-deletion of k8s, the `OnRemove` was called again
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
// above job was gone
Apr 19 11:27:47 harv41 rke2[29380]: time="2024-04-19T11:27:47Z" level=info msg="Get job 1, no job"
// create a new job
Apr 19 11:27:50 harv41 rke2[29380]: time="2024-04-19T11:27:50Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"
// update to `chart` object failed, as `chart` object was gone on apiserver, and this job is left orphaned
Apr 19 11:27:50 harv41 rke2[29380]: time="2024-04-19T11:27:50Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, requeuing"
Apr 19 11:37:22 harv41 rke2[29380]: time="2024-04-19T11:37:22Z" level=warning msg="Proxy error: write failed: write tcp 127.0.0.1:9345->127.0.0.1:55328: write: connection reset by peer"
(3) enable & disable again; the old helm-delete job was picked up, and no actual deletion happened
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rancher-logging\" not found, requeuing"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
Apr 19 11:40:14 harv41 rke2[29380]: time="2024-04-19T11:40:14Z" level=info msg="Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging"
// OnRemove
Apr 19 11:41:18 harv41 rke2[29380]: time="2024-04-19T11:41:18Z" level=info msg="OnRemove will try remove job cattle-logging-system helm-delete-rancher-logging"
// the leftover job from above is still here
Apr 19 11:41:18 harv41 rke2[29380]: time="2024-04-19T11:41:18Z" level=info msg="Get job 1, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"
// generic.ConfigureApplyForObject can't identify that this is an old job and that a new one is needed
Apr 19 11:41:21 harv41 rke2[29380]: time="2024-04-19T11:41:21Z" level=info msg="Get job 2, the creationTimeStamp is 2024-04-19 11:27:47 +0000 UTC"
The `helmchart` object is gone, but the real resources are left behind.
Events: three occurrences each of ApplyJob and RemoveJob
harv41:/home/rancher # kk get events -A | grep ApplyJob
cattle-logging-system 31m Normal ApplyJob helmchart/rancher-logging Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging
cattle-logging-system 28m Normal ApplyJob helmchart/rancher-logging Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging
cattle-logging-system 15m Normal ApplyJob helmchart/rancher-logging Applying HelmChart 1 using Job cattle-logging-system/helm-install-rancher-logging
harv41:/home/rancher # kk get events -A | grep RemoveJob
cattle-logging-system 30m Normal RemoveJob helmchart/rancher-logging Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:25:08 +0000 UTC
cattle-logging-system 28m Normal RemoveJob helmchart/rancher-logging Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:27:41 +0000 UTC
cattle-logging-system 14m Normal RemoveJob helmchart/rancher-logging Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging, removing resources, CreationTimestamp 2024-04-19 11:27:47 +0000 UTC
The last event clearly shows that the old job was picked up and used:
- apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2024-04-19T11:41:21Z"
involvedObject:
apiVersion: helm.cattle.io/v1
kind: HelmChart
name: rancher-logging
namespace: cattle-logging-system
resourceVersion: "921436"
uid: 86b7dbc0-23b4-400f-ac49-6ff6a5035af6
kind: Event
lastTimestamp: "2024-04-19T11:41:21Z"
message: Uninstalled HelmChart 1 using Job cattle-logging-system/helm-delete-rancher-logging,
removing resources, CreationTimestamp 2024-04-19 11:27:47 +0000 UTC
// the job CreationTimestamp 2024-04-19 11:27:47 shows this job is old; it is far earlier than the event firstTimestamp "2024-04-19T11:41:21Z"
metadata:
creationTimestamp: "2024-04-19T11:41:21Z"
name: rancher-logging.17c7ab9eac0ea147
namespace: cattle-logging-system
resourceVersion: "921488"
uid: 925e455a-b5f4-40c6-8ff0-11005ae6db07
reason: RemoveJob
reportingComponent: helm-controller
reportingInstance: harv41
source:
component: helm-controller
host: harv41
type: Normal
The pods are left behind:
harv41:/home/rancher # kk get pods -n cattle-logging-system
NAME READY STATUS RESTARTS AGE
harvester-default-event-tailer-0 1/1 Running 0 17m
rancher-logging-77f858bbb7-n7vlm 1/1 Running 0 17m
rancher-logging-kube-audit-fluentbit-dlf52 1/1 Running 0 17m
rancher-logging-kube-audit-fluentd-0 2/2 Running 0 17m
rancher-logging-kube-audit-fluentd-configcheck-ac2d4553 0/1 Completed 0 17m
rancher-logging-rke2-journald-aggregator-499bk 1/1 Running 0 17m
rancher-logging-root-fluentbit-dvvcj 1/1 Running 0 17m
rancher-logging-root-fluentd-0 2/2 Running 0 17m
rancher-logging-root-fluentd-configcheck-ac2d4553 0/1 Completed 0 17m
harv41:/home/rancher # kk get helmchart -A
NAMESPACE NAME JOB CHART TARGETNAMESPACE VERSION REPO HELMVERSION BOOTSTRAP
kube-system rke2-canal helm-install-rke2-canal true
kube-system rke2-coredns helm-install-rke2-coredns true
kube-system rke2-ingress-nginx helm-install-rke2-ingress-nginx false
kube-system rke2-metrics-server helm-install-rke2-metrics-server false
kube-system rke2-multus helm-install-rke2-multus true
harv41:/home/rancher #
enable rancher-logging:
harv41:/home/rancher # enable-rl
addon.harvesterhci.io/rancher-logging patched
harv41:/home/rancher # kk get helmchart -A
NAMESPACE NAME JOB CHART TARGETNAMESPACE VERSION REPO HELMVERSION BOOTSTRAP
cattle-logging-system rancher-logging helm-install-rancher-logging rancher-logging 103.0.0+up3.17.10 http://harvester-cluster-repo.cattle-system.svc/charts
kube-system rke2-canal helm-install-rke2-canal true
kube-system rke2-coredns helm-install-rke2-coredns true
kube-system rke2-ingress-nginx helm-install-rke2-ingress-nginx false
kube-system rke2-metrics-server helm-install-rke2-metrics-server false
kube-system rke2-multus helm-install-rke2-multus true
harv41:/home/rancher #
The error returned (with added debugging) at the step below is:
Apr 19 17:31:16 harv41 rke2[30786]: time="2024-04-19T17:31:16Z" level=error msg="error syncing 'cattle-logging-system/rancher-logging': handler on-helm-chart-remove: unable to update status of helm chart to add uninstall job name helm-delete-rancher-logging, err Operation cannot be fulfilled on helmcharts.helm.cattle.io \"rancher-logging\": StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"
This matches our analysis: the `helm-chart` object in the apiserver is gone, but the controller is still trying to update it with the re-created job.
https://pkg.go.dev/k8s.io/apiserver/pkg/storage#StorageError
That matches the comment on line 267: https://github.com/k3s-io/helm-controller/blob/255f905f22ad5d303a387dae04abc7c3395283dd/pkg/controllers/chart/chart.go#L266-L270
We know that the HelmChart resource has been deleted, but we're trying to update it anyway. We can't update it if it doesn't exist, though, so the comment about calling update to "temporarily recreate the chart" doesn't make sense; that's not something you can do.
This seems to have come from https://github.com/k3s-io/helm-controller/pull/158 - @aiyengar2 can you take a look and make sure we're not missing anything?
My doubt is:
The current OnRemove, shown below, doesn't know whether a helm-delete job has already been created and finished. It does not use any state to record and control this; instead, it simply loops.
By its nature, this can generate orphaned jobs.
OnRemove
{
create job if it does not exist; it does not check chartCopy.Status.JobName (which would indicate a job already exists)
sleep 3s
job not successful? update chart.Status.JobName (without checking whether it is already set), return an error to trigger the next round
job successful? emit event, clean up resources (including the job)
}
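The loop above could instead be guarded by the recorded job name. The following is a simplified sketch with stand-in types (not the real helm-controller API): record the delete job's name before waiting on it, and never create a second job while one is already recorded.

```go
package main

import (
	"errors"
	"fmt"
)

// chart is a stand-in for the HelmChart object; jobName stands in for
// chartCopy.Status.JobName.
type chart struct {
	jobName string
}

var errWaiting = errors.New("waiting for delete job, requeuing")

// onRemove creates a delete job only if none is recorded in the chart
// status; otherwise it requeues until the recorded job succeeds.
func onRemove(c *chart, jobSucceeded func(string) bool, createJob func() string) error {
	if c.jobName == "" {
		// Persist the job name BEFORE waiting, so a requeued call sees it
		// and does not create a duplicate job.
		c.jobName = createJob()
	}
	if !jobSucceeded(c.jobName) {
		return errWaiting
	}
	fmt.Printf("job %s done, cleaning resources\n", c.jobName)
	return nil
}

func main() {
	created := 0
	createJob := func() string {
		created++
		return "helm-delete-rancher-logging"
	}
	done := false
	jobSucceeded := func(string) bool { return done }

	c := &chart{}
	_ = onRemove(c, jobSucceeded, createJob) // first call: creates the job, requeues
	_ = onRemove(c, jobSucceeded, createJob) // requeue: reuses the recorded job
	done = true
	_ = onRemove(c, jobSucceeded, createJob) // job finished: cleans up
	fmt.Println("delete jobs created:", created)
}
```

With this shape, repeated `OnRemove` calls during a slow deletion reuse the one recorded job instead of creating orphans.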
When I run `kubectl delete helmchart <name>`, most of the time I see the helm-delete job is still there, and it seems to be caused by a duplicated helm-delete job created by helm-controller. For example, after running `kubectl delete helmchart nats`, the job and pod for helm delete still exist.
The events show the pod for helm-delete-nats was generated twice.
The logs of the remaining pod are as follows; I think they mean the job could not find the helm release to delete.