helm / helm

The Kubernetes Package Manager
https://helm.sh
Apache License 2.0

Using the "helm.sh/resource-policy" annotation causes automated rollbacks to consistently fail the first time they are attempted #13142

Open gibsondan opened 2 months ago

gibsondan commented 2 months ago

We have a secret defined as follows with the "helm.sh/resource-policy" annotation, so that Helm leaves it around even after it would normally remove it:

apiVersion: v1
kind: Secret
metadata:
  name: my-secret-{{ .Values.version }}
  labels:
    app.kubernetes.io/managed-by: {{ .Release.Service }}
  annotations:
    "helm.sh/resource-policy": keep

We are finding that the presence of this annotation reliably prevents automated rollbacks from working. If a command like

helm upgrade our-cloud . --atomic --install --debug --timeout 5m --values ./values.yaml

times out and the --atomic flag triggers a rollback, we see the following error about being unable to find the secret in our logs, 100% of the time. We are certain that the secret does in fact exist in the cluster: it was never removed by the previous Helm deploy because of the annotation, so it may no longer be in Helm's known list of secrets, but it is present:

wait.go:50: [debug] beginning wait for 43 resources with timeout of 5s
...
upgrade.go:476: [debug] warning: Upgrade "our-cloud" failed: context deadline exceeded
upgrade.go:494: [debug] Upgrade failed and atomic is set, rolling back to last successful release
history.go:56: [debug] getting history for release our-cloud
rollback.go:65: [debug] preparing rollback of our-cloud
rollback.go:131: [debug] rolling back our-cloud (current: v6292, target: v6291)
rollback.go:72: [debug] creating rolled back release for our-cloud
rollback.go:78: [debug] performing rollback of our-cloud
client.go:393: [debug] checking 43 resources for changes
...
rollback.go:195: [debug] warning: Rollback "our-cloud" failed: no Secret with the name "my-secret-780bdef1-9ba1f795" found
Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: context deadline exceeded: no Secret with the name "my-secret-780bdef1-9ba1f795" found
helm.go:84: [debug] no Secret with the name "my-secret-780bdef1-9ba1f795" found

This happens 100% of the time if we have the "helm.sh/resource-policy" annotation set on that secret, but never happens if we remove the annotation. Interestingly, the error also goes away if we then attempt a manual rollback to the same revision that the automated rollback attempted - so it is only the first rollback attempt to a given revision that fails with this "secret not found" error.

Similarly, if we disable --atomic but keep --wait and --timeout so that the upgrade fails without automatically rolling back, the first manual rollback we attempt to the previous revision fails with the same "secret not found" error, but rolling back to that same revision again succeeds, without us making any changes.

It is possible that this is the same underlying issue as https://github.com/helm/helm/issues/12436 - maybe the annotation causes helm to treat this secret similarly to a resource that was created manually.

Output of helm version: 3.15.2

Output of kubectl version:

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b

Cloud Provider/Platform (AKS, GKE, Minikube etc.):

mattfarina commented 2 months ago

I've not tried to reproduce it yet. I've looked over the code quickly, and I don't see what would be causing this. Marking it as a bug; the next step is to try to reproduce it.

dayeguilaiye commented 1 month ago

@mattfarina I've reproduced this issue.

  1. Execute helm install testChart /path/to/chartv1 --wait --timeout=10s, where chartv1 contains the resources below; the command succeeds.
    • a normal deployment
    • a configmap with "helm.sh/resource-policy": keep
  2. Execute helm upgrade testChart /path/to/chartv2 --wait --timeout=10s, where chartv2 contains the resources below; the command fails.
    • a broken deployment (referencing an image that does not exist) to make the upgrade fail
    • NO configmap anymore
  3. Execute helm rollback testChart, which fails with Error: no ConfigMap with the name "aaa" found
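For concreteness, the chart templates for the steps above could be sketched roughly as follows. The configmap name "aaa" is taken from the error message; the deployment name, labels, and image tags are illustrative, not from the original report:

```yaml
# chartv1/templates/ (step 1): a working deployment plus a kept configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: aaa
  annotations:
    "helm.sh/resource-policy": keep
data:
  key: value
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: app
          image: nginx:1.25  # a valid image, so install succeeds

# chartv2/templates/ (step 2): the configmap is removed entirely, and the
# deployment references a nonexistent image tag so that --wait times out,
# e.g.:
#   image: nginx:this-tag-does-not-exist
```

Because of the keep annotation, the configmap survives the upgrade in the cluster, but chartv2's release manifest no longer tracks it, which sets up the rollback failure in step 3.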

After reading the Helm code, I found the reason:

https://github.com/helm/helm/blob/a958f3a9b65df7c809c88c180cc89886fee7768c/pkg/kube/client.go#L394-L429

When executing the rollback command, Helm calls the Update(original, target ResourceList, force bool) function. The target ResourceList contains the aaa configmap, while the original ResourceList does not.

Update then calls helper.Get, which returns a nil error, because the aaa configmap really does exist in the cluster.

But when it calls original.Get(info), the aaa configmap is not in original, so it hits the "not found" error.

I think that when helper.Get succeeds but original.Get fails, maybe we should use the result of helper.Get instead of returning an error.

The code could look something like this:

var originalObject runtime.Object
if got, err := helper.Get(info.Namespace, info.Name); err != nil {
    ...
} else {
    // Fall back to the live object fetched from the cluster.
    originalObject = got.Object
}
// Prefer the copy tracked in the original release, if it exists.
if originalInfo := original.Get(info); originalInfo != nil {
    originalObject = originalInfo.Object
}
if err := updateResource(c, info, originalObject, force); err != nil {
...
I'm not sure if this is a good solution. If you think this might work, you can assign this issue to me and I'll raise a PR to fix it.