helm / helm

The Kubernetes Package Manager
https://helm.sh
Apache License 2.0

Helm should delete job after it is successfully finished #1769

Closed dzavalkinolx closed 7 years ago

dzavalkinolx commented 7 years ago

I have a job bound to post-install,post-upgrade,post-rollback:

    "helm.sh/hook": post-install,post-upgrade,post-rollback

When I run a chart upgrade I get Error: UPGRADE FAILED: jobs.batch "opining-quetzal-dns-upsert" already exists. kubectl get jobs returns:

opining-quetzal-dns-upsert   1         1            12m

So how are we supposed to use jobs as hooks if they are not deleted after finishing successfully? There is no way to upgrade a chart that contains such a job.

thomastaylor312 commented 7 years ago

I would love to see this and I will try to get around to submitting a PR for it if I have the time. In the meantime, I tag this on to the end of the job name as a workaround:

metadata:
  name: {{ template "fullname" . }}-{{ randAlphaNum 5 | lower }}
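In context, that workaround looks roughly like this in the hook Job's metadata (the hook list is the one from the report above; the fullname template comes from the chart):

apiVersion: batch/v1
kind: Job
metadata:
  # a fresh name on every release avoids the "already exists" conflict,
  # at the cost of leaving one completed Job behind per run
  name: {{ template "fullname" . }}-{{ randAlphaNum 5 | lower }}
  annotations:
    "helm.sh/hook": post-install,post-upgrade,post-rollback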
dzavalkinolx commented 7 years ago

I'm currently using the workaround from @thomastaylor312 and it is a very, very, very bad way to do it. There are two issues:

  1. If the job doesn't finish successfully (for whatever reason), it will be rescheduled again and again. I think we need something like a maxRetry parameter here.
  2. If the chart is deleted, the job is not deleted by Helm, so it will continue to be scheduled and fail indefinitely.

As a result I've just run into a performance issue with a storm of kill container / start container events.

thomastaylor312 commented 7 years ago

@dzavalkinolx: We should probably bring this up in the kubernetes issues. That isn't so much a problem with Helm as it is with how Kubernetes Jobs work. Have you also tried setting restartPolicy: OnFailure so that it just restarts the container instead of creating a new pod?
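For reference, a minimal sketch of where that setting lives on a Job; the name and image below are placeholders, not taken from the chart above:

apiVersion: batch/v1
kind: Job
metadata:
  name: dns-upsert
spec:
  template:
    spec:
      # OnFailure restarts the failed container in the same pod
      # instead of the Job controller creating new pods
      restartPolicy: OnFailure
      containers:
      - name: dns-upsert
        image: example/dns-upsert:latest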

technosophos commented 7 years ago

This is a tricky one. Helm doesn't implicitly manage the lifecycles of things like this. In particular, it never deletes a user-created resource without receiving an explicit request from a user.

Now, we recently introduced an annotation called resource-policy that we could use for something like this. Currently, the only defined resource policy is keep, which is used to tell Tiller not to delete that resource on a normal deletion request. I suppose we could implement another policy with something like delete-on-completion that deleted a Pod or Job on completion.

Since Tiller does not actively monitor resources once they are deployed, I'm not sure this would be a terribly powerful annotation, but it could work on hooks because we do watch hooks for lifecycle events.
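For illustration, the existing keep policy looks like this on a resource; a delete-on-completion value like the one floated above would be a new, not-yet-implemented policy:

metadata:
  annotations:
    # tells Tiller not to delete this resource on a normal delete request
    "helm.sh/resource-policy": keep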

longseespace commented 7 years ago

@dzavalkinolx Have you tried activeDeadlineSeconds?
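For anyone who hasn't used it, activeDeadlineSeconds sits on the Job spec and caps how long the Job may stay active; the name, image, and 600-second value in this sketch are only placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: dns-upsert
spec:
  # the Job and its pods are terminated if still active after 10 minutes
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dns-upsert
        image: example/dns-upsert:latest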

thomastaylor312 commented 7 years ago

@longseespace I tried that too, but it just kills the job after that amount of time and spins up a new one

javiercr commented 7 years ago

I ran into this problem today too. I found out about Helm hooks recently and I thought they would be perfect for implementing pre-upgrade database migrations (rake db:migrate, for those of you familiar with Rails). But then I found that Helm does not clean up completed jobs, so this doesn't work in a CI environment.

I'm probably missing something, but what's the point of having a job executed by a pre-upgrade / post-upgrade hook if, once the job has run on the first helm upgrade, the next upgrade will fail?

technosophos commented 7 years ago

That's why we recommend appending a random string to a job name if you know for sure you are going to re-run the job.

javiercr commented 7 years ago

I see. I'm not sure I like the approach of leaving behind a new completed job for every deployment. Right now what I've done is add a step to our deployment script that runs kubectl delete job db-migrate-job --ignore-not-found before our helm upgrade.

johnw188 commented 7 years ago

It feels like helm should be managing the full lifecycle of its hooks, as they're documented as the approach to take when it comes to executing lifecycle events such as migrations. Someone without a strong understanding of kubernetes could end up with hundreds of useless jobs.

I almost feel like the ideal solution here would be to have Tiller delete an old job when it hits a name conflict with a hook job it attempts to create. You could verify that the existing job is managed by Helm by ensuring that the correct hook annotation is present. You could also put this behavior behind a command-line flag, such as --overwrite-hooks, with a better error message for users:

Error: UPGRADE FAILED: jobs.batch "opining-quetzal-dns-upsert" already exists
Rerun your upgrade with --overwrite-hooks to automatically replace existing hooks
docmatrix commented 7 years ago

I am having the exact same challenges with a Python / Django app. Does anyone have a mechanism such that Helm will abort an upgrade if the pre-upgrade job fails?

thomastaylor312 commented 7 years ago

I may take this once I get some other work done for 2.5. Assigning it to myself for now; if someone else wants to take it before I work on it, let me know.

DoctorZK commented 7 years ago

@thomastaylor312 Have you finished this feature yet? If not, I would like to take this work.

thomastaylor312 commented 7 years ago

@DoctorZK Feel free to take it. Thank you for offering to do it!

gianrubio commented 7 years ago

What about a simple annotation, helm.sh/resource-policy: delete-job-after-run? When the job runs successfully, Helm deletes the job.

DoctorZK commented 7 years ago

Good suggestion. I have thought of two approaches to solve this problem.

  1. Add a simple annotation in the hook templates, which is the easiest to implement. However, it cannot solve this kind of problem: Helm fails during the pre-install/pre-upgrade process, but the user tries to install/upgrade the release with the same chart again, which causes a resource-object name conflict in Kubernetes. Therefore, with this approach we would also need another annotation, such as helm.sh/resource-policy: delete-job-if-job-fails.

  2. Add flags to the install/upgrade/rollback/delete commands (e.g., upgrade $release_name --include-hooks). This can solve the name-conflict problem; however, it will also remove some kinds of hooks that users do not intend to delete, such as ConfigMaps and Secrets that are designed to be reused by different versions of the same release.

I prefer the first one, which controls hooks at a finer granularity.

gianrubio commented 7 years ago

@DoctorZK as you suggest, I would like to have another annotation, helm.sh/resource-policy: delete-job-if-succeed. This is important when you're deleting a release and have a cleanup job; right now it's not possible to delete that job without running another job.

Are you willing to work on this? If not, I can take care of it.

DoctorZK commented 7 years ago

Thanks for your help. I have finished the coding and it is now under test. I will submit the pull request as soon as possible.

libesz commented 7 years ago

For those who want a workaround until the final solution is implemented: instead of appending random characters to the Job object's name, the Job can delete itself from the API server as its last task before exiting. It is equally "elegant" but does not leak Job objects, though it comes with its own sad facts.
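A rough sketch of that approach, assuming a hypothetical job-cleaner ServiceAccount that has been granted permission to delete Jobs and an image that ships kubectl; both are extra moving parts this workaround requires:

apiVersion: batch/v1
kind: Job
metadata:
  name: self-cleaning-job
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  template:
    spec:
      # needs a Role/RoleBinding that allows delete on jobs.batch
      serviceAccountName: job-cleaner
      restartPolicy: Never
      containers:
      - name: task
        image: bitnami/kubectl:latest  # any image that contains kubectl
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        command:
        - /bin/sh
        - -c
        - |
          echo "real work goes here" &&
          kubectl delete job self-cleaning-job -n "$POD_NAMESPACE"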

paulwalker commented 7 years ago

Can we get a quick run-down here on how this works while documentation is still forthcoming?

paulwalker commented 7 years ago

Specifically, I am interested in understanding how to add, run, monitor, and update a job in an existing Helm deployment. Thx!

DoctorZK commented 7 years ago

To execute a job, you can define it as a pre-upgrade/post-upgrade hook and upgrade the release.

paulwalker commented 7 years ago

OK, I was having an issue with a post-upgrade job continually failing (my issue, obviously) and restarting despite restartPolicy: Never. I'm sorry if I'm not comprehending this thread, but am I to understand that that behavior is now "fixed" in the canary build, or do I need to set some sort of configuration?

Here is my current configuration:

{{- if .Values.job.enabled -}}
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{.Release.Name}}"
  labels:
    heritage: {{.Release.Service | quote }}
    release: {{.Release.Name | quote }}
    chart: "{{.Chart.Name}}-{{.Chart.Version}}"
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  template:
    metadata:
      name: "{{.Release.Name}}"
      labels:
        heritage: {{.Release.Service | quote }}
        release: {{.Release.Name | quote }}
        chart: "{{.Chart.Name}}-{{.Chart.Version}}"
    spec:
      restartPolicy: Never
      containers:
      - name: importer
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        command: ["{{.Values.job.binary}}","{{.Values.job.cmd}}"]
{{- end -}}
thomastaylor312 commented 7 years ago

There is a new annotation for it. The hook docs have been updated on master

ptagr commented 7 years ago

Documented here: https://github.com/kubernetes/helm/blob/master/docs/charts_hooks.md
Adding this annotation worked for me: "helm.sh/hook-delete-policy": hook-succeeded
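Putting it together, a minimal hook Job header with the new annotation might look like this (the name is a placeholder):

apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-db-migrate"
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    # remove the Job from the cluster once it has completed successfully
    "helm.sh/hook-delete-policy": hook-succeeded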

k8vance88 commented 7 years ago

I downloaded Helm v2.6.2, looked at the source, and "helm.sh/hook-delete-policy" doesn't seem to be in there. Do you know when this will make it into a release?

thomastaylor312 commented 7 years ago

@k8vance88 This will be landing in 2.7. We will be releasing an RC of it soon once we have the k8s 1.8 support merged in

DonMartin76 commented 7 years ago

@thomastaylor312 Did this land in 2.7?

bacongobbler commented 7 years ago

yes, everything currently in master landed in 2.7.

macropin commented 7 years ago

Just to be clear... if I add "helm.sh/hook-delete-policy": hook-succeeded to my job, then every deployment should recreate and re-run that job? Because that's not what I'm seeing here.

thomastaylor312 commented 7 years ago

@macropin Is your hook defined as post-install,post-upgrade (or the pre equivalents, as the case may be)? If it only has post-install, it will only run the first time.

macropin commented 7 years ago

@thomastaylor312 It's defined as post-install,post-upgrade, so you're saying it should be working?

thomastaylor312 commented 7 years ago

It will create a new object (generally a Pod or Job) each time you release. If you use the feature mentioned here, it will delete the job when it is done running. If for some reason a hook isn't deploying, it would be a separate issue

macropin commented 7 years ago

The job has the following annotations:

      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "5"
        "helm.sh/hook-delete-policy": hook-succeeded,hook-failed

The job only runs once on the first install, and never again on subsequent upgrades. Running Helm v2.7.0. Should I create a separate issue for this?

thomastaylor312 commented 7 years ago

@macropin Yes. Could you please create another issue with details about your cluster and, if possible, an example chart that duplicates the issue

sohel2020 commented 5 years ago

The job has the following annotations:

      annotations:
        "helm.sh/hook": post-install,post-upgrade
        "helm.sh/hook-weight": "5"
        "helm.sh/hook-delete-policy": hook-succeeded,hook-failed

The job only runs once on the first install, and never again on subsequent upgrades. Running Helm v2.7.0. Should I create a separate issue for this?

@macropin How did you solve it? I'm facing a similar issue: the job is never created on subsequent upgrades. I'm using the same annotations as yours.

helm version: v2.14.3

guice commented 4 years ago

@sohel2020 and @macropin - Same boat, Helm v3. The job is never re-run on subsequent upgrades.

thecrazzymouse commented 4 years ago

What is the solution for rerunning jobs on helm v3?

renepardon commented 4 years ago

Same problem with Helm version: version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}

Jobs are neither deleted nor run on subsequent upgrade commands.

jabdoa2 commented 4 years ago

We regularly hit this one too in Helm 3.1.

paologallinaharbur commented 3 years ago

In the future I believe we will also be able to rely on Job TTL.
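That refers to the Kubernetes TTL-after-finished mechanism, i.e. spec.ttlSecondsAfterFinished on the Job. A rough sketch with placeholder name, image, and value, assuming a cluster where the TTL controller is available:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  # the TTL controller deletes the finished Job (and its pods) 100s after completion
  ttlSecondsAfterFinished: 100
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: example/app:latest
        command: ["rake", "db:migrate"]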

schollii commented 2 years ago

@paologallinaharbur only partially: for example, if the job TTL is 5 minutes and you push a new commit before it expires while the previous commit's job failed and is still there, you will hit the same issue.

The two options that have worked for me:

  1. before a helm upgrade, run kubectl delete job;
  2. use the "helm.sh/hook-delete-policy": hook-succeeded,hook-failed policy, which is a better approach than item 1, BUT it drops the logs of a failed job, which can be detrimental for troubleshooting in some cases

The best option would be to have a hook policy that is applied before a hook is run, e.g. something like:

annotations:
  "helm.sh/hook-delete-policy": previous-hook-failed
joelmathew003 commented 10 months ago

What is the solution for rerunning jobs on helm v3?

@thecrazzymouse Were you able to find a solution?