hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Purging a parameterized job does not purge or unlink children jobs #6970

Open DingoEatingFuzz opened 4 years ago

DingoEatingFuzz commented 4 years ago

Nomad version

0.10.2

Operating system and Environment details

MacOS, dev agent

Issue

When a parameterized job is purged, all of its child jobs still keep parent references to the now-purged job. This creates broken references, which complicates any kind of job traversal.

Reproduction steps

  1. Run any ol' parameterized job.
  2. Dispatch some instances (children) of the parameterized job
  3. Purge the parameterized job (nomad stop -purge my-parameterized-job)
  4. Observe that the child job is still there in the CLI and API responses.
ID                                     Type                 Priority  Status 
geocoder                               batch/parameterized  50        running  
geocoder/dispatch-1579658083-5adaf751  batch                50        dead     

becomes

ID                                     Type                 Priority  Status 
geocoder/dispatch-1579658083-5adaf751  batch                50        dead     

with an API response including

  "ID": "geocoder/dispatch-1579658083-5adaf751",
  "ParentID": "geocoder",
  "Name": "geocoder/dispatch-1579658083-5adaf751",

What was expected

One of two things should have happened.

1. The child job should have also been purged

Since the job was already in a terminal state, this would have had the same effect as a GC, and it would have kept the job graph tidy.

This gets more complicated when there are running instances of the parameterized job, but hey, purge means purge, right?

2. The child job should have been unreferenced from the parent

As part of the purge, the children of a job can be walked and unlinked from the parent. This is just a change in metadata. Child jobs are still just jobs as far as the scheduler is concerned, but in this way, the job graph isn't left in a broken state.
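Until either behavior exists, the broken references can at least be detected from the job listing. The following is a minimal sketch, not something Nomad ships: `find_orphans` is a hypothetical helper name, and it assumes child IDs always have the form `<parent>/dispatch-<suffix>`, as in the listings from the reproduction steps. It reads `nomad status`-style rows on stdin and prints every dispatched child whose parent ID is missing from the same listing.

```shell
#!/bin/bash
# find_orphans (hypothetical helper): read `nomad status`-style rows on stdin
# and print the IDs of dispatched children whose parent is not in the listing.
# Assumption: child IDs look like "<parent>/dispatch-<suffix>".
find_orphans() {
  awk '
    NR == 1 && $1 == "ID" { next }   # skip the header row
    NF == 0 { next }                 # skip blank lines
    { ids[$1] = 1; order[++n] = $1 } # remember every job ID in input order
    END {
      for (i = 1; i <= n; i++) {
        id = order[i]
        pos = index(id, "/dispatch-")
        if (pos == 0) continue       # not a dispatched child
        parent = substr(id, 1, pos - 1)
        if (!(parent in ids)) print id
      }
    }
  '
}

# Usage against a live cluster (not run here): nomad status | find_orphans
```

The helper only reports IDs and does not stop or purge anything, so its output can be reviewed before deciding what to do with the orphaned jobs.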

mwantia commented 2 years ago

Was this forgotten? Are there any changes or plans for this bug? It honestly looks kind of embarrassing to suddenly see over 2000 dead jobs and have no way to remove them...

Edit: For anyone who might face the same problem and doesn't want to purge every job by hand, you should be able to purge all of them with this small script:

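#!/bin/bash
# Purge every job whose ID starts with the name passed as the first argument.
# The name is matched as a literal prefix, so it also matches the parent itself.
parent="$1"
nomad status | awk -v p="$parent" 'index($1, p) == 1 { print $1 }' | while read -r job
do
   nomad stop -purge "$job"
done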

Save it as a file (for example purge-periodic-jobs.sh), make it executable, and pass the name of the parent job as the first argument. Example: ./purge-periodic-jobs.sh name-of-the-batch-job-to-purge

josegonzalez commented 2 years ago

I'm still seeing this in Nomad 1.2.8+ent. We use namespaced deploys to allow for review-app style testing and thus we have a ton of child jobs that we need to purge now.

Allan-Nava commented 6 months ago

Is it possible to prevent the purge of the job?