Worker process shutdown deadline

airhorns commented 5 years ago

It's great that the que worker process gives itself a bit of time upon receiving SIGTERM to finish the outstanding jobs. If those jobs are short, there's not much sense in throwing away all the work they've done so far if they're just about to complete.

With long running jobs however (like CSV exports -> S3 uploads, data api extracts, etc), the worker process ends up refusing to shut down if it's running any of them. This is more of an ergonomics thing, but, when developing locally, it's annoying to have to kill -9 the que process if it's working on anything long running, and in production the orchestrator can't tell the difference between que being stuck or hung for some actually important issue, or if it was just working on a long running job that would be fine to abort.

I think it'd be great if que killed it's own workers after getting SIGTERM and waiting x seconds so that it could exit cleanly and let developers writing quote-unquote bad long running jobs restart the process easily. See https://github.com/mperham/sidekiq/wiki/Signals#term for sidekiq's config param that accomplishes the same thing.

siegy22 commented 5 years ago

I agree on the pain of kill -9, on the other hand to have an arbitrary waiting time before killing the workers is also not optimal.

What I would preffer is to have the following:

When sending Ctrl+c the first time, try to gracefully shutdown the workers. The second time however, just kill them. This way, you as a user can decide if it's time to really kill them. And this takes away the pain of kill -9.

Que waiting for jobs...
*hits Ctrl+c*
Finishing Que's current jobs before exiting...
Hit Ctrl+c to force shutdown.
*hits Ctrl+c again*

airhorns commented 5 years ago

From a development environment perspective that would work fine for me, but, I forgot to mention the production use case for a feature like this also though, which that would work less well for.

I think it's valuable to know if your workers are shutting down "cleanly", that is, as expected, and right now if you have any long running jobs they have to be killed by the production process orchestrator. In the case of Kubernetes for example, pods that have to be SIGKILL'd after not terminating themselves by the grace period show up in a special state that alerts operators to their unclean shutdown behaviour. K8S sends SIGTERM, then waits a grace period, then sends SIGKILL as opposed to sending two SIGTERMS, so it'd be hard to avoid this scenario of the pods being marked as unclean shutdowns. From my recollection runit and other process monitors work pretty similarly.

So I guess the decision is between developer ergonomics and predictability: double SIGTERM is nicer than the status quo for users where they can make a decision about when to kill again, but forcible job killing after a timeout is better in production where there is no user to make a decision about when to kill such that it needs to be configured. Agree?

siegy22 commented 5 years ago

I do agree. I feel like the nicest behaviour would be to be able to hit double Ctrl + c but the workers also automatically get killed after let's say 30 seconds. So you have the choice in development mode to force shutdown right away, and in production you can send one SIGTERM and after 30 seconds the system should shutdown.

lawrencejones commented 4 years ago

Fwiw, we've been running a pretty old fork of Que where we had to support this type of shutdown.

We decided to add a flag to the Que binary, --timeout , that specified the maximum time you should wait for Que jobs to stop. Upon receiving SIGTERM, we immediately request that a worker stops processing new work.

Our worker scheduler then waits on each thread to finish, timing out after timeout. If we hit a timeout, we'll call Thread.raise on the worker thread, which allows the worker to mark the job it was currently working as having failed due to a timeout, and make the job available for a retry whenever a new worker appears.

You can see the implementation of our stop sequence here: https://github.com/gocardless/que/blob/master/lib/que/worker_group.rb#L29-L60

This managed to make our job timeouts visible to developers, ensuring jobs that hit the timeout were reported upon correctly. Our retry strategies could safely ignore JobTimeoutError too. In all, the addition of this feature was really successful and the caveats with the unsafe Thread.raise API definitely a lesser evil that whatever scheduler you're using (Kubernetes, docker, etc) silently terminating your jobs midway.

airhorns commented 4 years ago

That patch looks awesome to me! I think the Thread.raise concerns are real but as noted probably fine because the process is in the process of ending anyways. @lawrencejones do you run Que with a timeout slightly shorter than the scheduler's termination grace period?

I also feel like its wise to put the behaviour under a command line flag that isn't set by default to preserve the old default behaviour.

lawrencejones commented 4 years ago

Yep, we ensure all workers have a timeout set to be about 5s less than the kubernetes graceful termination period. That way you give enough space for the job to do it's error handling and finish independently, unless something really odd has happened in which case you're probably not going to successfully work it anyway.

siegy22 commented 4 years ago

@lawrencejones So you have been using this fork with this shutdown feature in production?

lawrencejones commented 4 years ago

Yep, I think we've processed upwards of 1B jobs through this implementation. The only caveat is that stack traces can occasionally be strange, but the added visibility of jobs being terminated is well worth it.

Also, most jobs should retry on this error automatically, and for retryable errors you tend not to alert anyone. So the noise goes unnoticed.

1c7 commented 3 years ago

Hi, any update? set deadline or timeout on a job.

que-rb / que

Worker process shutdown deadline #254