rq / rq

Simple job queues for Python
https://python-rq.org

Jobs are marked as started when docker container is killed #1164

Open pkajaba opened 4 years ago

pkajaba commented 4 years ago

When I am running an rq worker in a Docker container and the container is killed with docker kill, the worker still appears to be registered in Redis as working.

Is there any way to mark a started job as failed without waiting for it to reach its timeout? I tried calling clean_registries on worker start, but it does not seem to do the trick.

Thanks!
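For context, here is a minimal way to observe the behavior described above. This is only an illustrative sketch, assuming a local Redis instance and a queue named "default"; adjust names to your setup.

```python
# Sketch: inspect the StartedJobRegistry after the worker's container is killed.
# Assumes a local Redis instance and a queue named "default".
from redis import Redis
from rq import Queue
from rq.registry import StartedJobRegistry, clean_registries

connection = Redis()
queue = Queue("default", connection=connection)
registry = StartedJobRegistry(queue=queue)

# The job the dead worker was running still shows up as "started".
print(registry.get_job_ids())

# clean_registries() only moves jobs whose timeout has already expired,
# so a long-running job stays in the started registry until then.
clean_registries(queue)
print(registry.get_job_ids())
```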

selwin commented 4 years ago

Not unless we patch clean_registries() to also check for worker PIDs to determine if a job is still being worked on. Would love to have a PR for this.

pkajaba commented 4 years ago

Would we have to introduce a worker_id:job_id mapping, or can it be done without one?

selwin commented 4 years ago

We should store it in Redis as part of the job's data. It should be something like job.worker_pid.
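A rough sketch of that idea follows. The names are hypothetical (rq has no worker_pid field today; job.meta is used here as a stand-in), and the liveness probe is only meaningful when the checking process can see the worker's PID, i.e. not across container or PID-namespace boundaries.

```python
import errno
import os


def record_worker_pid(job):
    # Hypothetical: persist the worker's PID alongside the job's data in Redis.
    job.meta["worker_pid"] = os.getpid()
    job.save_meta()


def worker_is_alive(job):
    # Hypothetical liveness check: signal 0 probes the process without killing it.
    # Only valid if this process shares a PID namespace with the worker.
    pid = job.meta.get("worker_pid")
    if pid is None:
        return False
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM means the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True
```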

levkk commented 3 years ago

I think this is relevant for jobs that get lost when a worker is abruptly terminated. A worker can be killed by Docker at any time without warning, so if it was running a long job at that moment, the job wouldn't register as failed and wouldn't be retried, I think.

selwin commented 3 years ago

RQ has a process that sweeps started jobs and checks whether they’re still alive after their timeout has ended. Jobs that are harvested this way will be marked as failed and retried (if set).
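For illustration, that sweep roughly corresponds to the started registry's cleanup. A simplified sketch, again assuming a local Redis instance and a queue named "default":

```python
# Sketch of the timeout-based sweep described above (simplified).
from redis import Redis
from rq import Queue
from rq.registry import FailedJobRegistry, StartedJobRegistry

connection = Redis()
queue = Queue("default", connection=connection)
started = StartedJobRegistry(queue=queue)
failed = FailedJobRegistry(queue=queue)

# cleanup() moves jobs whose timeout has expired out of the started registry;
# expired jobs end up in the FailedJobRegistry and can be retried if configured.
started.cleanup()
print("still started:", started.get_job_ids())
print("failed:", failed.get_job_ids())
```

Until the timeout expires, though, a job orphaned by a killed container stays in the started registry, which is the gap this issue is about.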

noahg2 commented 3 years ago

I wonder if there is any update on when this issue might be resolved (e.g. by #1372), or whether additional contributions are needed. A fix would be a significant quality-of-life improvement for one of my projects, since container eviction can currently leave users' long-running tasks in a "zombie state" for a long time before the failure is detected and rectified.

selwin commented 3 years ago

If we want to detect long-running jobs that terminate abnormally early, we'll need a way to check whether the worker running the job is still alive.

One way to do this would be to check the last heartbeat of the worker running the job. I'd be happy to accept a PR for this.

How I think we should do this:

  1. Create a job.worker_is_still_alive() method that returns whether the worker running the job is still active
  2. Introduce a command-line option to rq worker that tells it to also check long-running jobs' status when running maintenance tasks
  3. If a job's worker is dead, move the job to the FailedJobRegistry (see the sketch below)
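A rough sketch of what such a check could look like. The helper name and the heartbeat grace period are assumptions (rq does not currently expose worker_is_still_alive()), and it relies on job.worker_name being set and on last_heartbeat being a naive UTC datetime, as rq records it.

```python
from datetime import datetime, timedelta

from rq.worker import Worker

# Hypothetical grace period: how stale a worker heartbeat may be before the
# worker (and the job it was running) is considered dead.
HEARTBEAT_GRACE = timedelta(seconds=90)


def worker_is_still_alive(job, connection):
    """Hypothetical stand-in for the proposed job.worker_is_still_alive().

    Looks up the worker that picked up the job and compares its last
    heartbeat against a grace period.
    """
    worker_name = getattr(job, "worker_name", None)
    if not worker_name:
        return False
    for worker in Worker.all(connection=connection):
        if worker.name == worker_name and worker.last_heartbeat is not None:
            return datetime.utcnow() - worker.last_heartbeat < HEARTBEAT_GRACE
    return False
```

During the worker's maintenance tasks, jobs whose worker fails this check could then be removed from the StartedJobRegistry and moved to the FailedJobRegistry, per step 3 above.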