Open pkajaba opened 4 years ago
Not unless we patch clean_registries() to also check for worker PIDs to determine if a job is still being worked on. Would love to have a PR for this.
Would we have to introduce a worker_id:job_id mapping, or can it be done without one?
We should store it in Redis as part of the Job's data. It should be something like job.worker_pid.
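As a rough illustration of the PID check discussed above, here is a minimal sketch that tests whether a stored PID still belongs to a running process. It is not part of RQ: pid_is_alive is a hypothetical helper, it is POSIX-only, and it only works when the checker runs on the same host (and in the same PID namespace) as the worker, which is often not the case with containers.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host.

    Hypothetical helper, POSIX only. Signal 0 performs the existence and
    permission checks without actually delivering a signal.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No process with this PID.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A PID can be reused by an unrelated process after the worker dies, so a check like this is only a hint and would need to be combined with something else, such as the worker's heartbeat.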
I think this is relevant for jobs that get lost when a worker is abruptly terminated. A worker can be killed by Docker at any time without warning, so if the worker was running a long job at that moment, the job wouldn't register as failed and wouldn't be retried, I think.
RQ has a process that sweeps started jobs and checks whether they’re still alive after their timeout has ended. Jobs that are harvested this way will be marked as failed and retried (if set).
I wonder if there is any update on when this issue might be resolved (e.g. by #1372), or if some additional contributions might be needed. Getting a fix for this in place would represent a significant quality-of-life improvement for one of my projects, since container eviction can currently result in users' long-running tasks spending a significant amount of time in a "zombie state" before the failure is detected and rectified.
If we want to detect long running jobs abnormally terminating early, we'll need to have a way to check whether the worker running the job is still alive.
One way to do this would be to check the worker running the job's last heartbeat. I'd be happy to accept a PR for this.
How I think we should do this:
1. Add job.worker_is_still_alive(), which returns whether the worker running the job is still active.
2. Add an option to rq worker that tells it to also check the status of long-running jobs when running maintenance tasks.
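The heartbeat-based check proposed above could look roughly like the sketch below. It is a standalone function rather than a method on Job, and it makes several assumptions: the caller has already looked up the worker's last heartbeat timestamp, and max_silence (a hypothetical parameter with an arbitrary 90-second default) is the longest gap tolerated before the worker is considered dead.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def worker_is_still_alive(
    last_heartbeat: Optional[datetime],
    now: Optional[datetime] = None,
    max_silence: timedelta = timedelta(seconds=90),  # assumed threshold
) -> bool:
    """Return True if the worker's last heartbeat is recent enough.

    last_heartbeat: timezone-aware time of the worker's last heartbeat,
        or None if no heartbeat was ever recorded.
    now: current time, injectable for testing; defaults to the current UTC time.
    """
    if last_heartbeat is None:
        # No heartbeat on record: treat the worker as gone.
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_heartbeat <= max_silence
```

A maintenance task could then walk the started jobs, call this for each job's worker, and mark the job as failed when it returns False instead of waiting for the job timeout to expire.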
When I run rq worker in a Docker container and the container is killed with docker kill, the worker still appears in Redis as working.
Is there any way to mark a started job as failed without waiting for it to reach its timeout? I tried calling clean_registries on worker start, but it does not seem to do the trick. Thanks!