Closed majorov-daniil closed 1 month ago
I understand that this might be a niche scenario
No it's a fully fledged bug you've found, and it deserves being fixed as much as any other bug.
So from my understanding, a minimal reproduction code would be:

import asyncio

import procrastinate

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

@app.task(pass_context=True, retry=True)
async def mytask(context):
    await app.job_manager.cancel_job_by_id_async(context.job.id)
    0 / 0  # raise an error so that the retry machinery kicks in

async def main():
    await mytask.defer_async()
    await app.run_worker_async()

asyncio.run(main())

(note: haven't tested this)
There are 2 things we should do:

- This case should be expected and we should react appropriately
- And it's probably not expected that the error that triggers `Job was not found or not in "doing" status` crashes the worker (add the above example as a test)

I will provide a fix as soon as possible. It should be enough to also allow aborting in the `procrastinate_retry_job` function. This will need another migration. Is there anything else you'd be able to think of?
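To illustrate the direction of the fix, here is a hedged, self-contained sketch (all names are hypothetical, not procrastinate's actual API): the retry step tolerates a job whose status changed concurrently instead of crashing the whole worker.

```python
# Hypothetical sketch: a retry step that tolerates a job whose status
# changed underneath it (e.g. to "aborting") instead of crashing the
# whole worker. None of these names are procrastinate's real API.

class JobNotFound(Exception):
    """Raised when the job is not in the expected status anymore."""

def retry_job(jobs: dict, job_id: int) -> str:
    # Mimics procrastinate_retry_job: only "doing" jobs may be retried.
    if jobs.get(job_id) != "doing":
        raise JobNotFound(f'Job {job_id} was not found or not in "doing" status')
    jobs[job_id] = "todo"
    return jobs[job_id]

def safe_retry(jobs: dict, job_id: int) -> str:
    # Worker-side handling: react to the race instead of crashing.
    try:
        return retry_job(jobs, job_id)
    except JobNotFound:
        # The job was aborted (or removed) concurrently; mark it aborted.
        jobs[job_id] = "aborted"
        return jobs[job_id]

jobs = {1: "doing", 2: "aborting"}
print(safe_retry(jobs, 1))  # job still "doing": normal retry -> "todo"
print(safe_retry(jobs, 2))  # status changed concurrently -> "aborted"
```

Whether the concurrent-change case should end in `aborted`, `failed`, or a plain retry is exactly the question discussed below; the point of the sketch is only that the worker survives it.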
Why do you think so? This is in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then a worker crash is ok, or? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).
When I think about it ... the job should maybe not be retried then, but nevertheless aborted (the abortion request came earlier than the error). That way we wouldn't need an additional migration. Both ways seem acceptable to me. @ewjoachim What do you think?
And when I even think more about it 😉 ... the migration is a better solution. The user may only want to handle the abortion in well-defined states by himself. So, I would respect this by just doing the retry then.
> Why do you think so? This is in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then a worker crash is ok, or? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).
I think it could go either way, but it might be possible that, without any bug in procrastinate, the user would find themselves in a situation where the job couldn't be found anymore (or isn't in the correct state), and while I would expect this to be noisy, crashing the worker for that would itself be the bug, more than the normal consequence of what happened.
Let's say the user was a bit too harsh in cleaning up stalled jobs that were actually still running. Let's say the user deleted a job from the DB. There may be other examples. Even if those are not good things, I wonder if crashing the worker is 1/ the right and 2/ the expected thing to do.
I guess the question is: if this happens, is it more likely that it's a user error or a bug ? That said: both here and in the other one, it seemed to be more bugs so maybe you're right...
For the other question: I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job doesn't go further. It feels a bit weird that we would reschedule a job that was requested to be cancelled: it means the job will need to wait in the queue, potentially for a long time, just to finally start and hopefully stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would be cancelled immediately.
> I guess the question is: if this happens, is it more likely that it's a user error or a bug? That said: both here and in the other one, it seemed to be more bugs so maybe you're right...
Maybe we could add an option to make the worker more fault-tolerant, eventually, once the worker refactoring (#1114) is finished.
> For the other question: I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job doesn't go further. It feels a bit weird that we would reschedule a job that was requested to be cancelled: it means the job will need to wait in the queue, potentially for a long time, just to finally start and hopefully stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would be cancelled immediately.
I am torn about which is the better choice, but I am also okay with aborting such a task without retrying it. I have also noticed another problem with my fix: the task won't be aborted at all after a retry; the retry will just reset it to `todo`. I will create a new PR with an alternative implementation that aborts the job.
Another question then might be how to handle a job of a task that raises, has an aborting status, but is not intended to be retried. Should this also have an aborted status at the end or only the failed ones that should be retried?
Another alternative would be to set a task that raises and has an `aborting` status to `failed`, regardless of whether it should be retried or not. We could add this to the documentation. It looks the most consistent to me.
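A hedged sketch of that rule (a hypothetical helper, not procrastinate code): the final status of a raising job would depend only on whether an abortion was requested, not on the retry strategy.

```python
# Hypothetical sketch of the proposal above: a job whose task raised is
# set to "failed" whenever an abortion was requested, regardless of
# whether the retry strategy would otherwise reschedule it.

def next_status(raised: bool, abort_requested: bool, retry_scheduled: bool) -> str:
    if not raised:
        return "succeeded"
    if abort_requested:
        return "failed"  # the proposal: failed, even if a retry was due
    if retry_scheduled:
        return "todo"    # normal retry path
    return "failed"

print(next_status(raised=True, abort_requested=True, retry_scheduled=True))   # failed
print(next_status(raised=True, abort_requested=False, retry_scheduled=True))  # todo
```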
From the user side of things, it feels right to set the job status to `failed`: `failed` rather than `aborted` feels expected in case of errors.

(thanks a lot for sharing your voice)
Thanks for such a quick reaction to the issue. And just in general for making this project - it's good to have a PSQL-based task queue alternative 😊
Is it too late to reconsider the "aborting" status?
I wonder if things wouldn't be simpler by not having such a status. Yes, there would be a need to store somewhere that a task has been requested to be aborted but is a status an appropriate way?
Also, maybe a good idea to have a state diagram of the state transitions.
See https://github.com/timgit/pg-boss/blob/master/docs/readme.md#job-states for reference.
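In the spirit of the pg-boss job-states document linked above, such a diagram could start from a transition table. This is only my reading of procrastinate's statuses; the exact transitions are an assumption, not taken from the implementation.

```python
# Sketch of a job-state transition table, in the spirit of pg-boss's
# job-states documentation. The transitions are an assumption based on
# this discussion and may not match procrastinate's implementation.

TRANSITIONS = {
    "todo":      {"doing", "cancelled"},
    "doing":     {"succeeded", "failed", "todo", "aborting"},  # "todo" = retry
    "aborting":  {"succeeded", "failed", "todo", "aborted"},   # mirrors "doing"
    # terminal states
    "succeeded": set(),
    "failed":    set(),
    "cancelled": set(),
    "aborted":   set(),
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())

print(can_transition("doing", "aborting"))  # True
print(can_transition("aborting", "doing"))  # False
```

That `aborting` mirrors `doing` is precisely the "masking" point made below.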
I am sorry, but I think that ship has already sailed. If we were to get rid of that status now, then it would be a breaking change. I also still think it's not a wrong decision to have it as a status. I would think differently if it masked another state, but it doesn't. But yes, a state diagram would be nice. Would you create a new issue so that we remember that?
Yes, it would be a breaking change and require a new major version.
It is masking the `doing` state. State transitions from `aborting` are the same as from `doing`. A job marked as `aborting` is not actually aborting until the worker reacts to the change.
Okay, after some more thought, I think moving the aborting state to a separate database field could make sense. Also, since the user does not necessarily have to follow an abort request, the request would still be preserved across a possible subsequent retry. As for the breaking change, one could also argue that the job states are internal API. The disadvantage would be that such a change would result in a fairly large migration. @ewjoachim What do you think? Is this a breaking change or not? @onlyann We should also move this to a new issue (I will close this one with a patch that fixes the reported error).
I think it's worth breaking, but let's discuss it more in a dedicated issue.
@medihack thank you for giving it consideration. This is much appreciated. +1 on making it a separate issue.
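A hedged sketch of what that separate field could look like (hypothetical model, not procrastinate's actual schema): a boolean abort flag next to a plain status, so the abort request survives a retry.

```python
# Hypothetical sketch of the proposal to move "aborting" out of the
# status column: a boolean abort_requested flag next to a plain status,
# so an abort request survives a retry. Not procrastinate's real schema.

from dataclasses import dataclass

@dataclass
class Job:
    id: int
    status: str = "todo"           # todo / doing / succeeded / failed / aborted
    abort_requested: bool = False  # orthogonal to status, survives retries

def retry(job: Job) -> None:
    job.status = "todo"  # note: the abort request is NOT reset here

job = Job(id=1, status="doing")
job.abort_requested = True  # user requests an abort mid-run
retry(job)                  # task raised, retry strategy kicks in
print(job.status, job.abort_requested)  # todo True -> request preserved
```

With the status-based design, the retry resets `aborting` back to `todo` and the request is lost, which is the problem noted earlier in the thread.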
Hi guys!

I've been doing some manual QA for a project I'm working on and found some issues with the newly added abort logic for task cancellation. By the way, a neat feature I was waiting for, glad it has been added!

Here's the general context: I have a single type of task for procrastinate, and I use a retry strategy over it to handle some expected exceptions. I also use `context.should_abort_async` within the task to handle its cancellation. The issue appears with the following steps:

- the task is picked up by a worker and its job moves to the `doing` status;
- an abort is requested, moving the job to `aborting`;
- the task raises an expected exception (before the `should_abort_async` check), causing the worker to retry this task due to a retry strategy being used;
- the `procrastinate_retry_job` SQL function will be used, which will raise a `Job was not found or not in "doing" status` error.

This will shut down the worker process and will leave a task in the `procrastinate_jobs` table with an `aborting` status. If I just restart it, it also won't process newly added tasks, seemingly due to the hanging `aborting` task.

I understand that this might be a niche scenario, but I wonder if this behavior with the retry logic has been considered for the task abortion process. A quick and simple fix for this would be to mind the `aborting` status during the job retry logic, but there is probably more to it.

Thanks for your attention!
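The race in the steps above can be sketched as a self-contained simulation. All classes here are stubs (`should_abort_async` only mirrors the procrastinate context method mentioned above); nothing is the library's real implementation.

```python
# Self-contained simulation of the reported race: the task raises its
# "expected" exception before it ever reaches the should_abort_async
# check, so the worker goes down the retry path while the job is already
# in "aborting". All classes here are stubs, not procrastinate's API.

import asyncio

class ExpectedError(Exception):
    pass

class FakeContext:
    def __init__(self, job_status: str):
        self.job_status = job_status

    async def should_abort_async(self) -> bool:
        # In the real task, this would be awaited periodically to honor
        # abort requests cooperatively.
        return self.job_status == "aborting"

async def my_task(context: FakeContext) -> None:
    # The "expected" exception fires before the task ever reaches its
    # await context.should_abort_async() check -- this is the race.
    raise ExpectedError("raised before the abort check")

async def worker(context: FakeContext) -> str:
    try:
        await my_task(context)
        return "succeeded"
    except ExpectedError:
        # Retry path: mimics procrastinate_retry_job's status guard.
        if context.job_status != "doing":
            raise RuntimeError('Job was not found or not in "doing" status')
        return "todo"

context = FakeContext(job_status="aborting")  # abort requested mid-run
try:
    asyncio.run(worker(context))
except RuntimeError as exc:
    print(exc)  # the error that currently crashes the worker
```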