procrastinate-org / procrastinate

PostgreSQL-based Task Queue for Python
https://procrastinate.readthedocs.io/
MIT License

Retrying a task that has been aborted #1131

Closed majorov-daniil closed 1 month ago

majorov-daniil commented 1 month ago

Hi guys!

I've been doing some manual QA for a project I'm working on and found some issues with the newly added abort logic for task cancellation. By the way, this is a neat feature I've been waiting for; glad it has been added!

Here's the general context: I have a single type of task in procrastinate, and I use a retry strategy on it to handle some expected exceptions. I also call context.should_abort_async() within the task to handle its cancellation. The issue appears with the following steps:

This shuts down the worker process and leaves a task in the procrastinate_jobs table with an aborting status. If I just restart the worker, it also won't process newly added tasks, seemingly because of the hanging aborting task.

I understand that this might be a niche scenario, but I wonder whether this interaction between the retry logic and task abortion has been considered. A quick and simple fix would be to take the aborting status into account in the job retry logic, but there is probably more to it.
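Roughly, the setup looks like this (a simplified sketch, not my actual code: the task body, retry parameters, and exception type are made up for illustration):

import asyncio

import procrastinate
from procrastinate import exceptions

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

@app.task(
    pass_context=True,
    # Retry a few times on an "expected" exception (parameters are illustrative)
    retry=procrastinate.RetryStrategy(max_attempts=5, retry_exceptions={ConnectionError}),
)
async def my_task(context):
    for _ in range(10):
        # Cooperative cancellation: stop as soon as an abort has been requested
        if await context.should_abort_async():
            raise exceptions.JobAborted
        await asyncio.sleep(1)  # stand-in for a unit of work that may raise ConnectionError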

Thanks for your attention!

ewjoachim commented 1 month ago

I understand that this might be a niche scenario

No, it's a fully-fledged bug you've found, and it deserves to be fixed as much as any other bug.

ewjoachim commented 1 month ago

So from my understanding, a minimal reproduction would be:

import asyncio

import procrastinate

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

@app.task(pass_context=True, retry=True)
async def mytask(context):
    # Request abortion of the currently running job (abort=True puts it in "aborting" status)
    await app.job_manager.cancel_job_by_id_async(context.job.id, abort=True)

    # Raise so the retry logic kicks in while the job is still in "aborting" status
    0 / 0

async def main():
    await mytask.defer_async()
    await app.run_worker_async()

asyncio.run(main())

(note: haven't tested this)

There are 2 things we should do:

  • This case should be expected and we should react appropriately
  • And it's probably not expected that the error that triggers Job was not found or not in "doing" status crashes the worker
  • (add the above example as a test)

medihack commented 1 month ago

I will provide a fix as soon as possible.

  • This case should be expected and we should react appropriately

It should be enough to also allow aborting in the procrastinate_retry_job function. This will need another migration. Is there anything else you can think of?

  • And it's probably not expected that the error that triggers Job was not found or not in "doing" status crashes the worker
  • (add the above example as a test)

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then a worker crash is OK, isn't it? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

medihack commented 1 month ago

When I think about it ... maybe the job should not be retried then, but aborted nevertheless (the abortion request came earlier than the error). That way we wouldn't need an additional migration. Both ways seem acceptable to me. @ewjoachim What do you think?

And when I think about it even more 😉 ... the migration is the better solution. The user may want to handle the abortion themselves only in well-defined states. So I would respect that by just doing the retry.

ewjoachim commented 1 month ago

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then a worker crash is OK, isn't it? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

I think it could go either way, but even without a bug in procrastinate, the user might find themselves in a situation where the job can't be found anymore (or isn't in the correct state). While I would expect this to be noisy, crashing the worker over it would itself be the bug, rather than the normal consequence of what happened.

Let's say the user was a bit too aggressive in cleaning up stalled jobs that were actually still running. Let's say the user deleted a job from the DB. There may be other examples. Even if those are not good things to do, I wonder whether crashing the worker is 1/ the right and 2/ the expected thing to do.

I guess the question is: if this happens, is it more likely a user error or a bug? That said, both here and in the other issue it turned out to be a bug, so maybe you're right...

ewjoachim commented 1 month ago

For the other question: I'm a bit hesitant. I think the cancellation request bears some weight: the user has requested that the job not go any further. It feels a bit weird that we would reschedule a job that was requested to be cancelled. It means the job will need to wait in the queue, potentially for a long time, just to finally start and hopefully stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

medihack commented 1 month ago

I guess the question is: if this happens, is it more likely a user error or a bug? That said, both here and in the other issue it turned out to be a bug, so maybe you're right...

Maybe we could add an option to make the worker more fault-tolerant, perhaps after the worker refactoring (#1114) is finished.
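Very roughly, such an option could look something like this (an illustrative sketch with made-up names, not procrastinate's actual internals):

import logging

logger = logging.getLogger(__name__)

# Made-up helper illustrating the trade-off described above:
# fail_hard=True mirrors today's behaviour (the error propagates and the worker crashes),
# fail_hard=False logs the error and keeps the worker running, at the risk of hiding real bugs
async def run_bookkeeping_step(step, *, fail_hard: bool = True):
    try:
        await step()
    except Exception:
        if fail_hard:
            raise
        logger.exception("Bookkeeping step failed; continuing anyway")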

For the other question: I'm a bit hesitant. I think the cancellation request bears some weight: the user has requested that the job not go any further. It feels a bit weird that we would reschedule a job that was requested to be cancelled. It means the job will need to wait in the queue, potentially for a long time, just to finally start and hopefully stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

I am torn about which is the better choice, but I am also okay with aborting such a task without retrying it. I have also noticed another problem with my fix: the task won't be aborted at all after a retry; the retry will just reset it to todo. I will create a new PR with an alternative implementation that aborts the job.

Another question then is how to handle a job whose task raises, has an aborting status, but is not meant to be retried. Should it also end up with an aborted status, or only the failed jobs that would have been retried?

Another alternative would be to set a job that raises while it has an aborting status to failed, regardless of whether it should be retried or not. We could add this to the documentation. That looks the most consistent to me.
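To make that rule concrete, a tiny, purely illustrative sketch (a hypothetical helper, not procrastinate code):

# Hypothetical helper illustrating the rule proposed above; not procrastinate code
def final_status_when_task_raises(abort_requested: bool, retry_planned: bool) -> str:
    """Status a job would end up with, under the proposal above, when its task raises."""
    if abort_requested:
        # Proposed rule: failed, regardless of whether a retry would normally apply
        return "failed"
    return "todo" if retry_planned else "failed"  # "todo" means the job will be retried

assert final_status_when_task_raises(abort_requested=True, retry_planned=True) == "failed"
assert final_status_when_task_raises(abort_requested=False, retry_planned=True) == "todo"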

majorov-daniil commented 1 month ago

From the user side of things, it feels right to set the job status as failed:

ewjoachim commented 1 month ago

(thanks a lot for sharing your voice)

majorov-daniil commented 1 month ago

(thanks a lot for sharing your voice)

Thanks for such a quick reaction to the issue, and just in general for making this project - it's good to have a PSQL-based task queue alternative 😊

onlyann commented 1 month ago

Is it too late to reconsider the "aborting" status?

I wonder if things wouldn't be simpler without such a status. Yes, there would be a need to store somewhere that a task has been requested to be aborted, but is a status the appropriate way to do it?

Also, it might be a good idea to have a state diagram of the state transitions.

See https://github.com/timgit/pg-boss/blob/master/docs/readme.md#job-states for reference.

medihack commented 1 month ago

I am sorry, but I think that ship has already sailed. If we were to get rid of that status now, then it would be a breaking change. I also still think it's not a wrong decision to have it as a status. I would think differently if it masked another state, but it doesn't. But yes, a state diagram would be nice. Would you create a new issue so that we remember that?

onlyann commented 1 month ago

Yes, it would be a breaking change and require a new major version.

It is masking the doing state: state transitions from aborting are the same as from doing.

A job marked as aborting is not aborting until the worker reacts to the change.
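For what it's worth, a rough sketch of the transitions I have in mind (my own approximation, not an official diagram, so details may differ from the actual implementation):

# Rough, unofficial approximation of the job state transitions discussed here
TRANSITIONS = {
    "todo": {"doing", "cancelled"},
    "doing": {"succeeded", "failed", "todo", "aborting", "aborted"},  # "todo" = retry
    "aborting": {"succeeded", "failed", "todo", "aborted"},           # otherwise the same targets as "doing"
    "succeeded": set(),
    "failed": set(),
    "cancelled": set(),
    "aborted": set(),
}
# The point above: "aborting" behaves like "doing" plus a pending abort request,
# which is why a separate flag (rather than a status) could model it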

medihack commented 1 month ago

Okay, after some more thought, I think moving the aborting state to a separate database field could make sense. Also, since the user does not necessarily have to act on the abort request, it would still be preserved across a possible subsequent retry. As for the breaking change, one could also argue that the job states are internal API. The disadvantage would be that such a change would result in a fairly large migration. @ewjoachim What do you think? Is this a breaking change or not? @onlyann We should also move this to a new issue (I will close this one with a patch that fixes the reported error).
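Sketched very roughly, the idea would be something like this (illustrative only; the field and class names are made up, and this is not an actual schema or migration):

from dataclasses import dataclass

# Illustrative model only, not procrastinate's actual schema
@dataclass
class JobRow:
    id: int
    status: str            # e.g. "todo" | "doing" | "succeeded" | "failed" | "cancelled" | "aborted"
    abort_requested: bool  # set by an abort request; tracked independently of the status

job = JobRow(id=1, status="doing", abort_requested=True)
# A retry would reset status to "todo" while abort_requested stays True,
# so the request would be preserved across the retry, as described above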

ewjoachim commented 1 month ago

I think it's worth breaking, but let's discuss it more in a dedicated issue.

onlyann commented 1 month ago

@medihack thank you for giving it consideration. This is much appreciated. +1 on making it a separate issue.