Cancelling a task is not resilient enough

benoit74 commented 3 months ago

When a user or the periodic scheduler requests a task cancellation, the information is stored in DB.

It it then the worker manager responsibility to detect that these tasks have been asked to be removed, and to really cancel them on the worker.

The limitation of this is that when the worker manager is down (e.g. because one task is consuming way too much disks and the worker manager shut itself down, encountered on zimit.kiwix.org when someone tricked the size limit), then nothing can stop the task.

I wonder if the task manager should not itself query the API to check if it should stop himself.

Having both the task manager and the worker manager cancelling tasks has to be assessed for impact, but it should be feasible.

Another concern is that if the task manager is gone, then the worker manager has no idea about how to kill other containers linked to this task. Currently it is the task manager responsibility to properly terminate all containers associated to it. I think we should take into account the scenario where the task manager is gone / not responsive and be able to kill all containers associated to a given task from the worker manager.

rgaudin commented 3 months ago

I like that the solution has a clear design and intent with those roles separate.

The manager has a responsibility and if it's not running those can not be fulfilled obviously. If the manager is shutdown on purpose, then that's the worker's admin responsibility.

It the manager stops itself (currently when avail disk space is lower than expected), then we can revisit this action.

Changing the responsibilities of the actors because of some behavior at the feature/bug border feels 😳 to me

benoit74 commented 3 months ago

Makes sense, you're probably right, we should probably rather consider that the manager should not stop itself no matter the reason.

When disk space is lower than expected the manager should continue to run but just refuse to start new tasks (which is the primary goal as far as I've understood).

And we should probably revisit the actions available to a zimfarm worker admin so that it is possible to stop scheduling new tasks (typically because maintenance is need) but still keep the manager running. This is the current situation in badger2, and it means that we cannot cancel running tasks because worker admin want to cleanly stop its worker.

Both would be way simpler / straightforward indeed.

rgaudin commented 3 months ago

Exactly. We could also imagine sending richer feedback than ping from manager to the API.

openzim / zimfarm

Cancelling a task is not resilient enough #1002