populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean

Complete execution control in the client API? #296

Open denisri opened 10 months ago

denisri commented 10 months ago

In the engine API (which I'm trying to document a little bit, by the way), I see start(), run() and wait() methods, but I don't see stop() or kill(). Isn't there a way to force-stop a running workflow? What about restart()? If I understand how it works, there is no "pending" state in the database: as soon as a workflow is inserted, its jobs can be queried by workers and start to run. We must also think about the case where a job fails for some reason but can be started again. Then execution should be restarted for all workflow jobs which depend on it. Could we change their status to "not started" so that they can run again?

sapetnioc commented 10 months ago

No, there is no such method yet. It may not be easy to interrupt an ongoing job, although it is not impossible. However, it is quite easy to restart a job. Internally, there are five lists of jobs in the database: ready, waiting, ongoing, done and failed. Each job is in exactly one list and moves between lists during processing. The status of a job indicates which list it is in. When a job is created, it goes either to ready or to waiting, depending on whether it needs other jobs to finish before it can run. When a worker is not executing a job, it takes a job from ready, puts it in ongoing and starts processing it. When a job finishes, it goes either to done or to failed. If it succeeded, waiting jobs that depended only on the finished job go to ready. If it failed, all dependent jobs go to failed.
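The five-list scheduling described above can be sketched in plain Python (an in-memory illustration only — the class and method names are hypothetical, not the Capsul API, and the real engine keeps these lists in a database shared by workers):

```python
from enum import Enum

class JobState(Enum):
    WAITING = "waiting"
    READY = "ready"
    ONGOING = "ongoing"
    DONE = "done"
    FAILED = "failed"

class SchedulerSketch:
    """In-memory sketch of the five-list job scheduling described above."""

    def __init__(self):
        self.state = {}   # job id -> JobState
        self.deps = {}    # job id -> set of prerequisite job ids

    def add_job(self, job, deps=()):
        self.deps[job] = set(deps)
        # A new job goes to 'ready' unless it depends on unfinished jobs.
        unfinished = {d for d in deps if self.state.get(d) is not JobState.DONE}
        self.state[job] = JobState.WAITING if unfinished else JobState.READY

    def take_ready_job(self):
        # A free worker moves one job from 'ready' to 'ongoing'.
        for job, st in self.state.items():
            if st is JobState.READY:
                self.state[job] = JobState.ONGOING
                return job
        return None

    def finish_job(self, job, success):
        if success:
            self.state[job] = JobState.DONE
            # Waiting jobs whose dependencies are all done become ready.
            for j, st in self.state.items():
                if st is JobState.WAITING and all(
                        self.state[d] is JobState.DONE for d in self.deps[j]):
                    self.state[j] = JobState.READY
        else:
            self.state[job] = JobState.FAILED
            # All transitively dependent jobs fail too.
            for j in self._dependents(job):
                self.state[j] = JobState.FAILED

    def _dependents(self, job):
        # Fixed-point computation of all jobs depending on `job`.
        out, changed = set(), True
        while changed:
            changed = False
            for j, ds in self.deps.items():
                if j not in out and (job in ds or ds & out):
                    out.add(j)
                    changed = True
        return out
```

In the real engine the state transitions are concurrent database operations rather than single-process method calls, but the list moves are the same.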

Restarting a failed job can be done by putting it back in ready and moving all its dependent jobs from failed to waiting. This has to be an atomic database operation, therefore it has to be done in Lua. The first step would be to define the user API, including the information we would like to add to restarted jobs (for instance, it would be easy to add an execution count for each job). Then I can do the implementation on my own, or with you, so that you get a better knowledge of Capsul v3's internal structure.
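The restart logic might look like the following sketch. Plain Python stands in here for what would really be a single Lua script executed atomically on the database; the function name and dict-based representation are hypothetical:

```python
def restart_job(state, deps, job):
    """Sketch of the restart operation described above (hypothetical names).

    `state` maps job id -> one of the five list names ('ready', 'waiting',
    'ongoing', 'done', 'failed'); `deps` maps job id -> set of prerequisite
    job ids. In the real engine this whole function would be one atomic
    Lua script so that no worker sees an intermediate state.
    """
    if state.get(job) != "failed":
        raise ValueError(f"job {job!r} is not failed, cannot restart it")
    # The failed job itself goes back to 'ready' ...
    state[job] = "ready"
    # ... and every job that failed only because it depends, directly or
    # transitively, on it goes back to 'waiting'.
    restarted, changed = {job}, True
    while changed:
        changed = False
        for j, ds in deps.items():
            if state.get(j) == "failed" and ds & restarted:
                state[j] = "waiting"
                restarted.add(j)
                changed = True
    return state
```

A per-job execution counter, as suggested above, would simply be incremented in the same atomic operation.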

denisri commented 10 months ago

OK, let's do that. The "interrupt job" feature might become mandatory in the future, to kill stuck jobs, long jobs submitted by mistake with wrong parameters, etc. If workers run jobs as OS sub-processes, there is obviously a way to kill them (if not the worker itself). If not, it's more complex, I know. Killing workers should also be an option, in order to completely shut down an engine.
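If jobs do run as OS sub-processes, the kill path could look like this POSIX-only sketch (hypothetical helpers, not the Capsul API): the job gets its own process group, so a signal reaches the job and anything it spawned.

```python
import os
import signal
import subprocess

def run_job(command):
    """Start a job as an OS process in its own process group.

    start_new_session=True gives the job its own session and process
    group, so we can later signal the whole group (the job and any
    children it spawns) without touching the worker itself.
    """
    return subprocess.Popen(command, start_new_session=True)

def kill_job(proc, grace=5.0):
    """Try a polite SIGTERM first, then SIGKILL if the job hangs."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # The job ignored SIGTERM within the grace period: force it.
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```

Killing the worker itself for a full engine shutdown would be the same mechanism one level up, assuming workers are also plain OS processes.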