luispedro / jug

Parallel programming with Python
https://jug.readthedocs.io
MIT License

Jug design question - failed tasks #55

Closed unode closed 6 years ago

unode commented 7 years ago

Hi all,

While using and developing jug_schedule I've come across several design issues that boil down to the following:

Locks in jug are created when a task starts and held for the duration of its execution. They are always removed when the task finishes, even if it fails; only if the process or node crashes are they left behind on the filesystem, in which case jug perceives the task as "Running".

The consequence of removing locks on failure is that every time a new `jug execute` is launched, the failing task is retried.

The ideal solution here would be a keep-alive mechanism similar to the one in NGLess, combined with leaving the lock behind in case of failure.
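A minimal sketch of such a keep-alive (hypothetical names, not jug's actual lock implementation): the lock holder refreshes a timestamp from a background thread, so other processes can treat a stale timestamp as a crashed task rather than a running one:

```python
import threading
import time

class HeartbeatLock:
    """Hypothetical lock whose holder periodically refreshes a
    timestamp; a stale timestamp lets observers distinguish a
    crashed holder from a running one."""

    def __init__(self, interval=1.0):
        self.interval = interval      # seconds between heartbeats
        self.last_beat = None
        self._stop = threading.Event()
        self._thread = None

    def acquire(self):
        self.last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._beat, daemon=True)
        self._thread.start()

    def _beat(self):
        # Refresh the timestamp until release() sets the stop flag.
        while not self._stop.wait(self.interval):
            self.last_beat = time.monotonic()

    def release(self):
        self._stop.set()
        self._thread.join()

    def is_stale(self, timeout):
        # True if the holder has not refreshed the lock within
        # `timeout` seconds, suggesting it crashed.
        return (time.monotonic() - self.last_beat) > timeout
```

In a real store the timestamp would live on the shared filesystem (or backend) rather than in memory, so that other `jug execute` processes can inspect it.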

Some other alternatives from brainstorming (each with caveats):

These provide sufficient information to know whether a task failed, but are still insufficient to distinguish a running task from a crashed one. The lock would still be left behind, so for all practical purposes the task wouldn't be retried until `jug cleanup` is used.

Extras:

luispedro commented 7 years ago

If done behind an option such as `jug execute --keep-failed`, this could be added experimentally using the epoch+1 technique.

A couple of helper functions could then also be added to `jug shell`, something like `get_failed_tasks()`.
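Such a helper might look like this (a sketch; `get_failed_tasks` and the store layout are assumptions, not jug's actual interface, with the store modeled as a simple status mapping):

```python
def get_failed_tasks(tasks, store):
    """Hypothetical jug-shell helper: return the tasks whose store
    entry is marked as failed.

    `store` is modeled here as a dict of task name -> status string."""
    return [t for t in tasks if store.get(t) == "failed"]
```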

unode commented 7 years ago

How do you suggest implementing this on the lower-level interfaces?

Is it OK to modify the interface and break backwards compatibility? E.g., adding new `@abstractmethod`s will make any third-party stores fail.
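The concern can be shown with plain `abc` (generic Python, not jug's actual store classes): adding a new `@abstractmethod` makes existing third-party subclasses uninstantiable, whereas giving the new method a default body keeps them working:

```python
from abc import ABC, abstractmethod

class StoreV2(ABC):
    @abstractmethod
    def get(self, key): ...

    # New abstract method: old third-party stores that don't define
    # it can no longer be instantiated.
    @abstractmethod
    def mark_failed(self, key): ...

class StoreV2Compat(ABC):
    @abstractmethod
    def get(self, key): ...

    # Backwards-compatible alternative: a default implementation, so
    # old subclasses still instantiate and simply opt out of the feature.
    def mark_failed(self, key):
        raise NotImplementedError("failure tracking not supported")

class OldStore(StoreV2):          # a third-party store written pre-change
    def get(self, key):
        return None

class OldStoreCompat(StoreV2Compat):
    def get(self, key):
        return None
```

Instantiating `OldStore()` raises `TypeError`, while `OldStoreCompat()` still works, which is why a default implementation is the gentler way to extend a public store interface.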

luispedro commented 6 years ago

Closing now that #59 was merged.