luispedro / jug

Parallel programming with Python
https://jug.readthedocs.io
MIT License

Jug design question - failed tasks #55

Closed unode closed 6 years ago

unode commented 7 years ago

Hi all,

While using and developing jug_schedule I've come across several design issues that boil down to the following:

Locks in jug are created when a task starts and held for the duration of its execution. They are always removed when the task finishes, even if it fails; only if the process or node crashes are they left behind on the filesystem, in which case jug perceives the task as "Running".

The consequence of removing locks on failure is that every time a new `jug execute` is launched, the failing task is retried.

The ideal solution here would be a keep-alive mechanism similar to the one in NGLess, combined with leaving the lock behind in case of failure.
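A minimal sketch of such a keep-alive (hypothetical names, not jug's actual lock implementation): the lock holder refreshes a timestamp from a background thread, so other processes can treat a stale timestamp as a crashed task rather than a running one:

```python
import threading
import time

class HeartbeatLock:
    """Hypothetical lock whose holder periodically refreshes a
    timestamp; a stale timestamp lets observers distinguish a
    crashed holder from a running one."""

    def __init__(self, interval=1.0):
        self.interval = interval      # seconds between heartbeats
        self.last_beat = None
        self._stop = threading.Event()
        self._thread = None

    def acquire(self):
        self.last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._beat, daemon=True)
        self._thread.start()

    def _beat(self):
        # Refresh the timestamp until release() sets the stop flag.
        while not self._stop.wait(self.interval):
            self.last_beat = time.monotonic()

    def release(self):
        self._stop.set()
        self._thread.join()

    def is_stale(self, timeout):
        # True if the holder has not refreshed the lock within
        # `timeout` seconds, suggesting it crashed.
        return (time.monotonic() - self.last_beat) > timeout
```

In a real store the timestamp would live on the shared filesystem (or backend) rather than in memory, so that other `jug execute` processes can inspect it.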

Some other alternatives from brainstorming (each with caveats):

These provide sufficient information to know whether a task failed, but are still insufficient to distinguish a running task from a crashed one. The lock would still be left behind, so for all practical purposes the task wouldn't be retried until `jug cleanup` is used.

Extras:

luispedro commented 7 years ago

If done behind an option such as `jug execute --keep-failed`, this could be added experimentally using the epoch+1 technique.

A couple of helper functions could then also be added to `jug shell`, something like `get_failed_tasks()`.
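Such a helper might look like this (a sketch; `get_failed_tasks` and the store layout are assumptions, not jug's actual interface, with the store modeled as a simple status mapping):

```python
def get_failed_tasks(tasks, store):
    """Hypothetical jug-shell helper: return the tasks whose store
    entry is marked as failed.

    `store` is modeled here as a dict of task name -> status string."""
    return [t for t in tasks if store.get(t) == "failed"]
```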

unode commented 7 years ago

How do you suggest implementing this on the lower-level interfaces?

Is it OK to modify the interface and break backwards compatibility? E.g., adding new `@abstractmethod`s will make any third-party stores fail.
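The concern can be shown with plain `abc` (generic Python, not jug's actual store classes): adding a new `@abstractmethod` makes existing third-party subclasses uninstantiable, whereas giving the new method a default body keeps them working:

```python
from abc import ABC, abstractmethod

class StoreV2(ABC):
    @abstractmethod
    def get(self, key): ...

    # New abstract method: old third-party stores that don't define
    # it can no longer be instantiated.
    @abstractmethod
    def mark_failed(self, key): ...

class StoreV2Compat(ABC):
    @abstractmethod
    def get(self, key): ...

    # Backwards-compatible alternative: a default implementation, so
    # old subclasses still instantiate and simply opt out of the feature.
    def mark_failed(self, key):
        raise NotImplementedError("failure tracking not supported")

class OldStore(StoreV2):          # a third-party store written pre-change
    def get(self, key):
        return None

class OldStoreCompat(StoreV2Compat):
    def get(self, key):
        return None
```

Instantiating `OldStore()` raises `TypeError`, while `OldStoreCompat()` still works, which is why a default implementation is the gentler way to extend a public store interface.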

luispedro commented 6 years ago

Closing now that #59 was merged.