If done with an option such as `jug execute --keep-failed`, this could be added experimentally using the epoch+1 technique. A couple of helper functions could then also be added to `jug shell`, something like `get_failed_tasks()`.
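A rough sketch of what such a helper could look like, assuming the store gains the `is_failed` query discussed below (nothing here is existing `jug shell` API; `alltasks` and `Task.hash()` are jug's existing task bookkeeping):

```python
# Rough sketch of a get_failed_tasks() helper for `jug shell`. It assumes a new
# store-level is_failed() query (proposed below, not part of jug today); the use
# of jug.task.alltasks and Task.hash() mirrors how jug tracks defined tasks.
from jug.task import alltasks

def get_failed_tasks(store):
    """Return the tasks whose last execution was recorded as failed."""
    return [t for t in alltasks if store.is_failed(t.hash())]
```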
How do you suggest this is implemented on the lower-level interfaces? Adding `is_failed` and `fail` to `backends.base.base_lock`? `is_failed` and `fail` to `backends.base.base_store`? Is it OK to modify the interface and break backwards compatibility? I.e., adding new `@abstractmethod`s will make any third-party stores fail.
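One way to sidestep the backwards-compatibility problem might be to add the new methods with safe default implementations rather than as `@abstractmethod`s, so third-party stores that don't override them keep working unchanged. A minimal sketch, assuming the `fail`/`is_failed` names proposed above:

```python
# Minimal sketch: the new failure API is added as plain methods with safe
# defaults instead of @abstractmethods, so existing third-party stores that
# subclass base_store keep working without changes.
class base_store(object):
    # ... existing interface (dump, load, remove, getlock, ...) unchanged ...

    def fail(self, name):
        """Record that the task identified by `name` failed (default: no-op)."""
        pass

    def is_failed(self, name):
        """Return True if `name` was marked as failed (default: unknown -> False)."""
        return False
```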
Closing now that #59 was merged.
Hi all,
While using and developing jug_schedule I've come across several design issues that boil down to the question of what jug should do when a task fails.
Locks in jug are created once and held for the entire duration of a task's execution. They are always removed when the task finishes, even if it fails. The exception is when a process or node crashes, in which case the lock is left on the filesystem and the task is perceived by jug as "Running".
The consequence of removing locks on failures is that every time a new `jug execute` is launched, the failing task is retried.

The ideal solution here would be a keep-alive mechanism identical to what is present in NGLess, in addition to leaving the lock behind in case of failures.
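For reference, a minimal sketch of the keep-alive idea (all names here are hypothetical, not existing jug or NGLess API): a background thread periodically refreshes the lock, so a lock whose heartbeat stops advancing can be treated as crashed rather than running.

```python
# Sketch of a keep-alive mechanism (lock.refresh() is hypothetical): a daemon
# thread periodically touches the lock so that a stale lock (heartbeat stopped)
# can be distinguished from one held by a live process.
import threading

def start_keepalive(lock, interval=30.0):
    """Refresh `lock` every `interval` seconds until the returned event is set."""
    stop_event = threading.Event()

    def beat():
        while not stop_event.wait(interval):
            lock.refresh()  # hypothetical: e.g. update an mtime/heartbeat field

    threading.Thread(target=beat, daemon=True).start()
    return stop_event  # caller sets this when the task finishes or fails
```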
Some other alternatives from brainstorming include (with caveats):
These provide sufficient information to know whether a task failed, but are still insufficient to distinguish a running task from a crashed one. The lock would still be left behind, so for all practical purposes the task wouldn't be retried until `jug cleanup` is used.

Extras:

- `jug cleanup` removes all locks, regardless of their state. A `jug cleanup --failed-only` could make sense here (see the sketch after this list).
- `jug status` could display a useful Failed column.
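A sketch of what `--failed-only` could do, assuming the proposed `is_failed()` backend method and that locks are obtained and released through the store's existing lock interface (treat all of it as illustrative, not jug's actual cleanup code):

```python
# Illustrative sketch of `jug cleanup --failed-only`: remove only the locks
# left behind by tasks recorded as failed, keeping locks of running tasks.
def cleanup_failed_only(store, task_hashes):
    """Release locks for tasks marked as failed; return how many were removed."""
    removed = 0
    for h in task_hashes:
        if store.is_failed(h):       # proposed new backend query
            store.getlock(h).release()
            removed += 1
    return removed
```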