lanl / BEE

bee_init and bee_exit control node functionality #141

Closed trandles-lanl closed 3 years ago

trandles-lanl commented 4 years ago

I have some thoughts on how we might try to leverage the bee_init and bee_exit nodes in the graph.

bee_init

Contains metadata controlling the start of workflow execution. For instance, a user can set an "alarm" or a "timer." An "alarm" would indicate when a workflow should start running in the future. A "timer" would delay the start of a workflow.
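As a rough sketch of the distinction, assuming bee_init carried either an absolute `alarm` time or a relative `timer` delay (both field names and the function below are hypothetical, not part of BEE), the earliest start time could be resolved like this:

```python
from datetime import datetime, timedelta
from typing import Optional


def resolve_start_time(now: datetime,
                       alarm: Optional[datetime] = None,
                       timer: Optional[timedelta] = None) -> datetime:
    """Return the earliest time the workflow may start.

    alarm: absolute start time; timer: delay relative to `now`.
    Both are hypothetical bee_init metadata fields.
    """
    if alarm is not None and timer is not None:
        raise ValueError("set either an alarm or a timer, not both")
    if alarm is not None:
        # An alarm that is already in the past means "start immediately".
        return max(alarm, now)
    if timer is not None:
        if timer < timedelta(0):
            raise ValueError("timer must be non-negative")
        return now + timer
    # No metadata: the workflow may start right away.
    return now
```

Rejecting the case where both are set is a design choice made for this sketch; taking the later of the two would be equally defensible.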

bee_exit

Contains metadata controlling post-execution actions for a workflow. For instance, a user can say "start another workflow after this workflow finishes."

mcpherson commented 4 years ago

Hmmm, I'm going to have to think about that. You're basically modifying the workflow with metadata and not the CWL description. If we assume that the saved database represents the repeatable workflow, we'd then need to save some kind of package of database files (e.g. including the ones that are subsequently started at bee_exit).

trandles-lanl commented 4 years ago

Good point about bee_exit. Maybe that's not such a good idea. I could imagine wanting to use bee_exit to do things like record the exit state of the workflow. If that's less than success (i.e. some task failed), then we could record the failed task_id and its exit code and error message. That would make identifying the culprit a lot easier in a complex workflow. It might also make restarting a failed workflow easier. Perhaps the error was caused by a missing Charliecloud container image. The user could put the image in place and say "restart from failed task."
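To make the idea concrete, here is a minimal sketch of the kind of record bee_exit might hold. The `TaskResult` class, the record layout, and both helper functions are assumptions for illustration only; none of them exist in BEE.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    task_id: str
    exit_code: int
    error_message: str = ""


def summarize_exit(results: List[TaskResult]) -> dict:
    """Build a hypothetical exit record: overall state plus any failed tasks."""
    failed = [r for r in results if r.exit_code != 0]
    return {
        "state": "failed" if failed else "success",
        "failed_tasks": [
            {
                "task_id": r.task_id,
                "exit_code": r.exit_code,
                "error_message": r.error_message,
            }
            for r in failed
        ],
    }


def restart_point(record: dict) -> Optional[str]:
    """Return the task_id of the first failed task, or None if all succeeded."""
    failed = record["failed_tasks"]
    return failed[0]["task_id"] if failed else None
```

With a record like this, "restart from failed task" reduces to looking up `restart_point(record)` and resuming the workflow there once the user has fixed the underlying problem.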

Using bee_init for an alarm or timer function is probably fair game because I think it's outside the scope of CWL itself. It doesn't change the workflow in any way; it only controls a condition under which the workflow can begin. The same functionality could be implemented for tasks in a workflow running on a Slurm cluster using the --begin option of sbatch.
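For the Slurm case, a sketch of how alarm or timer metadata might map onto `sbatch --begin`. The function and its parameters are hypothetical; the only Slurm behavior assumed is that `--begin` accepts an absolute `YYYY-MM-DDTHH:MM:SS` time or a relative `now+<count><units>` expression.

```python
from datetime import datetime
from typing import List, Optional


def sbatch_command(script: str,
                   alarm: Optional[datetime] = None,
                   timer_minutes: Optional[int] = None) -> List[str]:
    """Build an sbatch argument list with an optional deferred start."""
    cmd = ["sbatch"]
    if alarm is not None:
        # Absolute start time ("alarm").
        cmd.append("--begin=" + alarm.strftime("%Y-%m-%dT%H:%M:%S"))
    elif timer_minutes is not None:
        # Delay relative to submission time ("timer").
        cmd.append(f"--begin=now+{timer_minutes}minutes")
    cmd.append(script)
    return cmd
```

The returned list is only constructed here, not executed; it could be handed to `subprocess.run` on a system where Slurm is installed.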

guanxyz commented 4 years ago

The fault-tolerance capability should be highlighted as one of the major features of BEE. Our orchestration has a global view of the state of a workflow (init, waiting, running, etc.). With a properly configured timeout, BEE can decide to kill waiting, non-responding, or timed-out tasks and go back to the database to restart them.
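A minimal sketch of the timeout check, assuming the orchestrator can see each task's state and the time of its last state change. The table layout and function name are hypothetical and do not reflect how BEE stores task state.

```python
from typing import Dict, List, Tuple

# Hypothetical view of task state:
# task_id -> (state, time of last state change in seconds since the epoch)
TaskTable = Dict[str, Tuple[str, float]]


def find_stalled(tasks: TaskTable, now: float, timeout: float) -> List[str]:
    """Return the IDs of waiting or running tasks with no update within `timeout` seconds."""
    stalled = []
    for task_id, (state, last_update) in tasks.items():
        if state in ("waiting", "running") and now - last_update > timeout:
            stalled.append(task_id)
    return sorted(stalled)
```

Tasks returned by such a check would be the candidates to kill and then restart from the state saved in the database.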

mcpherson commented 4 years ago

The state of the workflow is always captured by the current state of the database (live or archived on disk). I think all of what you want to do with bee_exit can be done with the database.

mcpherson commented 3 years ago

Largely outdated by the current state of the GDB. Recommend closing and moving the discussion to the complex workflow upgrade.

trandles-lanl commented 3 years ago

Agree on closing this. @Boogie3D or @mcpherson feel free to close once you've captured anything you want to preserve in the new complex workflow plan.