lanl / BEE

Initial task manager resiliency and error handling #789

Closed jtronge closed 6 months ago

jtronge commented 7 months ago

These are some initial changes to improve resiliency and error handling in the task manager. They don't completely resolve issues #675 and #676, but I'm opening this now since the changes are relatively self-contained. Rusty and I discussed those issues and think it would be best to have a longer discussion about them and about the interaction between the task manager and the workflow manager.

Also, I think this should resolve #550: the builder should now throw an exception when it fails to build or pull a container. I added an integration test case for an invalid container build.
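
For illustration, the builder behavior described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `ContainerBuildError`, `run_build`, `submit_task`, and the `BUILD_FAIL`/`READY` state names are all hypothetical, and the build command is stood in for by an arbitrary subprocess.

```python
import subprocess
import sys


class ContainerBuildError(Exception):
    """Hypothetical exception raised when a container build or pull fails."""


def run_build(cmd):
    """Run a build/pull command and raise on failure instead of failing silently."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as err:
        # The command could not be started at all (e.g. missing executable).
        raise ContainerBuildError(f"could not run {cmd[0]!r}: {err}") from err
    if result.returncode != 0:
        raise ContainerBuildError(
            f"{cmd[0]!r} exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def submit_task(task_name, build_cmd):
    """Hypothetical task-manager step that turns a build failure into a task state."""
    try:
        run_build(build_cmd)
    except ContainerBuildError as err:
        # Record the failure rather than letting it take down the task manager.
        return {"task": task_name, "state": "BUILD_FAIL", "error": str(err)}
    return {"task": task_name, "state": "READY", "error": None}


if __name__ == "__main__":
    # Stand-in for an invalid container build: a command that exits non-zero.
    bad_build = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(submit_task("invalid-container", bad_build))
```

The point of the sketch is the division of labor: the builder raises, and the caller decides how a failed build maps onto a task state. An integration test for an invalid build would then assert on that state rather than on a hang or an unhandled traceback.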