Keep the failed Batch task (and its node) around, with a configurable deletion time.
This should apply to executor errors (such as a failed download or upload), as well as to tasks that complete successfully but for which cromwell_rc is non-zero. It can only work if the node was actually created.
We should also provide easy-to-follow instructions for rerunning the task directly on the node, most likely as a script in a convenient location. Check whether the batchScript as-is would be enough.
Note that this will make the Batch job creation (no auto-delete) and the eventual deletion logic a bit more complex, and keeping the Batch nodes around will incur cost. This could be combined with a feature that deliberately keeps all nodes around for a brief amount of time in case the next task can use that VM and execute immediately.
This needs to be provided as an opt-in setting in the configuration, defaulting to false.
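
To make the proposed rules concrete, here is a minimal sketch of the retention decision in Python. It is illustrative only and not the project's actual code: the names `RetentionConfig`, `keep_failed_nodes`, `deletion_delay`, `TaskOutcome`, `should_retain`, and `deletion_time` are all hypothetical, as is the 4-hour default delay.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RetentionConfig:
    # Hypothetical setting names. Opt-in, defaulting to False as proposed.
    keep_failed_nodes: bool = False
    # Hypothetical default for the configurable deletion time.
    deletion_delay: timedelta = timedelta(hours=4)


@dataclass
class TaskOutcome:
    node_created: bool          # retention is impossible if no node exists
    executor_error: bool        # e.g. a failed download or upload
    cromwell_rc: Optional[int]  # None if the return code was never written


def should_retain(config: RetentionConfig, outcome: TaskOutcome) -> bool:
    """Return True if the task and its node should be kept for debugging."""
    if not config.keep_failed_nodes or not outcome.node_created:
        return False
    rc_failed = outcome.cromwell_rc is not None and outcome.cromwell_rc != 0
    return outcome.executor_error or rc_failed


def deletion_time(config: RetentionConfig, outcome: TaskOutcome,
                  finished_at: datetime) -> datetime:
    """Return when the node may be deleted: delayed if retained, else at once."""
    if should_retain(config, outcome):
        return finished_at + config.deletion_delay
    return finished_at


if __name__ == "__main__":
    cfg = RetentionConfig(keep_failed_nodes=True)
    failed = TaskOutcome(node_created=True, executor_error=False, cromwell_rc=1)
    print(deletion_time(cfg, failed, datetime.now(timezone.utc)))
```

Keeping the decision in one pure function would let the same check drive both job creation (whether to disable auto-delete) and the later cleanup pass.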