Keep the failed batch task around to make debugging of failed tasks easier.

tonybendis commented 3 years ago

Keep the failed batch task (and node) around, with a configurable deletion time. This should be done for executor errors (like a failed download or upload), as well as for tasks that complete successfully but for which cromwell_rc is non-zero. This would only work if the node was actually created. We should also have easy-to-follow instructions for rerunning the task directly on the node, most likely a script in a convenient location; check whether the batchScript as is would be enough. Note that this will make the batch job creation (no auto-delete) and eventual deletion logic a bit more complex, and that keeping the Batch nodes around would cost money. This could be combined with a feature that purposely keeps all nodes around for a brief amount of time in case the next task can use that VM and execute immediately. This needs to be opt-in in the configuration, defaulting to false.
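
The retention rule described above could be sketched roughly as follows. This is only an illustration in Python, not the TES implementation: `RetentionConfig`, `TaskOutcome`, `delete_after`, and all of their field names are hypothetical, and the one-hour default retention is an assumed placeholder. It returns the time at which a retained task and node should be deleted, or `None` when they should be cleaned up immediately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetentionConfig:
    # Hypothetical settings. Opt-in, defaulting to off as the issue requests.
    keep_failed_nodes: bool = False
    # Assumed placeholder for the "configurable deletion time".
    retention: timedelta = timedelta(hours=1)


@dataclass(frozen=True)
class TaskOutcome:
    # Hypothetical summary of how a batch task ended.
    node_created: bool          # retention is only possible if a node exists
    executor_error: bool        # e.g. a failed download or upload
    cromwell_rc: Optional[int]  # None if the task never produced a return code
    finished_at: datetime


def delete_after(outcome: TaskOutcome, config: RetentionConfig) -> Optional[datetime]:
    """Return when to delete the retained task and node, or None to delete now."""
    # Feature disabled, or there is no node to keep around.
    if not config.keep_failed_nodes or not outcome.node_created:
        return None
    # A task counts as failed on an executor error or a non-zero cromwell_rc.
    nonzero_rc = outcome.cromwell_rc is not None and outcome.cromwell_rc != 0
    if not (outcome.executor_error or nonzero_rc):
        return None
    return outcome.finished_at + config.retention


if __name__ == "__main__":
    finished = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    config = RetentionConfig(keep_failed_nodes=True, retention=timedelta(minutes=30))
    failed = TaskOutcome(node_created=True, executor_error=False,
                         cromwell_rc=1, finished_at=finished)
    print(delete_after(failed, config))
```

Keeping the decision in one pure function would let the same rule drive both job creation (whether to disable auto-delete) and the later cleanup pass.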

MattMcL4475 commented 2 years ago

Add this as a configuration setting for TES in appsettings.json so it can be toggled at runtime.
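
As a rough illustration of reading such an opt-in flag with a default of false, here is a small Python sketch. The key name `KeepFailedBatchNodes` is hypothetical and is not an existing TES setting; the function takes the JSON text directly, so a caller could re-read the file on each check to pick up changes at runtime.

```python
import json


def keep_failed_nodes_enabled(appsettings_text: str) -> bool:
    """Return True only if the (hypothetical) flag is explicitly set to true."""
    settings = json.loads(appsettings_text)
    if not isinstance(settings, dict):
        return False
    # "KeepFailedBatchNodes" is an assumed key name, used for illustration only.
    # A missing key or any non-boolean value falls back to the default: False.
    return settings.get("KeepFailedBatchNodes", False) is True


if __name__ == "__main__":
    print(keep_failed_nodes_enabled('{"KeepFailedBatchNodes": true}'))
    print(keep_failed_nodes_enabled('{}'))
```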

ngambani commented 9 months ago

@MattMcL4475 this looks like a long-pending issue, open since 2021. Can it be closed, or do we want to include it in our 2024 roadmap?

microsoft / CromwellOnAzure

Keep the failed batch task around to make debugging of failed tasks easier. #219