microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure
MIT License

Keep the failed batch task around to make debugging of failed tasks easier. #219

Open tonybendis opened 3 years ago

tonybendis commented 3 years ago

Keep the failed Batch task (and its node) around, with a configurable deletion time. This should apply to executor errors (such as a failed download or upload) as well as to tasks that complete successfully but for which cromwell_rc is non-zero. It would only work if the node was actually created.

We should also provide easy-to-follow instructions for rerunning the task directly on the node, most likely a script in a convenient location; check whether the batchScript as-is would be enough.

Note that this will make the Batch job creation (no auto-delete) and the eventual deletion logic a bit more complex, and keeping the Batch nodes around will incur cost. It could be combined with a feature that deliberately keeps all nodes around for a brief time in case the next task can use that VM and execute immediately.

This needs to be opt-in in the configuration, defaulting to false.
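To make the proposed rule concrete, here is a minimal sketch of the retention decision in Python. It is illustrative only, not the project's code: the names `RetentionSettings`, `TaskOutcome`, `keep_failed_tasks`, `retention`, and `delete_after` are hypothetical, and the one-hour default is an arbitrary assumption. It encodes only the conditions stated above: opt-in defaulting to false, node actually created, and either an executor error or a non-zero cromwell_rc.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionSettings:
    # Hypothetical setting names; the proposal only requires an opt-in that defaults to false.
    keep_failed_tasks: bool = False
    retention: timedelta = timedelta(hours=1)  # assumed default, not from the issue


@dataclass(frozen=True)
class TaskOutcome:
    node_created: bool           # retention is only possible if a node exists
    executor_failed: bool        # e.g. failed download or upload
    cromwell_rc: Optional[int]   # None if no return code was ever written
    finished_at: datetime


def delete_after(outcome: TaskOutcome, settings: RetentionSettings) -> Optional[datetime]:
    """Return when a retained task and node should be deleted, or None to delete right away."""
    if not settings.keep_failed_tasks or not outcome.node_created:
        return None
    nonzero_rc = outcome.cromwell_rc is not None and outcome.cromwell_rc != 0
    if outcome.executor_failed or nonzero_rc:
        return outcome.finished_at + settings.retention
    return None
```

A cleanup loop could then compare the returned timestamp with the current time and delete the job and node once it has passed; how that maps onto actual Batch job and pool deletion is left open here.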

MattMcL4475 commented 2 years ago

1. Add as a configuration setting for TES in appsettings.json so it can be toggled at runtime

ngambani commented 9 months ago

@MattMcL4475 this looks like a long-pending issue, open since 2021. Can it be closed, or do we want to include it on our 2024 roadmap?