microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

Pursue more efficient and effective intra-task compute node cleanup #713

Closed BMurri closed 3 months ago

BMurri commented 3 months ago

Problem:

#711 addressed an edge case that could leave a compute node "dirty" after a task failure (specifically when pulling docker image layers), but it did not restore the earlier safeguard that ensured no filesystem usage from the previous task remained before the new task started its downloads. We will also add a performance enhancement: do not remove the docker image if the next task can reuse it, while still ensuring the last image is not left behind when the next task's image differs.

Solution: Before starting to process the task input files:

  1. Flush any docker volumes that may have been left behind.
  2. If the docker image does not exactly match the last docker image used on the compute node, delete that image (this deletion is moved here from after the docker container completes execution). This will reduce time and cost when the same image is repeatedly used on the same node.
  3. Ensure that the only task directory on the node is the currently running task's.
  4. Stop deleting the docker image after container execution completes.

Describe alternatives you've considered

Restoring the previous bash implementation of this feature.

Code dependencies

Will this require code changes in:

Additional context

Before #711 we depended on Azure Batch to scrub previous tasks' directories, and we deleted docker images right after container execution completed. We received reports that the same task in the same pool reusing nodes sometimes resulted in DiskFull failure events.

#711 addresses the possibility that some other interleaved task's docker image pull left orphaned layers behind, resulting in insufficient remaining disk space.

This issue adds a defense-in-depth practice: ensure that previous tasks' disk usage has been removed before the new task starts filling the disk. (Previous observations found instances where Azure Batch was still deleting a previous task's directories while the new task was downloading, resulting in NodeUnusable failures due to DiskFull.) While reimplementing that insurance scheme, we also want to implement the optimization of not removing an image that the next container to run could simply reuse.

We expect this combination to greatly reduce future instances of DiskFull failures.