Exceptionally long wait times in Batch

ducatiMonster916 commented 4 years ago

Per user communication: jobs submitted to CoA can sometimes take up to 6 hours to start (whereas the processing time is typically <1 hr), which significantly slows down or blocks analyses of cohort samples. The user has modified the WDL to specify lower runtime requirements in order to get more access to VM resources, to no avail.

See the run time for jobs in green vs purple in the attached figure.

Update 4/23/2021:

When the number of active tasks is high, it takes a very long time to check the status of each task. Consequently, it takes a long time to dequeue new tasks and submit them to CosmosDB. There is likely throttling from the Batch APIs as well. Investigate and figure out how to reduce both delays to the minimum (seconds).

The main change will be to stop polling Batch for each task's status, and instead use a storage queue into which the Batch task inserts a message when it finishes executing on the node. Other ideas:

retrieve only those fields from CosmosDB that are needed
retrieve the full task from CosmosDB only when needed
maintain a write-through in-memory cache of active tasks, and load them at startup
have Batch insert a queue message when a task is done, maybe one for each step of the executor
don't query for the node count all the time; maintain a count instead. Note that nodes take time to go away, so re-query only when close to the max Batch quota
send the task to Batch as soon as it is received from Cromwell (but store it first, and return from the API call immediately); parallelize sending to Batch
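
To make the main idea above concrete, here is a minimal sketch of replacing per-task polling with a completion queue, combined with the in-memory cache of active tasks. It is an illustration only, not the CoA implementation: Python's standard-library `queue.Queue` stands in for the storage queue, and the names `report_done`, `drain_completions`, `active_tasks`, and the state strings are all hypothetical.

```python
import json
import queue

# Stand-in for the storage queue (assumption: the real thing would be a
# cloud storage queue accessed through its SDK, not an in-process queue).
completion_queue = queue.Queue()

# Hypothetical in-memory cache of active tasks, keyed by task id.
active_tasks = {"task-1": "RUNNING", "task-2": "RUNNING", "task-3": "RUNNING"}


def report_done(task_id, exit_code):
    """Node side: enqueue one message when the task finishes."""
    completion_queue.put(json.dumps({"id": task_id, "exit_code": exit_code}))


def drain_completions():
    """Scheduler side: handle only the tasks that reported completion.

    The cost scales with the number of finished tasks, not with the
    number of active tasks, because no per-task status call is made.
    """
    finished = []
    while True:
        try:
            message = json.loads(completion_queue.get_nowait())
        except queue.Empty:
            break
        # Illustrative state names; map the exit code to a final state.
        state = "COMPLETE" if message["exit_code"] == 0 else "EXECUTOR_ERROR"
        active_tasks[message["id"]] = state
        finished.append(message["id"])
    return finished
```

Calling `report_done("task-2", 0)` and then `drain_completions()` updates only `task-2` in the cache; `task-1` is never queried while it is still running.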

ducatiMonster916 commented 4 years ago

Requested the offending WDL in order to replicate the problem internally on a production instance.

barrows-biotia commented 4 years ago

I've logged related observations on #107

ducatiMonster916 commented 4 years ago

WDL & input files have been provided to MSGenomics for further review.

tracykard commented 4 years ago

TODO (done in 2.2): add timestamps to mark when the request for a VM started, how long the download took, how long decompression took, how long uploading results took, and the execution time, so we can measure these times precisely. As noted in the related issue, smaller VMs take longer to download and decompress large images.
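
As a rough illustration of the kind of per-phase timing described above, the sketch below records the duration of each phase with a context manager. It is not the code shipped in 2.2; the `timed` helper, the `timings` dictionary, and the phase names are hypothetical, and `time.sleep` is a placeholder for the real work.

```python
import time
from contextlib import contextmanager

# Hypothetical store of phase name -> elapsed seconds.
timings = {}


@contextmanager
def timed(phase):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[phase] = time.monotonic() - start


# Illustrative phase names matching the steps listed above.
for phase in ("vm_request", "download", "decompress", "execute", "upload"):
    with timed(phase):
        time.sleep(0.01)  # placeholder for the real work in this phase
```

With one entry per phase, a long wait can be attributed to VM allocation, image download, decompression, execution, or upload instead of being reported as a single total.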

MattMcL4475 commented 2 years ago

Investigate subscribing to an event like this: https://docs.microsoft.com/en-us/azure/batch/batch-task-complete-event

BMurri commented 1 year ago

Related to #523.

ngambani commented 9 months ago

@BMurri this looks like a long-pending issue, open since 2020. Are we actively working on this, or can it be closed?

BMurri commented 9 months ago

@ngambani the implementation of #523 is expected to fix this issue. I'd like to prove it and close this issue with that implementation. If that implementation isn't sufficient, this is a good yardstick to measure our scalability against.

microsoft / CromwellOnAzure

Exceptionally long wait times in Batch #119