microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure
MIT License

Exceptionally long wait times in Batch #119

Open ducatiMonster916 opened 4 years ago

ducatiMonster916 commented 4 years ago

Per user reports, jobs submitted to CoA can sometimes wait up to 6 hours before starting, whereas processing time is typically under 1 hour. This significantly slows down or blocks analyses of cohort samples. The user has modified the WDL to specify lower runtime requirements in order to gain access to more VM resources, to no avail.

image

See the run time for jobs in green vs purple in the attached figure.

Update 4/23/2021:

When the number of active tasks is high, it takes a very long time to check the status of each task. Consequently, it takes a long time to dequeue new tasks and submit them to CosmosDB. The Batch APIs are likely throttling requests as well. Investigate and figure out how to reduce both delays to a minimum (seconds).

The main change will be to stop polling Batch for each task's status and instead use a storage queue that the Batch task writes a message to when it finishes executing on the node. Other ideas:

ducatiMonster916 commented 4 years ago

Requested the offending WDL so we can replicate the issue internally on a production instance.

barrows-biotia commented 4 years ago

I've logged related observations on #107

ducatiMonster916 commented 4 years ago

WDL & input files have been provided to MSGenomics for further review.

tracykard commented 4 years ago

TODO (done in 2.2): add timestamps that mark when the request for a VM started, how long the download took, how long decompression took, how long uploading results took, and the execution time, so we can measure these times precisely. As noted in the related issue, smaller VMs take longer to download and decompress large images.
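
For anyone wanting to reason about the per-phase timestamps described above, here is a minimal sketch of the idea in Python. It is not the CoA implementation (which this thread does not show); the `PhaseTimer` class and the phase names are hypothetical and only illustrate recording how long each stage takes so the slowest one can be identified.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: accumulates wall-clock duration per named phase."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timer can be exercised deterministically.
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and add it to the total for `name`."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Return the name of the phase with the largest accumulated duration."""
        return max(self.durations, key=self.durations.get)


def run_example():
    """Illustrative usage with placeholder work in each phase."""
    timer = PhaseTimer()
    for name in ("image_download", "decompress", "execute", "upload_results"):
        with timer.phase(name):
            pass  # real work for the phase would go here
    return timer.durations
```

With measurements like these logged per task, a long queue wait can be separated from a slow image download or decompression on a small VM.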

MattMcL4475 commented 2 years ago

Investigate subscribing to an event like this: https://docs.microsoft.com/en-us/azure/batch/batch-task-complete-event
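
To make the push-based idea in this thread concrete (a queue the task writes to on completion, or a task-complete event, instead of polling each task), here is a rough sketch in Python. It is an illustration under stated assumptions, not the CoA design: an in-memory `queue.Queue` stands in for a durable storage queue, and `run_task`, `drain_completions`, and the message fields are hypothetical names.

```python
import queue
import threading

# Stand-in for a durable storage queue (assumption: a real deployment would
# use a cloud queue service, which is not modeled here).
completions = queue.Queue()


def run_task(task_id, work):
    """Node side: run the work, then push exactly one completion message."""
    try:
        work()
        exit_code = 0
    except Exception:
        exit_code = 1
    completions.put({"task_id": task_id, "exit_code": exit_code})


def drain_completions(active, timeout=1.0):
    """Scheduler side: wait on the queue rather than polling every task.

    Removes finished task ids from `active` and returns {task_id: exit_code}.
    """
    finished = {}
    while active:
        try:
            msg = completions.get(timeout=timeout)
        except queue.Empty:
            break  # nothing arrived in time; caller can retry later
        if msg["task_id"] in active:
            active.discard(msg["task_id"])
            finished[msg["task_id"]] = msg["exit_code"]
    return finished


def demo():
    """Run three simulated tasks, one of which fails, and collect the results."""
    def fail():
        raise RuntimeError("simulated task failure")

    jobs = {"task-a": lambda: None, "task-b": fail, "task-c": lambda: None}
    active = set(jobs)
    threads = [threading.Thread(target=run_task, args=(tid, work))
               for tid, work in jobs.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return drain_completions(active), active
```

The point of the design is that scheduler work scales with the number of completions rather than with the number of active tasks, so a large backlog no longer multiplies status calls against the Batch API.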

BMurri commented 1 year ago

Related to #523.

ngambani commented 9 months ago

@BMurri this looks like a long-pending issue, open since 2020. Are we actively working on this, or can it be closed?

BMurri commented 9 months ago

@ngambani the implementation of #523 is expected to fix this issue. I'd like to prove it and close this issue with that implementation. If that implementation isn't sufficient, this is a good yardstick to measure our scalability against.