ducatiMonster916 opened this issue 4 years ago (status: Open)
Requested the offending WDL so we can replicate this on a production instance internally.
I've logged related observations in #107.
WDL & input files have been provided to MSGenomics for further review.
TODO (done in 2.2): add additional timestamps to mark when the request for a VM started, how long the download took, how long decompression took, how long uploading results took, and the execution time, so we can precisely measure each of these stages. As noted in the related issue, smaller VMs will take longer to download and decompress large images.
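For illustration, a minimal sketch of the kind of per-stage timing this implies (not the actual CoA/TES implementation; the stage names and metrics format are assumptions):

```python
# Hypothetical sketch of per-stage timing on the node; stage names and the
# metrics format are assumptions, not the actual Cromwell on Azure code.
import time
from contextlib import contextmanager

stage_metrics = {}

@contextmanager
def timed_stage(name):
    start_epoch = time.time()    # wall-clock start, for correlating with VM allocation time
    start = time.monotonic()     # monotonic clock for the duration measurement
    try:
        yield
    finally:
        stage_metrics[name] = {
            "start_epoch": start_epoch,
            "duration_sec": time.monotonic() - start,
        }

with timed_stage("download_inputs"):
    pass  # download input files / pull the container image here
with timed_stage("decompress_image"):
    pass  # decompress the image (slower on smaller VMs)
with timed_stage("execute"):
    pass  # run the task command
with timed_stage("upload_results"):
    pass  # upload outputs back to storage

print(stage_metrics)  # emit alongside task results so each stage can be measured precisely
```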
Investigate subscribing to an event like this: https://docs.microsoft.com/en-us/azure/batch/batch-task-complete-event
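For reference, a rough sketch of what consuming such an event could look like once it is delivered (field names follow the example payload in the linked doc and should be verified; how the event is delivered, e.g. via Azure Monitor diagnostics to a queue or event hub, is still an open question):

```python
# Hypothetical handler for a Batch TaskCompleteEvent payload; field names are
# taken from the example in the linked documentation and should be verified.
import json

def handle_task_complete_event(raw_body):
    event = json.loads(raw_body)
    job_id = event.get("jobId")
    task_id = event.get("id")
    execution = event.get("executionInfo", {})
    exit_code = execution.get("exitCode")
    # Instead of polling Batch, mark the corresponding TES task finished here,
    # using the exit code to decide success vs. failure.
    print(f"job={job_id} task={task_id} exited with {exit_code}")
```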
Related to #523.
@BMurri this looks like a long-pending issue open since 2020; are we actively working on this, or can it be closed?
@ngambani the implementation of #523 is expected to fix this issue. I'd like to prove it and close this issue with that implementation. If that implementation isn't sufficient, this is a good yardstick to measure our scalability against.
Per user communication: submitting jobs to CoA can sometimes take up to 6 hours before they start running (whereas the processing time itself is typically under 1 hour), significantly slowing down or inhibiting analyses of cohort samples. The user has modified the WDL to specify lower runtime requirements in order to gain access to more VM resources, to no avail.
See the run time for jobs in green vs purple in the attached figure.
Update 4/23/2021:
When the number of active tasks is high, it takes a very long time to check the status of each task. Consequently, it takes a long time to dequeue new tasks and submit them to CosmosDB. There is likely throttling from the Batch APIs as well. Investigate and figure out how to reduce both delays to a minimum (seconds).
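One way to cut the per-task status checks is to list all task states for a job in a single paginated Batch call with a select clause, rather than issuing one get-task call per task. A hedged sketch with the azure-batch Python SDK (not the current implementation; account name, key, URL, and job id are placeholders):

```python
# Hedged sketch: fetch the state of every task in a job with one paginated
# list call instead of one API call per task, reducing Batch API traffic and
# the chance of throttling. Account name, key, URL, and job id are placeholders.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, "https://mybatchaccount.myregion.batch.azure.com"
)

def get_task_states(job_id):
    """Return {task_id: state} for every task in the job via a single list operation."""
    options = batchmodels.TaskListOptions(select="id,state")  # fetch only the fields we need
    return {t.id: t.state for t in batch_client.task.list(job_id, task_list_options=options)}

states = get_task_states("my-coa-job")
completed = [tid for tid, s in states.items() if s == batchmodels.TaskState.completed]
print(f"{len(completed)}/{len(states)} tasks completed")
```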
The main change will be to stop polling Batch for each task's status and instead use a storage queue that the Batch task can insert a message into when execution on the node completes. Other ideas: