va-big-data-genomics / trellis-mvp-functions

Trellis serverless data management framework for variant calling of VA MVP whole-genome sequencing data.

Launching high frequency of duplicate GATK workflows #9

Closed pbilling closed 4 years ago

pbilling commented 4 years ago

With v0.5.4-2, 36 (23%) duplicate GATK variant-calling workflows were launched while processing data from 155 samples. Runtimes for the duplicate GATK VMs ranged from 0 to 4 minutes.

[Figure: Trellis v0.5.4 Test #2, duplicate GATK runtimes]

pbilling commented 4 years ago

Changed the database queries used for launching jobs to create a :jobRequest semaphore node when the query succeeds. This node indicates that a job has been scheduled, and it blocks subsequent queries from returning results. After adding the semaphore node to the database model, the job duplication rate dropped below 1%.

Duplicate jobs were due to race conditions between multiple identical queries. It is still possible to hit this race condition if two queries start at the "same" time, but the frequency is currently below our acceptable threshold.
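To illustrate how two queries that start at nearly the same time can still both launch a job, here is a small deterministic sketch in Python. It models the general check-then-write race rather than the actual Trellis queries: the `check`, `mark`, and `run_interleaved` helpers and the sample name are hypothetical, and the assumption is that each query first looks for a marker and then writes one.

```python
def check(requests, key):
    """Step 1 of a query: is there already a job request marker?"""
    return key in requests


def mark(requests, key):
    """Step 2 of a query: write the job request marker."""
    requests.add(key)


def run_interleaved(order):
    """Run the steps of two queries in the given order.

    Returns how many jobs would be launched.
    """
    requests = set()
    key = ("sample-001", "gatk")
    saw_marker = {}
    launches = 0
    for query, step in order:
        if step == "check":
            saw_marker[query] = check(requests, key)
        elif step == "mark" and not saw_marker[query]:
            # The query saw no marker earlier, so it marks and launches.
            mark(requests, key)
            launches += 1
    return launches


# Query A finishes before query B starts: B sees the marker and is blocked.
sequential = run_interleaved(
    [("A", "check"), ("A", "mark"), ("B", "check"), ("B", "mark")]
)

# Both queries check before either has written the marker: both launch.
overlapping = run_interleaved(
    [("A", "check"), ("B", "check"), ("A", "mark"), ("B", "mark")]
)
```

Here `sequential` is 1 and `overlapping` is 2. The marker shrinks the window in which a duplicate can slip through, but it does not remove it unless the check and the write are a single atomic step.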