populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Handle Dataproc Jobs better #793

Open MattWellie opened 2 weeks ago


Related to #790

The Dataproc cluster wrapper method we use creates 3 jobs (start, run, close). Of those, only the run job is passed back, so the production-pipelines stage wrapper framework can only automatically set dependencies on that one job. All other dependencies must be gathered and fed forward into the wrapper manually, which breaks the pattern established by the pipeline framework.

Proposal: alter the wrapper to pass back a list of all three jobs, not just the 'worker' job. This removes the special case where dependencies need to be fed forward into the wrapper. It will require changes to the RD, GATK-SV, gCNV, and Large-cohort pipelines, but only a trivial change to each (I believe all stages currently using Dataproc generate only a single job, and the Stage implementation accepts either a single job or a list of jobs as a return value).
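A minimal sketch of the proposed shape, using hypothetical stand-in names (`Job`, `dataproc_wrapper`, `depends_on`) rather than the real cpg-utils / Hail Batch API: the wrapper returns the full start/run/close trio, so a calling stage can set upstream dependencies on every job without feeding anything forward.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Stand-in for a Hail Batch job (hypothetical, not the real API)."""
    name: str
    dependencies: list["Job"] = field(default_factory=list)

    def depends_on(self, *jobs: "Job") -> None:
        self.dependencies.extend(jobs)


def dataproc_wrapper(script: str) -> list[Job]:
    """Create the start/run/close trio and return ALL of them,
    not just the 'run' job, so the stage framework can wire
    upstream dependencies onto every job automatically."""
    start = Job(name="start-cluster")
    run = Job(name=f"run-{script}")
    close = Job(name="close-cluster")
    run.depends_on(start)
    close.depends_on(run)
    return [start, run, close]


# A stage framework that accepts a Job or a list of Jobs can now
# attach upstream dependencies to every returned job uniformly:
upstream = Job(name="upstream-stage")
jobs = dataproc_wrapper("my_script.py")
for job in jobs:
    job.depends_on(upstream)
```

Because the Stage implementation already accepts a list of jobs as a return value, callers that previously returned only the run job would just return this list instead.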

This issue primarily concerns cpg-utils, but it requires a counterpart change here.