populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Manual dependency setting required for DataProc #791

Open MattWellie opened 2 weeks ago

MattWellie commented 2 weeks ago

Issue

I found a few lines of code that were repeated multiple times in the codebase, mostly to no effect. Removing them has caused a huge DataProc expenditure.

Detail

inputs.get_jobs(target) is a call used in the pipeline to gather all the jobs belonging to prior stages in the workflow graph. This is applied to all jobs here, prior to the workflow being launched. It is also redundantly added in multiple locations, e.g. here in the seqr-loader workflow.
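
For context, the pattern looks roughly like this in Hail Batch terms. This is a minimal sketch of a simplified stage function: only inputs.get_jobs(target) and the Hail Batch depends_on call are taken from the actual code, everything else is illustrative.

```python
import hailtop.batch as hb


def schedule_stage_job(batch: hb.Batch, inputs, target):
    """Illustrative sketch: gather jobs from prior stages and wire them in
    as dependencies of this stage's job."""
    upstream_jobs = inputs.get_jobs(target)  # all jobs from prior stages

    j = batch.new_job(name=f'process-{target}')
    j.command('echo "do the actual stage work"')

    # the workflow wrapper already applies this to every job before launch,
    # so repeating it inside an individual Stage is redundant
    j.depends_on(*upstream_jobs)
    return j
```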

During development of the GATK-SV and gCNV workflows I used some of these existing stages as a template, including the manual setting of dependencies. I recently spotted that the same dependencies were being applied redundantly, and removed the in-Stage dependency setting. The workflow graph was unaffected (Stages were not executed out of order as a result).

Just now I spotted an issue: the way we start DataProc clusters sits outside the main Stage wrapper framework. There was an input to the DataProc cluster generation which manually collected all dependent jobs from prior stages and added them as dependencies to the DataProc start-up job, ensuring that the cluster was only started when it was actually needed. Without those dependencies set, the clusters were all started at the workflow outset and remained up until the workflow completed.
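
For illustration, the call site looked something like the sketch below. The helper name add_dataproc_job and the script path are hypothetical stand-ins, not the real API; the point is the manually supplied depends_on list that my change removed.

```python
# Hypothetical call site for a DataProc-backed stage.
upstream_jobs = inputs.get_jobs(target)  # jobs from prior stages

dataproc_job = add_dataproc_job(
    batch=batch,
    script='scripts/load_to_seqr.py',  # hypothetical script path
    # with this list, the cluster start-up job waits for upstream work;
    # without it, the start-up job has no parents and runs at workflow outset
    depends_on=upstream_jobs,
)
```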

When testing the changes initially, I was using the V-C cohort, which does not feature in Seqr, so the DataProc clusters were not involved at all. The workflow was unaffected by the changes, and I was satisfied that the manual dependency setting was not required for correct generation of a workflow graph. That testing was not adequate for the changes specific to DataProc cluster start-up.

DataProc Jobs

The reason DataProc jobs occur outside the workflow graph is that we use helper functions to handle DataProc start-up and tear-down. When we 'add a dataproc job', we really create 3 jobs:

  1. start up the cluster (this uses the manually passed depends_on job list to control start-up)
  2. add an always-run job to tear down the DataProc cluster. Always-run ensures the clusters are killed whether or not the workflow succeeds
  3. add the actual functional job between 1 & 2

This helper method only returns one job, the functional one, and the cpg-workflows stage scheduling only automatically controls dependencies on that one job. My changes removed the dependency constraint on the start-up job, which led to premature execution.
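
Roughly, the helper follows the pattern sketched below. This is a simplified illustration, not the actual helper code; the gcloud commands are echoed placeholders and add_dataproc_job is a hypothetical name.

```python
import hailtop.batch as hb


def add_dataproc_job(batch: hb.Batch, script: str, depends_on=None):
    """Illustrative three-job pattern: cluster start-up, the functional job,
    and an always-run tear-down."""
    start_j = batch.new_job('dataproc-start')
    start_j.command('echo "gcloud dataproc clusters create ..."')
    if depends_on:
        # without this manual list the start-up job has no parents,
        # so it runs as soon as the workflow is launched
        start_j.depends_on(*depends_on)

    main_j = batch.new_job('dataproc-submit')
    main_j.command(f'echo "gcloud dataproc jobs submit pyspark {script} ..."')
    main_j.depends_on(start_j)

    stop_j = batch.new_job('dataproc-stop')
    stop_j.command('echo "gcloud dataproc clusters delete ..."')
    stop_j.depends_on(main_j)
    stop_j.always_run()  # tear the cluster down even if the workflow fails

    # only the functional job is returned, so the Stage scheduler can only
    # attach its automatic dependencies to this one job
    return main_j
```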

MattWellie commented 2 weeks ago

Avoided/mitigated by #790, but leaving open for discussion