populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

DatasetStages don't work in a MultiCohort world #852

Closed MattWellie closed 3 months ago

MattWellie commented 3 months ago

https://batch.hail.populationgenomics.org.au/batches/471684/jobs/1

This run contains 14 Cohorts and ~15 Datasets, but when it reached the AnnotateDataset DatasetStage it tried to queue ~150 separate DatasetStage jobs. This is because the current implementation of DatasetStage in a MultiCohort iterates over each Cohort, then over each Dataset within that Cohort.

https://github.com/populationgenomics/production-pipelines/blob/a3dfe6172d8c3ff191b17ca49756225564aceadb/cpg_workflows/workflow.py#L1386-L1391

This needs to be restructured so that each DatasetStage runs once per Dataset, collected across all Cohorts. The Dataset concept is not hierarchically 'below' the Cohort; it's a different type of grouping, so when finding Datasets we shouldn't look one Cohort at a time, we should look across all samples in this analysis, i.e.

MultiCohort contains COH1 & COH2

COH1 contains 
- Dataset1: CPGA, CPGB
- Dataset2: CPGC

COH2 contains 
- Dataset1: CPGD, CPGE
- Dataset2: CPGF

When a DatasetStage runs, it should run: 
- once on Dataset1, the collection of SGs [CPGA, CPGB, CPGD, CPGE]
- once on Dataset2, the collection of SGs [CPGC, CPGF]

Instead of the current behaviour:

- once on Dataset1, SGs [CPGA, CPGB]
- again on Dataset1, SGs [CPGD, CPGE]
- once on Dataset2, SGs [CPGC]
- again on Dataset2, SGs [CPGF]
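
The intended merge can be sketched in a few lines. This is a hypothetical illustration (the dict-based representation and the function name are mine, not the real `cpg_workflows` classes): Datasets with the same name are collected across all Cohorts in the MultiCohort, so a DatasetStage would be queued once per Dataset rather than once per (Cohort, Dataset) pair.

```python
# Illustrative sketch only: merge SequencingGroup IDs by Dataset name
# across every Cohort in a MultiCohort.
from collections import defaultdict


def datasets_across_cohorts(
    multicohort: dict[str, dict[str, list[str]]],
) -> dict[str, list[str]]:
    """multicohort maps cohort name -> {dataset name -> [SG IDs]}."""
    merged: dict[str, list[str]] = defaultdict(list)
    for cohort in multicohort.values():
        for dataset_name, sg_ids in cohort.items():
            merged[dataset_name].extend(sg_ids)
    return dict(merged)


mc = {
    'COH1': {'Dataset1': ['CPGA', 'CPGB'], 'Dataset2': ['CPGC']},
    'COH2': {'Dataset1': ['CPGD', 'CPGE'], 'Dataset2': ['CPGF']},
}
print(datasets_across_cohorts(mc))
# {'Dataset1': ['CPGA', 'CPGB', 'CPGD', 'CPGE'], 'Dataset2': ['CPGC', 'CPGF']}
```

With this grouping, the example above produces exactly two DatasetStage runs instead of four.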
MattWellie commented 3 months ago

I've poked around in this, and it's actually going to be a lot of work to resolve...

Currently the structure we have is

MultiCohort {1}
-- Cohort {1, many}
---- Dataset {1, many}
------ SequencingGroup {1, many}

(IMO) This needs to be restructured to

MultiCohort {1}
-- Cohort {1, many}
---- SequencingGroup {1, many}
-- Dataset {1, many}
---- SequencingGroup {1, many}

Reflecting that both DatasetStages and CohortStages run on a subset of the overall MultiCohort SG group. We do not need to run them in the context of each other: a DatasetStage will never run on 'the portion of this Cohort which is also in this Dataset', and likewise a CohortStage will run on all SGIDs in the Cohort, crossing all Dataset boundaries.
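
The proposed structure could look something like the following dataclass sketch. The class and attribute names are illustrative, not the actual `cpg_workflows` definitions; the point is that Cohort and Dataset become sibling groupings directly under MultiCohort, each holding its own subset of the overall SequencingGroup pool.

```python
# Illustrative sketch of the proposed restructure: Cohort and Dataset
# are parallel groupings under MultiCohort, not nested in each other.
from dataclasses import dataclass, field


@dataclass
class SequencingGroup:
    id: str


@dataclass
class Cohort:
    name: str
    sequencing_groups: list[SequencingGroup] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    sequencing_groups: list[SequencingGroup] = field(default_factory=list)


@dataclass
class MultiCohort:
    cohorts: list[Cohort] = field(default_factory=list)
    datasets: list[Dataset] = field(default_factory=list)

    def get_datasets(self) -> list[Dataset]:
        # A DatasetStage iterates here: once per Dataset, regardless of
        # how many Cohorts its SequencingGroups were drawn from.
        return self.datasets


mc = MultiCohort(
    cohorts=[
        Cohort('COH1', [SequencingGroup('CPGA'), SequencingGroup('CPGC')]),
        Cohort('COH2', [SequencingGroup('CPGD')]),
    ],
    datasets=[
        Dataset('Dataset1', [SequencingGroup('CPGA'), SequencingGroup('CPGD')]),
        Dataset('Dataset2', [SequencingGroup('CPGC')]),
    ],
)
```

Here `CPGA` appears in both COH1 and Dataset1 without either grouping owning the other, which is the relationship the restructure is trying to express.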

The concept of building Targets/Inputs for the pipeline is built around the MultiCohort > Cohort > Dataset > SGID structure, which has some complications:

  1. Currently a Dataset has a self.cohort attribute, instead of a self.multicohort attribute, so it has no connection to the top of the hierarchy
  2. When populating SG IDs in the analysis we query for each Cohort, then bin the results by Dataset. The only way (currently) to add new SGIDs into an analysis is as a property of a Dataset object
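
Both complications could be addressed together, roughly as sketched below. This is an assumption about the shape of the fix, not the real `cpg_workflows` API: the MultiCohort owns the Dataset registry, each Dataset holds a back-reference to the MultiCohort, and per-Cohort queries bin their results into those shared Dataset objects.

```python
# Hypothetical sketch: Dataset references the top-level MultiCohort,
# and Datasets are created/deduplicated at the MultiCohort level so
# per-Cohort queries can bin SG IDs into a shared Dataset object.
class MultiCohort:
    def __init__(self, name: str):
        self.name = name
        self.datasets: dict[str, 'Dataset'] = {}

    def create_dataset(self, name: str) -> 'Dataset':
        # Each Dataset exists exactly once per MultiCohort; repeated
        # calls (one per Cohort query) return the same object.
        if name not in self.datasets:
            self.datasets[name] = Dataset(name, multicohort=self)
        return self.datasets[name]


class Dataset:
    def __init__(self, name: str, multicohort: MultiCohort):
        self.name = name
        # Connection to the top of the hierarchy, replacing the
        # current self.cohort attribute.
        self.multicohort = multicohort
        self.sequencing_group_ids: list[str] = []


mc = MultiCohort('my-analysis')
# Query results for COH1, then COH2, bin into the same Dataset:
mc.create_dataset('Dataset1').sequencing_group_ids += ['CPGA', 'CPGB']
mc.create_dataset('Dataset1').sequencing_group_ids += ['CPGD', 'CPGE']
```

Because `create_dataset` deduplicates by name, binning the second Cohort's results extends the existing Dataset rather than spawning a duplicate.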
MattWellie commented 3 months ago

This will create problems where we currently have CohortStage -> DatasetStage transitions. The main instances of this I can see are in seqr_loader (where the CohortStages are really MultiCohortStages, they just haven't been updated yet), and in gCNV, where there genuinely are CohortStage -> DatasetStage transitions.

In seqr_loader this can be fixed by updating the code. In gCNV we can either temporarily patch the code to use MultiCohort stages consistently (on the basis that it's only run one Cohort at a time), or we can implement a merging step so the CNV VCFs from each separate Cohort are combined in a MultiCohortStage, and the aggregate result is then split out by Dataset (this is on the roadmap; we're planning to use Jasmine to pseudo-joint-call).

MattWellie commented 3 months ago

Doc here with a fleshed-out description of the issues which currently exist on main: https://docs.google.com/document/d/1Zu4QFxuR44M1alRZtA5M9WGROLEUg74gQWorDQjOvTs/edit