Closed by MattWellie 3 months ago
I've poked around in this, and it's actually going to be a lot of work to resolve...
Currently the structure we have is
MultiCohort {1}
-- Cohort {1, many}
---- Dataset {1, many}
------ SequencingGroup {1, many}
(IMO) This needs to be restructured to
MultiCohort {1}
-- Cohort {1, many}
---- SequencingGroup {1, many}
-- Dataset {1, many}
---- SequencingGroup {1, many}
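As a rough illustration of the proposed shape, the restructure could be sketched with simple dataclasses. These class names are simplifications of the real targets in cpg_workflows, not the actual implementation; the point is only that Cohorts and Datasets become sibling groupings over the same SequencingGroup pool, rather than nested levels of one hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class SequencingGroup:
    id: str


@dataclass
class Cohort:
    name: str
    sequencing_groups: list[SequencingGroup] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    sequencing_groups: list[SequencingGroup] = field(default_factory=list)


@dataclass
class MultiCohort:
    # Cohorts and Datasets sit side by side under the MultiCohort:
    # the same SequencingGroup can appear in one Cohort and one Dataset,
    # and neither grouping is nested inside the other.
    cohorts: list[Cohort] = field(default_factory=list)
    datasets: list[Dataset] = field(default_factory=list)
```

A single SG then belongs to exactly one Cohort and one Dataset, both reachable directly from the MultiCohort.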
This reflects the fact that both DatasetStages and CohortStages run on a subset of the overall MultiCohort SG group, but never in the context of each other: a DatasetStage will never run on 'the portion of this Cohort which is also in this Dataset', and likewise a CohortStage runs on all SGIDs in the Cohort, crossing all Dataset boundaries.
Target/Input construction for the pipeline is currently built around the MultiCohort > Cohort > Dataset > SGID hierarchy, which creates some complications:
This will create problems where we currently have CohortStage -> DatasetStage. The main instances of this I can see are in seqr_loader (where the CohortStages are really MultiCohortStages, they just haven't been updated yet), and gCNV where there really are CohortStage -> DatasetStage transitions.
In seqr_loader this can be fixed by updating the code. In gCNV we can either temporarily patch the code to use MultiCohort stages consistently (on the basis that it's only run one Cohort at a time), or implement a merging step so the CNV VCFs from each separate Cohort are combined in a MultiCohortStage, then the aggregate result is split out by Dataset (this is on the roadmap; we're planning to use Jasmine to pseudo-joint-call).
Doc here with a fleshed out description of the issues which currently exist on main https://docs.google.com/document/d/1Zu4QFxuR44M1alRZtA5M9WGROLEUg74gQWorDQjOvTs/edit
https://batch.hail.populationgenomics.org.au/batches/471684/jobs/1
This run contains 14 Cohorts and ~15 Datasets, but when it got to the AnnotateDataset DatasetStage it was trying to queue ~150 DatasetStage instances. This is because the current implementation of DatasetStage, when using a MultiCohort, iterates over each Cohort and then over each Dataset within each Cohort.
https://github.com/populationgenomics/production-pipelines/blob/a3dfe6172d8c3ff191b17ca49756225564aceadb/cpg_workflows/workflow.py#L1386-L1391
This needs to be restructured so that each DatasetStage runs once per Dataset (collected across all Cohorts). The Dataset concept is not hierarchically 'below' the Cohort; it's a different type of grouping, so when finding Datasets we shouldn't be looking one Cohort at a time, we should be looking across all samples in this analysis, i.e.
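The difference can be sketched as two ways of collecting Datasets before queueing stages. This is a hypothetical shape, not the real workflow.py code: it assumes each Cohort exposes a `get_datasets()` method and each Dataset has a `name` attribute, and it compares the current per-Cohort iteration against a deduplicated collection keyed by Dataset name.

```python
def datasets_naive(cohorts):
    """Current behaviour: one DatasetStage per (Cohort, Dataset) pair,
    so a Dataset shared by many Cohorts is queued many times over."""
    found = []
    for cohort in cohorts:
        for dataset in cohort.get_datasets():
            found.append(dataset)
    return found


def datasets_deduplicated(cohorts):
    """Proposed behaviour: collect each distinct Dataset exactly once,
    across all Cohorts in the MultiCohort."""
    by_name = {}
    for cohort in cohorts:
        for dataset in cohort.get_datasets():
            # first occurrence wins; later Cohorts seeing the same
            # Dataset don't add another stage invocation
            by_name.setdefault(dataset.name, dataset)
    return list(by_name.values())
```

With 14 Cohorts each touching overlapping Datasets, the naive version multiplies out to the ~150 stage invocations seen in the Batch above, while the deduplicated version queues one DatasetStage per distinct Dataset (~15).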