GATK-SV transition to MultiCohorts

MattWellie commented 6 days ago

Closes #811

Absolutely monstrous line count - this is almost entirely caused by merging the multisample_1 & 2 files (1530 lines deleted) into a single multisample file (1328 lines).

The other changes delete a few config files which are no longer required:

- stop_at_filter_batch.toml
- gatk_sv_sandwich.toml
- all_batch_names.toml
- genotypebatch.toml
- gatk_sv_multisample_1/2.toml (replaced by a unified gatk_sv_multisample.toml)

Creates a new write_ped_file method in MultiCohort(Stage?) (mirroring the same content in CohortStage)
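
For anyone unfamiliar with what a combined PED file involves, here is a minimal sketch of the idea. It is not the method added in this PR: the function names combine_ped_rows and write_ped are hypothetical, and dropping duplicate individuals is an assumption about how samples shared between cohorts might be handled. The only thing taken as given is the standard six-column, tab-separated PED layout (family, individual, father, mother, sex, phenotype).

```python
import csv
import io

# Standard PED layout: family, individual, father, mother, sex, phenotype.
PED_COLUMNS = 6


def combine_ped_rows(cohort_rows):
    """Merge per-cohort PED rows, keeping the first row seen per individual ID.

    cohort_rows maps a cohort name to a list of six-field rows.
    Hypothetical helper; the de-duplication rule is an assumption.
    """
    seen = set()
    combined = []
    for rows in cohort_rows.values():
        for row in rows:
            if len(row) != PED_COLUMNS:
                raise ValueError(f"expected {PED_COLUMNS} PED columns, got {len(row)}")
            individual_id = row[1]
            if individual_id not in seen:
                seen.add(individual_id)
                combined.append(row)
    return combined


def write_ped(rows, handle):
    """Write rows as tab-separated PED lines to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

Something like `write_ped(combine_ped_rows(rows_by_cohort), open_file)` would then produce one file covering every cohort.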

EddieLF commented 4 days ago

Hey Matt, this all looks pretty great, thanks for the effort you've put in to enable MultiCohort & Cohort inputs for these stages.

Just to make sure I understand, if we were to submit some kind of workflow with multiple cohorts like this:

[workflow]
input_cohorts = ["COH1", "COH2", "COH3",]
last_stages = ['AnnotateCohortSv']

and invoke the cpg_workflows/stages/gatk_sv/gatk_sv_multisample.py workflow with analysis runner

analysis-runner --config myconfig.toml ... python main.py gatk_sv_multisample

Is this what would happen?

For each cohort, do:
- MakeCohortCombinedPed
Then, for all cohorts combined as a single "MultiCohort", do:
- MakeMultiCohortCombinedPed
Then, for each cohort, do:
- GatherBatchEvidence -> ClusterBatch -> GenerateBatchMetrics -> FilterBatch
Then, for the single MultiCohort, do:
- MergeBatchSites -> CombineExclusionLists
Then, for each cohort, do:
- GenotypeBatch
Then, for the single MultiCohort, do:
- MakeCohortVcf -> FormatVcfForGatk -> JoinRawCalls -> SVConcordance -> GeneratePloidyTable -> FilterGenotypes -> UpdateStructuralVariantIDs -> AnnotateVcf -> AnnotateVcfWithStrvctvre -> SpiceUpSVIDs -> AnnotateCohortSv
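
To make the alternation above concrete, here is a rough sketch that expands it into an ordered list of (target, stage) pairs. The stage names come from the list above; the PHASES table and the plan() helper are purely illustrative and are not part of cpg_workflows. It also lays everything out serially, whereas I'd assume the real scheduler runs the per-cohort stages in parallel.

```python
# Illustrative only: the phase grouping and plan() are hypothetical,
# not cpg_workflows API. Stage names are copied from the list above.
PHASES = [
    ("cohort", ["MakeCohortCombinedPed"]),
    ("multicohort", ["MakeMultiCohortCombinedPed"]),
    ("cohort", ["GatherBatchEvidence", "ClusterBatch", "GenerateBatchMetrics", "FilterBatch"]),
    ("multicohort", ["MergeBatchSites", "CombineExclusionLists"]),
    ("cohort", ["GenotypeBatch"]),
    (
        "multicohort",
        [
            "MakeCohortVcf", "FormatVcfForGatk", "JoinRawCalls", "SVConcordance",
            "GeneratePloidyTable", "FilterGenotypes", "UpdateStructuralVariantIDs",
            "AnnotateVcf", "AnnotateVcfWithStrvctvre", "SpiceUpSVIDs", "AnnotateCohortSv",
        ],
    ),
]


def plan(cohorts):
    """Return (target, stage) pairs in one possible serial order."""
    multicohort = "+".join(cohorts)  # placeholder label for the combined target
    steps = []
    for level, stages in PHASES:
        targets = cohorts if level == "cohort" else [multicohort]
        for target in targets:
            for stage in stages:
                steps.append((target, stage))
    return steps


if __name__ == "__main__":
    for target, stage in plan(["COH1", "COH2", "COH3"]):
        print(f"{target}: {stage}")
```

With the three cohorts from the config above, each "cohort" phase fans out to three targets and each "multicohort" phase collapses back to one.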

MattWellie commented 4 days ago

> Is this what would happen?

Yeah, that's the biscuit. There's a bit of an hourglass shape to it, as the MergeBatchSites Stage sits in the middle of what's otherwise an entirely CohortStage workflow (all the Stages previously in multisample_1).

populationgenomics / production-pipelines

GATK-SV transition to MultiCohorts #812