:sparkles:MultiCohorts:sparkles:

vivbak commented 1 month ago

Closes https://github.com/populationgenomics/production-pipelines/issues/710

Currently, for each pipeline run, a ‘cohort’ is defined, comprising a list of one or more ‘datasets’, each comprising a list of one or more ‘sgs’ (sequencing groups).

In this PR, we propose that for each pipeline run a ‘multicohort’ will be defined, comprising a list of one or more ‘cohorts’, each comprising a list of one or more ‘datasets’, each comprising a list of one or more ‘sgs’. 🤯

This means that a user can specify a list of custom cohort IDs (rather than just one), and stages will be generated accordingly.
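To make the nesting concrete, here is a minimal sketch of the proposed hierarchy. The class and method names are illustrative only and are not the actual `targets.py` implementation; the cohort, dataset and sequencing group IDs are placeholders.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the real target classes; for illustration only.
@dataclass
class SequencingGroup:
    id: str


@dataclass
class Dataset:
    name: str
    sequencing_groups: list[SequencingGroup] = field(default_factory=list)


@dataclass
class Cohort:
    id: str
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class MultiCohort:
    cohorts: list[Cohort] = field(default_factory=list)

    def get_cohorts(self) -> list[Cohort]:
        """Return the cohorts in the order they were supplied."""
        return list(self.cohorts)

    def get_sequencing_group_ids(self) -> list[str]:
        """Flatten multicohort -> cohorts -> datasets -> sgs into a list of IDs."""
        return [
            sg.id
            for cohort in self.cohorts
            for dataset in cohort.datasets
            for sg in dataset.sequencing_groups
        ]


# Two placeholder cohorts, each with one dataset.
multicohort = MultiCohort(
    cohorts=[
        Cohort("COH1", [Dataset("dataset-a", [SequencingGroup("SG1"), SequencingGroup("SG2")])]),
        Cohort("COH2", [Dataset("dataset-b", [SequencingGroup("SG3")])]),
    ]
)
```

With a single cohort in the list, this collapses to the current cohort -> datasets -> sgs structure.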

The key changes here:

- The addition of a new MultiCohortStage, which will act on all of the cohorts at once.
- The addition of a new MultiCohort input, to facilitate the above.
- Switching out the existing cohort global variable for a multicohort.
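As a rough illustration of the difference between the two stage types, the sketch below assumes that a cohort-level stage is queued once per cohort, while a multicohort-level stage is queued once for the whole list. The function `expand_targets` and the `kind` strings are hypothetical and do not exist in `workflow.py`.

```python
def expand_targets(kind: str, cohort_ids: list[str]) -> list[tuple[str, ...]]:
    """Return one tuple of cohort IDs per invocation of a stage.

    Hypothetical helper: `kind` is "cohort" or "multicohort".
    """
    if not cohort_ids:
        raise ValueError("at least one cohort ID is required")
    if kind == "multicohort":
        # One invocation that sees every cohort at once.
        return [tuple(cohort_ids)]
    if kind == "cohort":
        # One invocation per cohort.
        return [(cohort_id,) for cohort_id in cohort_ids]
    raise ValueError(f"unknown stage kind: {kind!r}")
```

Under this assumption, a run with a single cohort yields exactly one invocation for either stage type, so the two only differ once several cohorts are supplied.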

It is important to note that we still support a non-cohort run of production pipelines, so some logic is needed to support the old (pre-multicohort) way of doing things for the time being.

TODO

metamist.py

[x] Create get_multi_cohort method

targets.py

[x] Create a new MultiCohort Target.
- [x] Add get_cohorts method
- [x] Add create_cohort method

inputs.py

[x] Modify get_cohort -> get_multicohort()
[x] Modify create_cohort() -> create_multi_cohort()
[x] Consider how to handle current instances of get_cohort() in a SequencingGroupStage where the dependent stage is a CohortStage, for example intervals_path=inputs.as_path(get_cohort(), PrepareIntervals, 'preprocessed'). *
[x] Add validation for input_cohorts to ensure it is a list; a bare string will be list-ified.
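A minimal sketch of the input_cohorts validation described in the last item, assuming the value arrives as either a single string or a list of strings. The function name `normalise_input_cohorts` is hypothetical, not the name used in `inputs.py`.

```python
def normalise_input_cohorts(value) -> list[str]:
    """Return input_cohorts as a list of cohort ID strings."""
    if isinstance(value, str):
        # A bare string is list-ified rather than rejected.
        return [value]
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    raise TypeError(
        f"input_cohorts must be a string or a list of strings, got {value!r}"
    )
```

Checking for `str` first matters: a string is itself iterable, so calling `list()` on it would split the ID into single characters.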

workflow.py

[x] Create MultiCohortStage
[x] Update all instances of get_cohort, as well as relevant implementation changes.
[x] Update set_stages, L1148
[x] Update queue jobs with checks
[x] Update _prefix
[x] Update queue for cohort
[x] Update batch name, L883

test_*.py

[x] Update Existing Tests

test_cohort.py

[x] Add test_multi_cohort()

sample_qc.py

[x] Switch get_cohort to get_inputs.

combiner.py

[x] Switch get_cohort to get_inputs

Note for VB: add co-author credit in merge.

\* Actually, I think this is fine, because it will return the outputs for each cohort, but we need to make sure that the new return structures here are suitable.

vivbak commented 3 weeks ago

We need to review all the existing Cohort stages to see if they are eligible (or more suited) to be a MultiCohort stage.

jmarshall commented 3 weeks ago

> We need to review all the existing Cohort stages to see if they are eligible (or more suited) to be a MultiCohort stage.

I meant to also say: I guess we can do this at our leisure and on a case-by-case basis. It probably doesn't need to be included in this base infrastructure update PR.

vivbak commented 3 weeks ago

> I meant to also say: I guess we can do this at our leisure and on a case-by-case basis. It probably doesn't need to be included in this base infrastructure update PR.

@jmarshall, the only issue would be if someone supplied multiple cohorts. If most users are using a single cohort at a time, then you're right, we can do it case by case.

populationgenomics / production-pipelines

:sparkles:MultiCohorts:sparkles: #764
