populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

:sparkles:MultiCohorts:sparkles: #764

Closed vivbak closed 1 week ago

vivbak commented 1 month ago

Closes https://github.com/populationgenomics/production-pipelines/issues/710

Currently, for each pipeline run a ‘cohort’ is defined, comprising a list of one or more ‘datasets’, each comprising a list of one or more sequencing groups (‘sgs’).

In this PR, we propose that for each pipeline run a ‘multicohort’ will be defined, comprising a list of one or more ‘cohorts’, each comprising a list of one or more ‘datasets’, each comprising a list of one or more ‘sgs’. 🤯

This means that a user can specify a list of custom cohort IDs (rather than just one), and stages will be generated accordingly.
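The nesting described above can be sketched with plain dataclasses. This is only an illustration of the hierarchy, not the code in this PR: the class names, field names, `build_multicohort` helper, and the IDs in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A named dataset holding one or more sequencing group (SG) IDs."""

    name: str
    sg_ids: list[str] = field(default_factory=list)


@dataclass
class Cohort:
    """A custom cohort: one or more datasets."""

    cohort_id: str
    datasets: list[Dataset] = field(default_factory=list)

    def all_sg_ids(self) -> list[str]:
        # Flatten SG IDs across every dataset in this cohort.
        return [sg for ds in self.datasets for sg in ds.sg_ids]


@dataclass
class MultiCohort:
    """Top-level target for a run: one or more cohorts."""

    cohorts: list[Cohort] = field(default_factory=list)

    def all_sg_ids(self) -> list[str]:
        # An SG may appear in more than one cohort; keep the first occurrence only.
        return list(dict.fromkeys(sg for c in self.cohorts for sg in c.all_sg_ids()))


def build_multicohort(members: dict[str, dict[str, list[str]]]) -> MultiCohort:
    """Build a MultiCohort from {cohort_id: {dataset_name: [sg_id, ...]}} (hypothetical input shape)."""
    return MultiCohort(
        cohorts=[
            Cohort(
                cohort_id=cohort_id,
                datasets=[Dataset(name=name, sg_ids=list(ids)) for name, ids in datasets.items()],
            )
            for cohort_id, datasets in members.items()
        ]
    )
```

For example, `build_multicohort({'COH1': {'ds-a': ['SG1', 'SG2']}, 'COH2': {'ds-a': ['SG2'], 'ds-b': ['SG3']}})` yields two cohorts whose combined, de-duplicated SG list is `['SG1', 'SG2', 'SG3']` (the IDs here are made up).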

The key changes here

It is important to note that we are still supporting non-cohort runs of production pipelines, which means there needs to be some logic to support the old way of doing things (pre-multi-cohorts) for the time being.
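One way the fallback between the two modes could look is sketched below. The config keys `input_cohorts` and `input_datasets`, the function name, and the mode labels are assumptions made for illustration; the actual logic in this PR may differ.

```python
def select_run_mode(workflow_config: dict) -> tuple[str, list[str]]:
    """Pick the multicohort path when cohort IDs are supplied, else the legacy path.

    The keys 'input_cohorts' and 'input_datasets' are hypothetical names.
    """
    cohort_ids = list(workflow_config.get('input_cohorts') or [])
    if cohort_ids:
        # New behaviour: build a multicohort from one or more custom cohort IDs.
        return 'multicohort', cohort_ids
    # Old behaviour (pre-multi-cohorts): fall back to the configured datasets.
    return 'legacy', list(workflow_config.get('input_datasets') or [])
```

Keeping the decision in one small function would make the legacy branch easy to remove once non-cohort runs are no longer supported.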

TODO

metamist.py

targets.py

inputs.py

workflow.py

test_*.py

test_cohort.py

sample_qc.py

combiner.py

Note for VB: add co-author credit in merge.

*Actually, I think this is fine, because it will return the outputs for each cohort, but we need to make sure that the new return structures here are suitable.

vivbak commented 3 weeks ago

We need to review all the existing Cohort stages to see if they are eligible (or more suited) to be a MultiCohort stage.

jmarshall commented 3 weeks ago

> We need to review all the existing Cohort stages to see if they are eligible (or more suited) to be a MultiCohort stage.

I meant to also say: I guess we can do this at our leisure and on a case-by-case basis. It probably doesn't need to be included in this base infrastructure update PR.

vivbak commented 3 weeks ago

> I meant to also say: I guess we can do this at our leisure and on a case-by-case basis. It probably doesn't need to be included in this base infrastructure update PR.

@jmarshall, the only issue would be if someone supplied multiple cohorts. If most users are using a single cohort at a time, then you're right, we can do it case by case.