populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Improve AIP accessibility #406

Closed MattWellie closed 1 month ago

MattWellie commented 3 months ago

Long term goal...

De-couple from CPG as much as possible - the ideal state is "AIP in a box", with a set of defined inputs and documentation on how to generate/format them

Longer term:

Simpler/less pressing:

illusional commented 3 months ago

FWIW, I'd probably define some sort of generic Provider and then have multiple implementations: one that finds data from metamist, one that pulls it from specific files, and any others that someone else writes to populate their own data. Then at runtime, you allow a user to specify which input provider to use.
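A minimal sketch of that Provider idea, to make the shape concrete. Every name here (`Participant`, `InputProvider`, `FileProvider`, `InMemoryProvider`, `make_provider`, and the TSV column names) is hypothetical and not taken from AIP or metamist; the in-memory class only stands in for a service-backed implementation.

```python
import csv
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, TextIO, Type


@dataclass
class Participant:
    """Hypothetical record of what the pipeline needs per individual."""
    sample_id: str
    external_id: str
    hpo_terms: List[str] = field(default_factory=list)


class InputProvider(ABC):
    """The generic contract: every provider returns the same record type."""

    @abstractmethod
    def get_participants(self) -> List[Participant]:
        ...


class FileProvider(InputProvider):
    """Reads participants from a TSV with assumed column names."""

    def __init__(self, handle: TextIO):
        self.handle = handle

    def get_participants(self) -> List[Participant]:
        reader = csv.DictReader(self.handle, delimiter="\t")
        return [
            Participant(
                sample_id=row["sample_id"],
                external_id=row["external_id"],
                # An empty cell means no phenotype terms were recorded
                hpo_terms=[t for t in row["hpo_terms"].split(",") if t],
            )
            for row in reader
        ]


class InMemoryProvider(InputProvider):
    """Stand-in for a service-backed provider, e.g. one querying metamist."""

    def __init__(self, participants: Iterable[Participant]):
        self.participants = list(participants)

    def get_participants(self) -> List[Participant]:
        return self.participants


# Registry mapping the name a user passes at runtime to an implementation
PROVIDERS: Dict[str, Type[InputProvider]] = {
    "file": FileProvider,
    "memory": InMemoryProvider,
}


def make_provider(name: str, **kwargs) -> InputProvider:
    """Build the provider the user asked for, or fail with the valid choices."""
    if name not in PROVIDERS:
        raise ValueError(f"Unknown provider {name!r}; choose from {sorted(PROVIDERS)}")
    return PROVIDERS[name](**kwargs)
```

The rest of the pipeline would only ever call `get_participants()`, so swapping the data source becomes a runtime choice rather than a code change.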

You could go fancy and discover these through Python entry points (best tutorial ever: https://amir.rachum.com/python-entry-points/). In fact, it's actually how metamist parsers work for our ETL process: https://github.com/populationgenomics/metamist/blob/e23e45c290ebb4d1125b9530ee93a6e4c1cc2736/etl/load/main.py#L425.
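A rough sketch of what entry-point discovery could look like using the standard library's `importlib.metadata`. The group name `aip.input_providers` and the function names are assumptions made for illustration; this is not how metamist or AIP actually implement it.

```python
from importlib.metadata import entry_points

# Hypothetical group name. A third-party package would advertise a provider
# in its own pyproject.toml along these lines:
#
#   [project.entry-points."aip.input_providers"]
#   my_lab = "my_lab_package.providers:MyLabProvider"
PROVIDER_GROUP = "aip.input_providers"


def discover_providers(group: str = PROVIDER_GROUP) -> dict:
    """Map entry-point name to EntryPoint for every installed plugin in the group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_provider(name: str, group: str = PROVIDER_GROUP):
    """Import and return the class registered under `name`."""
    available = discover_providers(group)
    if name not in available:
        raise ValueError(f"No provider {name!r} installed; found {sorted(available)}")
    # The plugin's module is only imported at this point, on demand
    return available[name].load()
```

With this in place, installing a package that declares an entry point in the group is enough to make its provider selectable by name, with no change to the core application.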

MattWellie commented 3 months ago

Ahhhh I remember that tutorial! Very cool sneks

I think as a first pass I'd like to try abstracting all the elements that are tightly coupled to specific non-generic services. For example, we get pedigrees, external IDs and phenotypes from metamist, so we would create a stage in prod-pipes which uses metamist to write an intermediate file with pedigree data & phenotypes, and AIP would just maintain a contract against a PED file with 2 additional cols. Bonus points if that builds on top of existing standards, like ingesting that information through phenopackets.
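To illustrate the "PED file with 2 additional cols" contract, here is a hedged sketch of a parser. It assumes the six standard PED columns (family, individual, father, mother, sex, affected status) followed by an external ID and a comma-separated list of HPO terms; the order and format of the two extra columns are assumptions, as are the names `PedRow` and `parse_extended_ped`.

```python
import csv
from dataclasses import dataclass
from typing import List, TextIO

EXPECTED_COLUMNS = 8  # 6 standard PED columns + 2 assumed extras


@dataclass
class PedRow:
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    affected: str
    external_id: str      # assumed extra column 1
    hpo_terms: List[str]  # assumed extra column 2, comma-separated in the file


def parse_extended_ped(handle: TextIO) -> List[PedRow]:
    """Parse a tab-delimited extended PED, rejecting malformed rows early."""
    rows = []
    for line_number, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip blank lines and comment/header lines
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"Line {line_number}: expected {EXPECTED_COLUMNS} columns, got {len(fields)}"
            )
        *core, external_id, hpo = fields
        rows.append(
            PedRow(
                *core,
                external_id=external_id,
                # An empty final column means no phenotypes were supplied
                hpo_terms=[term for term in hpo.split(",") if term],
            )
        )
    return rows
```

Failing loudly on the column count is the point of the contract: whatever upstream stage generates the file, the application can check the format once at the boundary and assume it holds thereafter.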

The ambition is that we create a very tightly defined set of generic(ish) files needed to run AIP end to end, and the application itself can sit in a docker container with as few additional dependencies as possible and run as a black box.

MattWellie commented 1 month ago

Partially/mostly done -