Improve AIP accessibility

MattWellie commented 3 months ago

Long term goal...

De-couple from CPG as much as possible - the ideal state is "AIP in a box", with a set of defined inputs and documentation on how to generate/format them

Take the PED, external IDs and phenotypes from a single file - don't expect to generate this at runtime using Metamist. That file then becomes the input for panel-HPO matching and PanelApp queries.
Fully define input/output paths & folders - don't expect to generate paths on the fly using output_path or similar.
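
As a rough illustration of "fully defined" paths, something like the following could hold every location a run touches, declared up front instead of derived at runtime. The class and field names here are hypothetical, not anything that exists in AIP:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class RunPaths:
    """Every path a run reads or writes, stated explicitly (all names hypothetical)."""

    pedigree: Path    # single file carrying PED columns, external IDs and phenotypes
    input_data: Path  # variant data to analyse
    output_dir: Path  # folder that receives all results

    def missing_inputs(self) -> List[Path]:
        """List the declared inputs that do not exist, so a run can fail early."""
        return [p for p in (self.pedigree, self.input_data) if not p.exists()]
```

A caller would build one `RunPaths` from the command line or a config file and refuse to start if `missing_inputs()` is non-empty.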

Longer term:

Annotated VCF as source data for the process (SVs are... less simple). Requires a way of ingesting annotated VCFs into a MT, specifically a way to split the VEP CSQ field contents.
Define an annotation contract, e.g. a dependency on VEP nnn, with a way of providing ClinVar/SpliceAI annotations.
...?
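
For the CSQ part specifically, a minimal plain-Python sketch of the splitting logic might look like this. It assumes the usual VEP layout (field order given after "Format: " in the CSQ header description, consequences comma-separated, fields pipe-separated); the function names and example values are made up, and getting the result into a MT is a separate step not shown here:

```python
from typing import Dict, List


def csq_field_names(description: str) -> List[str]:
    """Pull the pipe-delimited field names out of the CSQ INFO header description."""
    # VEP states the field order after "Format: " in the header Description
    return description.split("Format: ", 1)[1].strip().strip('"').split("|")


def split_csq(csq_value: str, fields: List[str]) -> List[Dict[str, str]]:
    """Turn one variant's CSQ string into one dict per consequence."""
    # consequences are comma-separated; fields within each are pipe-separated
    return [dict(zip(fields, entry.split("|"))) for entry in csq_value.split(",")]


# Invented example values, trimmed to four fields for readability
header = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL"
fields = csq_field_names(header)
rows = split_csq("A|missense_variant|MODERATE|GENE1,A|intron_variant|MODIFIER|GENE2", fields)
```

Reading the field order from the header, not hard-coding it, is one way an annotation contract could be checked at ingest time.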

Simpler/less pressing:

Swap out Peddy for peds - we only need a PED file object, the extra Peddy functionality just makes the install weightier, for no benefit in this project

illusional commented 3 months ago

FWIW, I'd probably define some sort of generic Provider and then have multiple implementations: one that finds data from metamist, and one that pulls it from specific files, which someone else could extend to populate their own data. At runtime, you then allow a user to specify which input provider to use.
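
Very roughly, and with every name below being hypothetical (including the two-column file format), the shape would be something like:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class InputProvider(ABC):
    """Hypothetical interface: anything that can hand over per-sample inputs."""

    @abstractmethod
    def phenotypes(self) -> Dict[str, List[str]]:
        """Map each sample ID to its HPO terms."""


class FileProvider(InputProvider):
    """Reads 'sample<TAB>HPO,HPO' lines from a local file (format is an assumption)."""

    def __init__(self, path: str):
        self.path = path

    def phenotypes(self) -> Dict[str, List[str]]:
        result: Dict[str, List[str]] = {}
        with open(self.path) as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                sample, _, terms = line.rstrip("\n").partition("\t")
                result[sample] = [t for t in terms.split(",") if t]
        return result


# A metamist-backed implementation would be registered here as well
PROVIDERS = {"file": FileProvider}


def get_provider(name: str, **kwargs) -> InputProvider:
    """Pick the implementation the user asked for at runtime."""
    return PROVIDERS[name](**kwargs)
```

The rest of the pipeline would only ever see `InputProvider`, so swapping the data source becomes a config choice.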

You could go fancy and discover these through python entrypoints (best tutorial ever: https://amir.rachum.com/python-entry-points/) - in fact, it's actually how metamist parsers work for our ETL process: https://github.com/populationgenomics/metamist/blob/e23e45c290ebb4d1125b9530ee93a6e4c1cc2736/etl/load/main.py#L425.
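
A minimal sketch of the discovery side, using the standard library's `importlib.metadata`. The group name `aip.input_providers` is invented for illustration; a plugin package would have to declare its provider under that same group in its own packaging metadata:

```python
from importlib.metadata import entry_points

GROUP = "aip.input_providers"  # hypothetical group name, not an existing one


def discover_providers(group: str = GROUP) -> dict:
    """Return {entry point name: loaded object} for plugins registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

With nothing installed under the group this just returns an empty dict, so the built-in providers can stay as the default.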

MattWellie commented 3 months ago

Ahhhh I remember that tutorial! Very cool sneks

I think as a first pass I'd like to try abstracting all the elements that are tightly coupled to specific non-generic services. For example, we get pedigrees, ext. IDs and phenotypes from metamist, so we'd create a stage in prod-pipes which uses metamist to write an intermediate file with pedigree data & phenotypes, and AIP would just maintain a contract against a PED file with 2 additional cols. Bonus points if that builds on existing standards, like ingesting that information through phenopackets.
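
To make that contract concrete, here is one way reading such a file could look. It assumes the six standard PED columns followed by an external ID and a comma-separated list of HPO terms, tab-delimited; that column order, the names, and the example rows are all assumptions, not a settled format:

```python
import csv
import io
from typing import List, NamedTuple


class Member(NamedTuple):
    family: str
    sample: str
    father: str
    mother: str
    sex: str
    affected: str
    ext_id: str           # assumed 7th column
    hpo_terms: List[str]  # assumed 8th column, comma-separated


def read_extended_ped(handle) -> List[Member]:
    """Parse a tab-delimited PED with two extra columns, rejecting malformed rows."""
    members = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        if len(row) != 8:
            raise ValueError(f"expected 8 columns, got {len(row)}: {row}")
        terms = [t for t in row[7].split(",") if t]
        members.append(Member(*row[:7], hpo_terms=terms))
    return members


# Invented example rows
example = io.StringIO(
    "FAM1\tS1\tS2\tS3\t1\t2\tEXT001\tHP:0001250,HP:0001263\n"
    "FAM1\tS2\t0\t0\t1\t1\tEXT002\t\n"
)
members = read_extended_ped(example)
```

Failing loudly on the wrong column count is deliberate here: if the file is the whole contract, a malformed row should stop the run, not get silently skipped.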

The ambition is that we create a very tightly defined set of generic(ish) files needed to run AIP end to end, and the application itself can sit in a docker container with as few additional dependencies as possible and run as a black box.

MattWellie commented 1 month ago

Partially/mostly done -

populationgenomics / automated-interpretation-pipeline

Improve AIP accessibility #406