FWIW, I'd probably define some sort of generic Provider and then have multiple implementations - one that can find data from metamist, one that pulls it from specific files - which someone else could extend to populate their own data (and then at runtime, you let the user specify which input provider to use).
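A minimal sketch of that pattern, assuming names like `InputProvider`, `MetamistProvider`, `FileProvider` and `get_pedigree` purely for illustration (none of these exist in the codebase yet):

```python
# Minimal sketch of the suggested pattern; InputProvider, MetamistProvider,
# FileProvider and get_pedigree are hypothetical names, not existing code.
from abc import ABC, abstractmethod
from pathlib import Path


class InputProvider(ABC):
    """Generic interface that AIP codes against."""

    @abstractmethod
    def get_pedigree(self, dataset: str) -> list[dict]:
        """Return pedigree rows for a dataset."""


class MetamistProvider(InputProvider):
    """Implementation that queries metamist (CPG-internal)."""

    def get_pedigree(self, dataset: str) -> list[dict]:
        raise NotImplementedError('would call the metamist API here')


class FileProvider(InputProvider):
    """Implementation that reads the same data from user-supplied files."""

    def __init__(self, ped_path: Path):
        self.ped_path = ped_path

    def get_pedigree(self, dataset: str) -> list[dict]:
        # parse a local PED-like file here (format discussed further down)
        raise NotImplementedError


# at runtime, a CLI flag or config value picks the implementation
PROVIDERS: dict[str, type[InputProvider]] = {
    'metamist': MetamistProvider,
    'files': FileProvider,
}
```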
You could go fancy and discover these through python entrypoints (best tutorial ever: https://amir.rachum.com/python-entry-points/) - in fact, it's actually how metamist parsers work for our ETL process: https://github.com/populationgenomics/metamist/blob/e23e45c290ebb4d1125b9530ee93a6e4c1cc2736/etl/load/main.py#L425
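To sketch what the entry-point route could look like - the `aip.providers` group name and the `load_provider` helper are made up for illustration, and each package would register its provider class under that group in its own pyproject.toml:

```python
# Hypothetical sketch of entry-point discovery (Python 3.10+ importlib API);
# the 'aip.providers' group name is an assumption, not an existing convention.
from importlib.metadata import entry_points


def load_provider(name: str):
    """Find a provider class that some installed package registered under 'aip.providers'."""
    for ep in entry_points(group='aip.providers'):
        if ep.name == name:
            return ep.load()
    raise ValueError(f'no input provider registered under {name!r}')
```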
Ahhhh I remember that tutorial! Very cool sneks
I think as a first pass I'd like to try abstracting all the elements that are tightly coupled to specific non-generic services, i.e. we get pedigrees, ext. IDs and phenotypes from metamist, so create a stage in prod-pipes which uses metamist to write an intermediate file with pedigree data & phenotypes, and AIP would just maintain a contract against a PED file with 2 additional cols. Bonus points if that develops on top of existing standards, e.g. ingesting that information through phenopackets.
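Roughly the sort of contract I'm imagining for that file - treating the two extra columns as external ID and comma-separated phenotype terms is just an assumption here, and `read_extended_ped` is a made-up helper:

```python
# Rough sketch of the extended-PED contract: the six standard PED columns plus
# two extra ones, assumed here to be external ID and comma-separated phenotype
# (e.g. HPO) terms; read_extended_ped is a hypothetical helper name.
import csv
from dataclasses import dataclass, field


@dataclass
class PedRow:
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    affected: str
    external_id: str = ''
    phenotypes: list[str] = field(default_factory=list)


def read_extended_ped(path: str) -> list[PedRow]:
    rows = []
    with open(path) as handle:
        for record in csv.reader(handle, delimiter='\t'):
            *standard, ext_id, phenotypes = record
            rows.append(
                PedRow(
                    *standard,
                    external_id=ext_id,
                    phenotypes=phenotypes.split(',') if phenotypes else [],
                )
            )
    return rows
```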
The ambition is that we define a very tight set of generic(ish) files needed to run AIP end to end; the application itself can then sit in a Docker container with as few additional dependencies as possible and run as a black box.
Partially/mostly done.
Long term goal...

- De-couple from CPG as much as possible - the ideal state is "AIP in a box", with a set of defined inputs and documentation on how to generate/format them
- `output_path` or similar

Longer term:

Simpler/less pressing: