MattWellie closed this 1 year ago
We have to support missing metadata, so unless you want to fail the whole dataset when any participant has missing data, I'd advocate against (1).
If it's critical to have metadata, could the pipeline exclude the participants that lack it? What is the minimum set of metadata required to be actionable?
I was just throwing these points up as options, but only 3 is reasonable IMO. It makes sense that not all metadata is present, and the endpoint has a specific purpose to serve the metadata it does have. Serving empty objects for every participant just for this use case would be wild.
The whole 'per-person panels' behaviour is an optional extra, but I'm going to force a condition: if you provide personalised HPO-matched panels for any participant, you need to do the same for every affected participant in the cohort, even if an entry only contains the default panel (mendeliome). I've got a helper script that generates this data, and I've altered it to provide an entry for all probands in the cohort, even those without Metamist HPO data.
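For anyone following along, the rule above can be sketched roughly like this. This is not the actual AIP helper script; the function name, the input shapes, and the `"mendeliome"` placeholder value are all assumptions made for illustration.

```python
# Illustrative sketch only: names and data shapes are hypothetical,
# not taken from the AIP codebase.
DEFAULT_PANEL = "mendeliome"  # placeholder identifier for the default panel


def fill_missing_panels(affected_ids, matched_panels, default_panel=DEFAULT_PANEL):
    """Return a panel mapping with an entry for every affected participant.

    affected_ids:   iterable of participant IDs taken from the pedigree
    matched_panels: dict of participant ID -> list of HPO-matched panels;
                    participants without logged HPO data may be absent
    """
    complete = {}
    for participant in affected_ids:
        # Start from whatever was matched (possibly nothing), then make
        # sure the default panel is always present.
        panels = set(matched_panels.get(participant, ()))
        panels.add(default_panel)
        complete[participant] = sorted(panels)
    return complete


example = fill_missing_panels(["P1", "P2"], {"P1": ["panel_a"]})
```

Here `P2` has no HPO-matched panels, so it still gets an entry holding only the default panel, which is the condition described above.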
TL;DR: don't sweat it, no Metamist changes required. I was just spitballing changes, but I've solved this issue within AIP.
There's a discrepancy in data served by different endpoints:
When pulling pedigree data, we receive data for every sample in the pedigree. This data is also used to guide the content of the joint call, and used during MatrixTable filtering, so the pedigree data is in sync with the participants present in the variant data.
When generating personalised panel data using the seqr metadata endpoint, we receive data for every sample which has logged metadata (HPO terms, etc.)
Data loaded into this table is not necessarily in sync with the full pedigree for the same cohort. This leads to a potential issue downstream where some participants don't have any entry in the phenotype-matched panel data result, which causes things to fall over.
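As a rough illustration of the mismatch, the check below lists participants that appear in the pedigree but have no entry in the panel data. The function name and input shapes are hypothetical; neither endpoint's real response format is assumed here.

```python
# Illustrative sketch only: inputs are plain Python collections, not the
# actual responses from the pedigree or seqr metadata endpoints.
def find_unmatched(pedigree_ids, panel_data):
    """Return pedigree participants with no entry in the panel data.

    pedigree_ids: iterable of participant IDs from the pedigree
    panel_data:   dict keyed by participant ID (only those with metadata)
    """
    return sorted(set(pedigree_ids) - set(panel_data))


missing = find_unmatched(["S1", "S2", "S3"], {"S1": [], "S3": []})
```

Any ID returned by a check like this is a participant the downstream step would fail to look up.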
Potential solutions: