populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Personalised Panels rigidity #195

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

There's a discrepancy in the data served by different endpoints:

When pulling pedigree data, we receive data for every sample in the pedigree. This data is also used to guide the content of the joint call, and used during MatrixTable filtering, so the pedigree data is in sync with the participants present in the variant data.

When generating personalised panel data using the seqr metadata endpoint, we receive data for every sample which has logged metadata (HPO terms, etc.)

Data loaded into this table is not necessarily in sync with the full pedigree for the same cohort. This can leave some participants without any entry in the phenotype-matched panel data, which causes downstream steps to fall over.
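
A minimal sketch of the mismatch, assuming the pedigree is available as a list of participant IDs and the personalised panel data as a dict keyed by participant ID. The function name, the IDs, and the `panels` key are all hypothetical and are not taken from the AIP code base:

```python
def find_missing_participants(pedigree_ids, panel_data):
    """Return pedigree participants with no entry in the panel data, sorted."""
    return sorted(set(pedigree_ids) - set(panel_data))


# Hypothetical cohort: three participants in the pedigree,
# but only one has metadata logged and so appears in the panel data.
pedigree_ids = ['FAM1_proband', 'FAM1_mother', 'FAM2_proband']
panel_data = {'FAM1_proband': {'panels': ['mendeliome', 'panel_a']}}

missing = find_missing_participants(pedigree_ids, panel_data)
print(missing)
```

Any ID reported here is a participant present in the variant data for whom a direct lookup in the panel data would fail.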

Potential solutions:

  1. Is there something in metamist to do here? Should we have missing metadata for part of the cohort?
  2. Make reading personalised panels flexible, allowing for samples to be missing from this object? (will this have any implications on gene list filtering? probably...)
  3. Force every participant in the Pedigree into the personalised panels JSON, even if they only have the default panel added.

illusional commented 1 year ago

We have to support missing metadata, so unless you want to fail the dataset if any participant has missing data, I'd advocate against (1).

If it's critical to have metadata, then could the pipeline exclude them? What is the minimum set of metadata required to be actionable?

MattWellie commented 1 year ago

I was just throwing these points up as options, but only 3 is reasonable IMO. It makes sense that not all metadata is present, and the endpoint has a specific purpose to serve the metadata it does have. Serving empty objects for every participant just for this use case would be wild.

The whole 'per-person panels' behaviour is an optional extra, but I'm going to force a condition that if you provide personalised HPO-matched panels for any participant, you need to do the same for every affected participant in a cohort, even if this only contains a default panel (mendeliome). I've got a helper script that generates this data, and I've altered it to provide for all probands in the cohort, even those without metamist HPO data.
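
Roughly, the backfill looks like the sketch below. This is an illustration rather than the actual helper script: the function name, the `hpo_terms` and `panels` keys, and the `'mendeliome'` placeholder ID are all assumptions made for the example.

```python
# Placeholder identifier for the default panel (assumption, not the real panel ID).
DEFAULT_PANEL = 'mendeliome'


def backfill_default_panels(affected_ids, panel_data, default_panel=DEFAULT_PANEL):
    """Return panel data with an entry for every affected participant.

    Participants without HPO metadata get an entry holding only the default
    panel; existing entries keep their panels and gain the default if absent.
    """
    filled = {}
    for pid in affected_ids:
        entry = panel_data.get(pid, {})
        panels = list(entry.get('panels', []))
        if default_panel not in panels:
            panels.insert(0, default_panel)
        filled[pid] = {
            'hpo_terms': list(entry.get('hpo_terms', [])),
            'panels': panels,
        }
    return filled


# Hypothetical input: only one of two probands has HPO-matched panels.
panel_data = {'FAM1_proband': {'hpo_terms': ['HP:0001250'], 'panels': ['panel_a']}}
filled = backfill_default_panels(['FAM1_proband', 'FAM2_proband'], panel_data)
```

With this in place, every affected participant can be looked up directly, and the only difference for those without metadata is that their panel list contains just the default.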

TL;DR: don't sweat it, no Metamist changes required. I was just spitballing options, but I've solved this issue within AIP.