populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Split Category 4 #24

Closed MattWellie closed 1 year ago

MattWellie commented 2 years ago

Currently Cat. 4 is designed to consider high-impact in silico variants as well as de novo variants. This category should be split in two: one category for de novo variants, and one for high-impact in silico variants.

How will this work?

How will this be substantially different from the current Monoallelic check? It won't be... yet, but de novo and monoallelic will differ in terms of permitted penetrance.

i.e. the logic can be adjusted to allow ClinVar pathogenic variants to pass despite inheritance from a parent, but for the de novo check, parental inheritance is disqualifying under all circumstances
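
As a rough illustration of that difference (a minimal sketch only, with hypothetical names; this is not the pipeline's actual MOI code):

```python
def passes_dominant_check(
    inherited_from_parent: bool, clinvar_pathogenic: bool, category: str
) -> bool:
    """Sketch of how the two checks could treat parental inheritance differently."""
    if category == 'de_novo':
        # parental inheritance is disqualifying under all circumstances
        return not inherited_from_parent
    if category == 'monoallelic':
        # a ClinVar pathogenic variant may pass despite being inherited,
        # allowing for reduced penetrance in the parent
        return (not inherited_from_parent) or clinvar_pathogenic
    raise ValueError(f'unknown category: {category}')
```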

MattWellie commented 2 years ago

It was previously found that Hail Query ran into serious problems computing compound hets, even on a small number of variants. Splitting the current Category 4 into two groups could hugely expand the number of variants available at that point in the analysis process:

  1. The de novo side will have a relatively mild barrier to entry, with the valuable part of the evaluation coming from the MOI test. Currently this is implemented outside of Hail, so there's a possibility that all missense variants could be classified this way.
  2. The in silico side would be unaffected, resembling the current implementation (though the name will be shifted to Category_Support to better reflect its value in the analysis).

Instead of this, more of the de novo logic should be completed in Hail: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.de_novo

The full MOI validation should be completed in Hail, with Category4 variants labelled only where the MOI has been confirmed
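
A minimal sketch of what that could look like, assuming `mt` is the annotated MatrixTable, `pedigree` is an `hl.Pedigree`, and an allele frequency annotation is available as a row-level float expression (the `info.AF[0]` field below is a placeholder):

```python
import hail as hl

# hl.de_novo returns a Table of putative de novo calls, each with a
# probability (p_de_novo) and a 'HIGH' / 'MEDIUM' / 'LOW' confidence rating
de_novo_ht = hl.de_novo(
    mt,
    pedigree,
    pop_frequency_prior=mt.info.AF[0],  # placeholder AF annotation
)

# keep only the HIGH confidence calls
de_novo_ht = de_novo_ht.filter(de_novo_ht.confidence == 'HIGH')
```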

MattWellie commented 2 years ago

PLAN! Hail implements a de novo function (hl.de_novo), which will make two new inputs mandatory (loading both is sketched after the list):

  1. A PED file in PLINK (.fam) format (see pull #1)
  2. A Hail Table containing all the population Allele Frequencies (same content we are using in annotation, from the gs://cpg-seqr-reference bucket)
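
For reference, loading those two inputs might look something like this (both paths are placeholders):

```python
import hail as hl

# 1. the pedigree, read from a PLINK-format .fam file
pedigree = hl.Pedigree.read('gs://my-bucket/cohort.fam')  # placeholder path

# 2. the population allele frequency reference table
af_ht = hl.read_table('gs://cpg-seqr-reference/example_frequencies.ht')  # placeholder table name
```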

The procedure should then be (sketched in code after the list):

  1. using the hl.Pedigree and the hl.Table of allele frequencies, generate a Hail Table of the de novo calls, complete with their associated confidence ratings and probabilities
  2. filter the Hail Table for the HIGH confidence calls only
  3. re-key the output table on only locus & alleles (matching the MatrixTable representation)
  4. aggregate all sample IDs per locus as an array (then compress as a string) (really, we'd be very surprised if more than one family shows the same de novo variant)
  5. annotate the sample ID string back onto the MatrixTable as a class flag (or 0 for all variants where no samples are de novo)
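
A hedged sketch of steps 3-5, continuing from a HIGH-confidence `de_novo_ht` like the one sketched above (the final annotation name `de_novo_samples` is illustrative only):

```python
import hail as hl

# 3. re-key the de novo table on locus & alleles, matching the MatrixTable row key
de_novo_ht = de_novo_ht.key_by('locus', 'alleles')

# 4. aggregate the sample IDs observed at each locus/alleles into an array
#    (more than one family sharing the same de novo call would be a surprise)
de_novo_ht = de_novo_ht.group_by(de_novo_ht.locus, de_novo_ht.alleles).aggregate(
    samples=hl.agg.collect(de_novo_ht.id)
)

# 5. annotate the sample IDs back onto the MatrixTable as a string flag,
#    using '0' wherever no samples carry a de novo call
mt = mt.annotate_rows(
    de_novo_samples=hl.or_else(
        hl.delimit(de_novo_ht[mt.row_key].samples, ','),
        '0',
    )
)
```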

MattWellie commented 2 years ago

Note: the AF table doesn't need to be added as a separate input; once the annotations are added to the MT, they can be used self-referentially as the frequency prior
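
i.e. if the MT already carries an AF annotation, the prior can come straight from the MT itself (the field name below is a placeholder):

```python
de_novo_ht = hl.de_novo(
    mt,
    pedigree,
    pop_frequency_prior=mt.gnomad_af,  # placeholder: AF annotation already present on the MT
)
```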

MattWellie commented 2 years ago

Note: this PR is currently blocked by what appears to be a bug in Hail Query (https://hail.zulipchat.com/#narrow/stream/223457-Hail-Batch-support/topic/OutOfMemoryError.20in.20ServiceBackend.2ElowerDistributedSort/near/281636886). Waiting for a response on this from the core Hail team