populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

CPG annotation process isn't safe to use #287

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

See #269

Firstly, the annotation process is safe for the main pipeline, which runs these steps in order:

  1. Generate a joint-called dataset
  2. Run VEP on multiallelic data
  3. Apply VEP data to variant data indexed only on Chr:Pos (partial key)
  4. Split multiallelic data out into separate variant rows (Hail's splitting function knows how to partition the VEP content correctly)
  5. ...
  6. PROFIT!

The AIP re-annotation cycle starts from the post-split data generated by this pipeline; if two different alt alleles are present at the same locus, they occupy two separate rows. Because multiple rows can share a locus, the partial-key (Chr:Pos) join between VEP and variant data no longer identifies a unique row - this leads to indexing clashes, and the annotation from one row is wrongly applied to any other variants at the same site.
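To make the clash concrete, here is a minimal plain-Python sketch. It is an analogy only: it does not use Hail or the cpg_workflows code, and the coordinates, alleles, and consequence strings are made-up illustrative values. It shows two post-split rows at one locus being annotated through an index keyed only on Chr:Pos.

```python
# Hypothetical post-split variant rows: (chrom, pos, ref, alt).
# Two different alt alleles at the same locus, so two rows share Chr:Pos.
split_variants = [
    ("chr1", 1000, "A", "T"),
    ("chr1", 1000, "A", "G"),
]

# Hypothetical VEP results, one per allele (consequences are invented).
vep_rows = [
    ("chr1", 1000, "A", "T", "missense_variant"),
    ("chr1", 1000, "A", "G", "synonymous_variant"),
]

# Index VEP on the partial key (chrom, pos) only.
# The second row silently overwrites the first because the key is not unique.
vep_by_locus = {}
for chrom, pos, ref, alt, consequence in vep_rows:
    vep_by_locus[(chrom, pos)] = consequence

# Join back onto the variant rows using the same partial key.
annotated = {
    (chrom, pos, ref, alt): vep_by_locus[(chrom, pos)]
    for chrom, pos, ref, alt in split_variants
}

for variant, consequence in annotated.items():
    print(variant, consequence)
```

In this toy version both rows end up with the same consequence, so the A>T variant carries the annotation that belongs to A>G. Which row "wins" in a real Hail join may differ; the point is only that a non-unique key cannot route annotations to the right allele.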

This has been seen in re-annotated AIP data (though this feature is rarely used internally), and in the re-annotation of the ClinVar VCF.

The only solution I can think of is to copy the chunks of cpg_workflows code relating to annotation, and alter how the data is pieced back together. That shouldn't be as bad as it sounds - the current code includes a ton of conditionals and switches for applying VQSR, or running in Dataproc. The copied version can drop those and should be relatively minimal.
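As a rough illustration of what "piecing the data back together" differently could look like, the sketch below joins on the full variant key (chrom, pos, ref, alt) instead of Chr:Pos. This is plain Python with a hypothetical helper name (`annotate_on_full_key`) and invented data; it is not the cpg_workflows implementation, and it assumes VEP output is available per allele.

```python
def annotate_on_full_key(variants, vep_rows):
    """Attach VEP consequences using the full (chrom, pos, ref, alt) key.

    Hypothetical helper for illustration only. Raises if the VEP input
    contains the same full key twice, rather than silently overwriting.
    """
    vep_by_variant = {}
    for chrom, pos, ref, alt, consequence in vep_rows:
        key = (chrom, pos, ref, alt)
        if key in vep_by_variant:
            raise ValueError(f"duplicate VEP annotation for {key}")
        vep_by_variant[key] = consequence

    # Variants without a matching annotation get None instead of a
    # neighbouring allele's annotation.
    return {variant: vep_by_variant.get(variant) for variant in variants}


# Invented example data: two alt alleles at one locus, plus an
# unannotated variant elsewhere.
example_variants = [
    ("chr1", 1000, "A", "T"),
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "C", "CT"),
]
example_vep = [
    ("chr1", 1000, "A", "T", "missense_variant"),
    ("chr1", 1000, "A", "G", "synonymous_variant"),
]

result = annotate_on_full_key(example_variants, example_vep)
for variant, consequence in result.items():
    print(variant, consequence)
```

With the full key, each allele keeps its own annotation, and a missing annotation shows up as an explicit gap rather than a borrowed value.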