Apply VEP data to variant data indexed only on Chr:Pos (partial key)
Split multiallelic data out into separate variant rows (Hail's splitting function knows how to partition the VEP content correctly)
...
PROFIT!
The AIP re-annotation cycle starts from the post-split data generated by this pipeline, where two different alt alleles at the same locus occupy two different rows. Because multiple rows can share a locus, the partial-key join between VEP and variant data breaks: the index clashes, and the annotation from one row is wrongly applied to any other variant at the same site.
This has been seen in re-annotated AIP data (though this feature is rarely used internally), and in the re-annotation of the ClinVar VCF.
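To make the failure mode concrete, here is a minimal sketch in plain Python rather than Hail. The row layout, the `join_on_locus` helper and the example consequences are all made up for illustration, and which annotation "wins" in the real join may differ; the point is only that a locus-only lookup cannot tell two alt alleles apart.

```python
# Plain-Python stand-in for the partial-key join; not Hail code.
# All names and values here are illustrative assumptions.
split_rows = [
    {"locus": ("chr1", 1000), "alleles": ("A", "C")},
    {"locus": ("chr1", 1000), "alleles": ("A", "G")},
]
vep_rows = [
    {"locus": ("chr1", 1000), "alleles": ("A", "C"), "consequence": "missense_variant"},
    {"locus": ("chr1", 1000), "alleles": ("A", "G"), "consequence": "synonymous_variant"},
]


def join_on_locus(variants, vep):
    """Annotate each variant by looking up VEP rows on locus only (partial key)."""
    by_locus = {}
    for row in vep:
        # Only one annotation can be stored per locus; here the first one seen is kept.
        by_locus.setdefault(row["locus"], row["consequence"])
    return [{**v, "consequence": by_locus.get(v["locus"])} for v in variants]


annotated = join_on_locus(split_rows, vep_rows)
for row in annotated:
    print(row["alleles"], row["consequence"])
```

Both rows end up with the same consequence, so the A>G variant carries the annotation that belongs to A>C.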
The only solution I can think of is to copy the chunks of cpg_workflows code relating to annotation and alter how the data is pieced back together. That shouldn't be as bad as it sounds: the current code includes a ton of conditionals and switches for applying VQSR or running in Dataproc, which this version can drop, so it should be relatively minimal.
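One way the data could be pieced back together is to join on the full variant key (locus plus alleles) instead of the locus alone. The sketch below is plain Python, not the cpg_workflows or Hail code, and `join_on_locus_and_alleles` and the row layout are hypothetical names used only to illustrate the idea.

```python
# Plain-Python sketch of a full-key join; not Hail or cpg_workflows code.
# All names and values here are illustrative assumptions.
split_variants = [
    {"locus": ("chr1", 1000), "alleles": ("A", "C")},
    {"locus": ("chr1", 1000), "alleles": ("A", "G")},
]
vep_annotations = [
    {"locus": ("chr1", 1000), "alleles": ("A", "C"), "consequence": "missense_variant"},
    {"locus": ("chr1", 1000), "alleles": ("A", "G"), "consequence": "synonymous_variant"},
]


def join_on_locus_and_alleles(variants, vep):
    """Annotate each variant by looking up VEP rows on (locus, alleles)."""
    # The full key is unique per post-split row, so no two variants collide.
    by_variant = {(r["locus"], r["alleles"]): r["consequence"] for r in vep}
    return [
        {**v, "consequence": by_variant.get((v["locus"], v["alleles"]))}
        for v in variants
    ]


rejoined = join_on_locus_and_alleles(split_variants, vep_annotations)
for row in rejoined:
    print(row["alleles"], row["consequence"])
```

With the alleles included in the key, each post-split row picks up its own annotation even when several rows share a locus.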
See #269
Firstly, it is safe for the main pipeline to use Chr:Pos (partial key), since that join happens before multiallelic sites are split into separate rows.