populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP

Re-work VEP content #288

Closed by MattWellie 1 year ago

MattWellie commented 1 year ago

Fixes

Issue

chr1:12345 C>T
chr1:12345 C>TGAGTAGATC
chr1:12345 C>A

Proposed Changes

Takes all of the methods we use from the cpg_workflows library and duplicates them in AIP, with a few changes:

Correctly annotating variants that have been split out into individual alleles is a prerequisite for using the re-annotation feature, and for generating new PM5/Annotated ClinVar datasets.

Checklist

cassimons commented 1 year ago

Does this mean we will still be using the VEP data from the prod pipeline if it is available, or will we need to re-run VEP for each dataset?

MattWellie commented 1 year ago

Does this mean we will still be using the VEP data from the prod pipeline if it is available

Yup! The first pass through is fine because GATK calls in a multi-allelic way and the process is designed to complement that. The issue only crops up because the splitting is non-reversible, so any subsequent re-annotations are on differently formatted starting data.
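To make the splitting point concrete, here is a minimal sketch in plain Python. It is an illustration only, not the AIP or cpg_workflows code: `Variant` and `split_multiallelic` are hypothetical names, and no trimming or left-alignment is attempted. It uses the three alleles from the Issue section to show that, once a multi-allelic call is split, the rows are only distinguishable if the alleles are part of the key.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for one biallelic variant."""
    chrom: str
    pos: int
    ref: str
    alt: str


def split_multiallelic(chrom: str, pos: int, ref: str, alts: list[str]) -> list[Variant]:
    """Return one biallelic Variant per ALT allele (no trimming or left-alignment)."""
    return [Variant(chrom, pos, ref, alt) for alt in alts]


# The three alleles from the Issue section, as a single multi-allelic call
variants = split_multiallelic("chr1", 12345, "C", ["T", "TGAGTAGATC", "A"])

# Keyed on locus alone, the split rows overwrite each other
by_locus = {(v.chrom, v.pos): v for v in variants}

# Keyed on locus plus alleles, every split row keeps its own entry
by_allele = {v: v for v in variants}

print(len(by_locus))   # 1
print(len(by_allele))  # 3
```

Whether a given annotation step keys on locus or on locus plus alleles is an assumption to check against the real code; the sketch only shows why the distinction matters after splitting.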

MattWellie commented 1 year ago

To check that this annotation implementation works, I've taken a subset of the schr-neuro dataset (chr17 only).

The original whole genome results are here: https://main-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-03-13/summary_output.html

The subset results are here: https://test-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-05-03/summary_output.html

All chr17 variants are presented the same way, with identical annotation. I'd also like to re-run the ClinVar annotation to check that the original issue is solved. There are lots of variants which share alleles in that dataset, so it should be a good test case. The only issue is how dodgy the ClinVar VCF formatting is...
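For anyone wanting to repeat the chr17 comparison programmatically rather than by eye, a rough sketch follows. It assumes each run's results have already been loaded into a dict keyed on `(chrom, pos, ref, alt)`; that layout, the function name `differing_annotations`, and the example values are all hypothetical and do not reflect the actual report format.

```python
_MISSING = object()  # sentinel so "absent" differs from any real annotation value


def differing_annotations(original: dict, subset: dict, chrom: str) -> list:
    """Return variant keys on `chrom` whose annotation differs, or is absent, between runs."""
    keys = {k for k in original.keys() | subset.keys() if k[0] == chrom}
    return sorted(
        k for k in keys
        if original.get(k, _MISSING) != subset.get(k, _MISSING)
    )


# Made-up example data, keyed on (chrom, pos, ref, alt)
whole_genome = {
    ("chr17", 100, "C", "T"): {"consequence": "missense_variant"},
    ("chr17", 200, "G", "A"): {"consequence": "stop_gained"},
    ("chr1", 12345, "C", "A"): {"consequence": "synonymous_variant"},
}
chr17_only = {
    ("chr17", 100, "C", "T"): {"consequence": "missense_variant"},
    ("chr17", 200, "G", "A"): {"consequence": "stop_gained"},
}

mismatches = differing_annotations(whole_genome, chr17_only, "chr17")
print(mismatches)  # []
```

An empty list means every chr17 variant carries the same annotation in both runs; variants on other chromosomes are ignored.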