Closed MattWellie closed 1 year ago
Does this mean we will still be using the VEP data from the prod pipeline if it is available, or will we need to re-run VEP for each dataset?
> Does this mean we will still be using the VEP data from the prod pipeline if it is available
Yup! The first pass through is fine because GATK calls in a multi-allelic way and the process is designed to complement that. The issue only crops up because the splitting is non-reversible, so any subsequent re-annotations are on differently formatted starting data.
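To make the splitting point concrete, here is a minimal pure-Python sketch of what happens to a multi-allelic record once it is split. It is not the AIP or cpg_workflows code: the `Variant` type and `split_multiallelic` helper are hypothetical names for illustration, and real splitters also handle genotypes, INFO fields and allele normalisation, all of which are ignored here.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical, stripped-down variant record (illustration only)."""

    chrom: str
    pos: int
    ref: str
    alts: tuple[str, ...]


def split_multiallelic(variant: Variant) -> list[Variant]:
    """Return one biallelic record per ALT allele (no trimming or normalisation)."""
    return [
        Variant(variant.chrom, variant.pos, variant.ref, (alt,))
        for alt in variant.alts
    ]


# One multi-allelic call, as a joint caller might emit it
multi = Variant("chr17", 1000, "A", ("C", "G"))
split = split_multiallelic(multi)

# After splitting, the same locus appears once per ALT allele,
# so "one row per locus" no longer holds for the dataset.
loci = [(v.chrom, v.pos) for v in split]
locus_is_unique = len(loci) == len(set(loci))
```

Under these assumptions `locus_is_unique` ends up false, which is the situation any later re-annotation step has to cope with.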
To check that this annotation implementation works, I've taken a subset of the schr-neuro dataset (chr17 only).
The original whole genome results are here: https://main-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-03-13/summary_output.html
The subset results are here: https://test-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-05-03/summary_output.html
All chr17 variants are presented the same, with identical annotation. I'd also like to re-run the ClinVar annotation to check that the original issue is solved. There are many variants which share alleles in that dataset, so it should be a good test case. The only issue is how dodgy the ClinVar VCF formatting is...
Fixes
Issue
cpg_workflows expects multiallelic data, such that each locus only appears once per dataset.
Proposed Changes
Takes all of the methods we use from the cpg_workflows library and duplicates them into AIP with a few changes:
Being able to correctly annotate variants split out into individual alleles is a prerequisite for using the re-annotation feature, and for being able to generate new PM5/Annotated ClinVar datasets
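As a rough illustration of why split data needs allele-aware handling, the sketch below contrasts a lookup keyed on locus alone with one keyed on locus plus alleles. This is an assumption-laden toy, not the method used in this PR: the `annotations` table, the `split_rows` list and the consequence values are all made up for the example.

```python
# Hypothetical annotation table keyed on (chrom, pos, ref, alt); values are made up
annotations = {
    ("chr17", 1000, "A", "C"): "missense_variant",
    ("chr17", 1000, "A", "G"): "synonymous_variant",
}

# Two biallelic rows that came from one multi-allelic site
split_rows = [("chr17", 1000, "A", "C"), ("chr17", 1000, "A", "G")]

# Keying on locus alone: the second allele silently overwrites the first
by_locus = {}
for chrom, pos, ref, alt in split_rows:
    by_locus[(chrom, pos)] = annotations[(chrom, pos, ref, alt)]

# Keying on locus plus alleles: each split row keeps its own annotation
by_allele = {row: annotations[row] for row in split_rows}
```

In this toy, `by_locus` keeps a single entry for the site while `by_allele` keeps both, which is the behaviour re-annotation of split variants depends on.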
Checklist