MattWellie commented 1 year ago

Fixes

Relevant for #287 #269 #281 #268

Issue

The main pipeline's annotation code doesn't work for the AIP use-case.
The annotation code in cpg_workflows expects multiallelic data, such that each locus appears only once per dataset.
VEP is executed on multiallelic data and joined with Locus as the only common key; the variants are then split so that each 'row' carries a single alt allele. Hail accurately partitions the VEP annotations corresponding to each alt allele.
AIP imports from this annotation library and allows variant data to be annotated/re-annotated. As this data is (in our use-case) already split into individual alt alleles, the Locus-only join can cause problems: all alt alleles at the same locus have the same annotation applied to each. For example, in this situation the annotation from one of these variants would be wrongly applied to all 3:

chr1:12345 C>T
chr1:12345 C>TGAGTAGATC
chr1:12345 C>A
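
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the cpg_workflows or Hail code: dictionaries stand in for the keyed tables, and the annotation labels (`csq_T` and so on) are placeholders invented for illustration. It only shows why a key made of the locus alone cannot tell the three split variants apart.

```python
# Toy model of a locus-only join. Plain dicts stand in for keyed tables;
# this is an illustration, not the real pipeline code.

# The three already-split variants from the example above: (locus, ref, alt).
split_variants = [
    ("chr1:12345", "C", "T"),
    ("chr1:12345", "C", "TGAGTAGATC"),
    ("chr1:12345", "C", "A"),
]

# Hypothetical per-allele annotations (placeholder labels, not real VEP output).
vep_rows = [
    ("chr1:12345", "C", "T", "csq_T"),
    ("chr1:12345", "C", "TGAGTAGATC", "csq_TGAGTAGATC"),
    ("chr1:12345", "C", "A", "csq_A"),
]

# Keying on locus alone: rows sharing a locus collide, and only one survives.
by_locus = {locus: csq for locus, _ref, _alt, csq in vep_rows}

# Every split variant at the locus now picks up the same surviving annotation.
annotated = [
    (locus, ref, alt, by_locus[locus]) for locus, ref, alt in split_variants
]
distinct_annotations = {row[3] for row in annotated}

for row in annotated:
    print(row)
```

Which annotation wins depends on the join implementation; in this toy version it is simply the last row seen for the locus.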

Proposed Changes

Takes all of the methods we use from the cpg_workflows library and duplicates them into AIP with a few changes:

No VQSR data applied
VEP data is indexed and joined on both locus and alleles, meaning that it can safely annotate a split VCF/MT
None of the fields computed by the Hail scripts (e.g. sortedTranscriptConsequences) are handled, because AIP doesn't use them. This should make the runtime and data footprint much smaller
No switches to enable running in dataproc, or sorting/tabix'ing an intermediate VCF
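
As a rough sketch of the locus-and-alleles join described in the list above, the snippet below uses a composite key in plain Python. `VariantKey` and `annotate` are hypothetical names made up for this example, and the annotation strings are placeholders; the real change works on Hail tables rather than dicts.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class VariantKey(NamedTuple):
    """Composite key: a locus plus its (ref, alt) alleles. Hypothetical helper."""

    locus: str
    alleles: Tuple[str, str]


def annotate(
    variants: Iterable[VariantKey],
    vep_rows: Iterable[Tuple[str, Tuple[str, str], str]],
) -> Dict[VariantKey, Optional[str]]:
    """Look up each variant's annotation by locus AND alleles.

    Variants with no matching annotation row map to None.
    """
    lookup = {VariantKey(locus, alleles): ann for locus, alleles, ann in vep_rows}
    return {key: lookup.get(key) for key in variants}


# Placeholder annotations for two of the three alleles at the same locus.
vep_rows = [
    ("chr1:12345", ("C", "T"), "csq_T"),
    ("chr1:12345", ("C", "A"), "csq_A"),
]
variants = [
    VariantKey("chr1:12345", ("C", "T")),
    VariantKey("chr1:12345", ("C", "TGAGTAGATC")),
    VariantKey("chr1:12345", ("C", "A")),
]

result = annotate(variants, vep_rows)
```

With the alleles in the key, each split variant can only match its own annotation row, and a variant without one is left unannotated instead of borrowing a neighbour's.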

Being able to correctly annotate variants split out into individual alleles is a prerequisite for using the re-annotation feature, and for being able to generate new PM5/Annotated ClinVar datasets.

Checklist

[x] Related Issue created
[ ] Tests covering new change
[ ] Linting checks pass

cassimons commented 1 year ago

Does this mean we will still be using the VEP data from the prod pipeline if it is available, or will we need to re-run VEP for each dataset?

MattWellie commented 1 year ago

> Does this mean we will still be using the VEP data from the prod pipeline if it is available

Yup! The first pass through is fine because GATK calls in a multi-allelic way and the process is designed to complement that. The issue only crops up because the splitting is non-reversible, so any subsequent re-annotations are on differently formatted starting data.

MattWellie commented 1 year ago

To check that this annotation implementation works, I've taken a subset of the schr-neuro dataset (chr17 only).

The original whole genome results are here: https://main-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-03-13/summary_output.html

The subset results are here: https://test-web.populationgenomics.org.au/schr-neuro/reanalysis/2023-05-03/summary_output.html

All chr17 variants are presented the same, with identical annotation. I'd also like to re-run the ClinVar annotation to check that the original issue is solved. There are tons of variants which share alleles in that dataset, so it should be a good test case. The only issue is how dodgy the ClinVar VCF formatting is...

populationgenomics / automated-interpretation-pipeline

Re-work VEP content #288
