This adds the component parts of a simplified annotation workflow: take a VCF, coarsely split it, annotate each section (writing an annotated VCF for each fragment), and merge the results.
These are the building blocks for a ClinvArbitration workflow we can schedule to run regularly:
re-summarise ClinVar data using different decision logic
annotate pathogenic SNVs (this PR)
using the protein annotation, produce a PM5 resource (evidence category for 'if this missense is pathogenic, we treat other AA changes at the same codon with suspicion')
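As a rough illustration of what the PM5 resource could look like, here is a minimal sketch. It is not the code in this PR: the record fields (`transcript`, `codon`, `aa_change`, `consequence`, `significance`) and both function names are hypothetical, chosen only to show the idea of indexing pathogenic missense changes by codon.

```python
from collections import defaultdict


def build_pm5_table(variants):
    """Index pathogenic missense changes by (transcript, codon).

    `variants` is an iterable of dicts; the keys used here are
    hypothetical stand-ins for whatever the annotated VCF provides.
    """
    table = defaultdict(set)
    for var in variants:
        # Only pathogenic missense variants contribute PM5 evidence
        if var["consequence"] != "missense_variant":
            continue
        if var["significance"] != "Pathogenic":
            continue
        table[(var["transcript"], var["codon"])].add(var["aa_change"])
    return dict(table)


def pm5_applies(table, transcript, codon, aa_change):
    """True if a *different* pathogenic missense is known at this codon."""
    known = table.get((transcript, codon), set())
    return bool(known - {aa_change})
```

A query for a new missense change then only needs the transcript, codon, and amino acid change; a change that is itself the known pathogenic one does not count as PM5 evidence.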
The current annotation logic in PP is sufficient, but it is a burden: by default we annotate using everything (literally `--everything`), plus a number of annotation plugins (AlphaMissense, LOFTEE) which are not relevant to this work. Those plugins are really slow, and VEP in general is really slow.
With these changes I can get a ~200k variant VCF of SNVs split, annotated, and re-merged in 4 mins. EZ.
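The split, annotate, merge shape can be sketched in plain Python as below. This is an illustration only, working on in-memory VCF lines rather than files in a batch job; the function names and the `annotate` callable are hypothetical and stand in for the real per-fragment annotation step.

```python
def split_vcf_lines(lines, records_per_fragment):
    """Coarsely split VCF lines into fragments, each carrying the full header."""
    if records_per_fragment < 1:
        raise ValueError("records_per_fragment must be at least 1")
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]
    return [
        header + records[start:start + records_per_fragment]
        for start in range(0, len(records), records_per_fragment)
    ]


def merge_vcf_fragments(fragments):
    """Merge fragments in order, keeping the header of the first one only."""
    if not fragments:
        return []
    merged = [line for line in fragments[0] if line.startswith("#")]
    for fragment in fragments:
        merged.extend(line for line in fragment if not line.startswith("#"))
    return merged


def split_annotate_merge(lines, records_per_fragment, annotate):
    """Run a caller-supplied `annotate` over each fragment, then merge."""
    fragments = split_vcf_lines(lines, records_per_fragment)
    return merge_vcf_fragments([annotate(fragment) for fragment in fragments])
```

Because each fragment is independent, the annotation calls are the part that can be fanned out as parallel jobs; the split and merge steps stay cheap.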
Follow-on from this would be putting in a ClinvArbitration workflow to run through the whole process:
collect the latest ClinVar
re-summarise
annotate
make PM5 table
write those results into GCP for use in Talos
... eventually also zip and bundle that data into a release on the ClinvArbitration repo so we can share these results in the community
I've also abstracted the generic 'take my VCFs and make one VCF' process into its own method - we use that here, in gCNV, in seqr-loader, in the long read pipeline. We should write common generic methods like this once in a Hail-Batch'y way, and re-use... I think
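For illustration, a generic 'many VCFs in, one VCF out' helper might reduce to building a single shell command that any job can run. The sketch below assumes `bcftools concat` followed by `tabix` indexing; the source does not say which tool the method in this PR actually uses, and the function name is hypothetical.

```python
import shlex


def concat_vcfs_command(input_vcfs, output_vcf):
    """Build a shell command joining ordered VCF fragments into one bgzipped VCF.

    Assumes the fragments share the same samples and are passed in genomic
    order, which is what plain `bcftools concat` expects.
    """
    if not input_vcfs:
        raise ValueError("at least one input VCF is required")
    inputs = " ".join(shlex.quote(path) for path in input_vcfs)
    output = shlex.quote(output_vcf)
    return (
        f"bcftools concat --output-type z --output {output} {inputs}"
        f" && tabix -p vcf {output}"
    )
```

Keeping the helper to a command string means the same function can be reused wherever a job needs to stitch fragments together, independent of how the fragments were produced.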
Successful run here: https://batch.hail.populationgenomics.org.au/batches/457244