populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Simplified VEP annotation #806

Closed by MattWellie 2 days ago

MattWellie commented 6 days ago

Successful run here https://batch.hail.populationgenomics.org.au/batches/457244

This adds the component parts of a simplified annotation workflow: take a VCF, split it coarsely, annotate each fragment (writing an annotated VCF per fragment), and merge the results.
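To make the split / annotate / merge shape concrete, here is a minimal plain-Python sketch that works on in-memory VCF text. It is not the code in this PR: the function names (`split_vcf`, `annotate_fragment`, `merge_fragments`, `run_workflow`) are hypothetical, and the "annotation" is a placeholder that just tags each record, standing in for a real VEP call per fragment.

```python
from typing import Callable, List


def split_vcf(vcf_text: str, n_chunks: int) -> List[str]:
    """Coarsely split VCF text into at most n_chunks fragments, each carrying the full header."""
    if n_chunks < 1:
        raise ValueError('n_chunks must be >= 1')
    lines = vcf_text.splitlines()
    header = [ln for ln in lines if ln.startswith('#')]
    records = [ln for ln in lines if ln and not ln.startswith('#')]
    if not records:
        # nothing to split: return a single header-only fragment
        return ['\n'.join(header) + '\n']
    # ceiling division, so every record lands in exactly one fragment
    size = -(-len(records) // n_chunks)
    return [
        '\n'.join(header + records[i:i + size]) + '\n'
        for i in range(0, len(records), size)
    ]


def annotate_fragment(fragment: str) -> str:
    """Placeholder for the real annotation step: add a dummy INFO flag to each record."""
    out = []
    for ln in fragment.splitlines():
        if ln.startswith('#'):
            out.append(ln)
            continue
        fields = ln.split('\t')
        # INFO is the 8th VCF column; '.' means "no INFO"
        fields[7] = 'ANNOTATED' if fields[7] == '.' else fields[7] + ';ANNOTATED'
        out.append('\t'.join(fields))
    return '\n'.join(out) + '\n'


def merge_fragments(fragments: List[str]) -> str:
    """Concatenate fragments in order, keeping header lines from the first fragment only."""
    merged = []
    for idx, fragment in enumerate(fragments):
        for ln in fragment.splitlines():
            if idx > 0 and ln.startswith('#'):
                continue
            merged.append(ln)
    return '\n'.join(merged) + '\n'


def run_workflow(
    vcf_text: str,
    n_chunks: int,
    annotate: Callable[[str], str] = annotate_fragment,
) -> str:
    """Split, annotate each fragment independently, then merge back in the original order."""
    return merge_fragments([annotate(f) for f in split_vcf(vcf_text, n_chunks)])
```

Because each fragment is annotated independently, the middle step is the part that can be fanned out as one job per fragment; the merge only relies on fragments coming back in their original order.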

These are the building blocks for a ClinvArbitration workflow we can schedule to run regularly:

The current annotation logic in PP is sufficient, but it is a burden. By default we annotate using everything (literally --everything), plus a number of annotation plugins (AlphaMissense, LOFTEE) which are not relevant to this work. Those plugins are really slow, and VEP in general is really slow.
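As a rough illustration of the difference, the sketch below builds a VEP command line either with `--everything` or with a small explicit set of annotation flags and no plugins. The helper name `vep_command` is hypothetical, and the specific lean flag set (`--symbol`, `--canonical`, `--mane`) is an assumption chosen for the example, not the set this PR actually uses.

```python
import shlex
from typing import List


def vep_command(input_vcf: str, output_vcf: str, everything: bool = False) -> str:
    """Build a VEP command string; the lean flag set below is illustrative only."""
    args: List[str] = [
        'vep',
        '--format', 'vcf',             # input is VCF
        '--vcf',                       # write VCF output
        '--offline', '--cache',        # use a local cache, no database connection
        '--input_file', input_vcf,
        '--output_file', output_vcf,
        '--compress_output', 'bgzip',
    ]
    if everything:
        # shortcut that switches on a large bundle of annotations at once
        args.append('--everything')
    else:
        # assumed minimal selection: gene symbol plus canonical and MANE transcript flags
        args += ['--symbol', '--canonical', '--mane']
    return shlex.join(args)
```

Calling `vep_command('in.vcf.gz', 'out.vcf.gz')` gives the lean form; passing `everything=True` gives the heavyweight one for comparison.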

With these changes I can get a ~200k variant VCF of SNVs split, annotated, and re-merged in 4 mins. EZ.

Follow-on from this would be putting in a ClinvArbitration workflow to run through the whole process:

I've also abstracted the generic 'take my VCFs and make one VCF' process into its own method. We use that here, in gCNV, in seqr-loader, and in the long read pipeline. We should write common generic methods like this once, in a Hail-Batch'y way, and re-use them... I think
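For a sense of what a shared 'many VCFs in, one VCF out' helper could look like, here is a minimal sketch that only builds the shell command. The name `concat_vcfs_command` is hypothetical and this is not the method added in this PR; it assumes the inputs are already ordered, non-overlapping fragments with the same samples, so a plain `bcftools concat` followed by `tabix` indexing is enough.

```python
import shlex
from typing import Sequence


def concat_vcfs_command(inputs: Sequence[str], output: str) -> str:
    """Build a command that concatenates ordered VCF fragments into one bgzipped, indexed VCF."""
    if not inputs:
        raise ValueError('need at least one input VCF')
    # -Oz writes bgzip-compressed VCF; inputs are concatenated in the order given
    concat = ['bcftools', 'concat', '-Oz', '-o', output, *inputs]
    # index the merged file so downstream tools can query it by region
    index = ['tabix', '-p', 'vcf', output]
    return f'{shlex.join(concat)} && {shlex.join(index)}'
```

In a Hail Batch setting the returned string would presumably be handed to a job's `command(...)` call, which keeps the concat logic in one place for every pipeline that needs it.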