Make better use of Hail Features

The current (re-)annotation process involves:

- converting the MT to a VCF
- also converting the MT to a sites-only VCF
- fragmenting the sites-only VCF so that VEP can annotate the fragments in parallel
- annotating with VEP
- collating the annotations into a single Hail Table (HT)
- re-ingesting the VCF into a MT
- applying the annotations HT to the MT

This has been done to emulate the annotation cycle in the CPG pipeline, which starts from a VCF, fragments, annotates, and builds the complete dataset.

Whoa whoa whoa, that's a terrible idea.... Possible improvements:

- Don't generate a regular VCF at all. Instead, generate only a sites-only VCF, then apply the annotations directly back to the existing MT.
- Don't generate a whole-genome VCF and then fragment it by genome region. Instead, use Hail's parallel export to write a set number of VCF fragments directly. The parallel export should be MUCH faster.
- Implement region filtering prior to export (this has been done).
- Splitting a genome by region results in unbalanced fragments. Repartitioning the MT before a parallel export would give roughly equal-sized fragments.
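
As a rough sketch of the improvements listed above, the export and the final annotation step might look something like this. The function names, paths, and the `vep` field name are hypothetical and not taken from the pipeline. Hail is imported inside each function on the assumption that these would run as batch jobs.

```python
def export_sites_only_shards(mt_path: str, out_path: str, n_shards: int) -> None:
    """Write a sites-only VCF as n_shards fragments, skipping the full VCF."""
    import hail as hl  # lazy import: assumes this function is shipped to a batch job

    mt = hl.read_matrix_table(mt_path)
    # rows() keeps only the row (site) fields, i.e. a sites-only view of the MT
    sites = mt.rows()
    # repartition so each exported fragment holds a similar number of variants
    sites = sites.repartition(n_shards)
    # 'header_per_shard' writes one VCF per partition, each with its own header
    hl.export_vcf(sites, out_path, parallel='header_per_shard')


def apply_annotations(mt_path: str, vep_ht_path: str, out_path: str) -> None:
    """Join collated VEP annotations onto the original MT, with no VCF re-ingest."""
    import hail as hl  # lazy import, as above

    mt = hl.read_matrix_table(mt_path)
    vep_ht = hl.read_table(vep_ht_path)
    # join on the row key; 'vep' is an assumed name for the annotation field
    mt = mt.annotate_rows(vep=vep_ht[mt.row_key].vep)
    mt.write(out_path, overwrite=True)
```

Because the annotations are joined back onto the MT that produced the sites-only export, the "convert MT to VCF" and "re-ingest the VCF" steps drop out of the process entirely.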

Complications:

Hail's parallel export uses unpredictable file names, but it also writes a file that serves as a manifest. The simplest version of this becomes a 2-step process:
- A small method/PythonJob parses the manifest txt file, finds the Nth entry (depending on the job), and copies that fragment into the batch as a local file
- Annotate that fragment

It's a lot of re-writing for functionality we will rarely use. Is it worth it?
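
For scale, the manifest-parsing step could be as small as the sketch below. It assumes the manifest lists one fragment file name per line, relative to the export directory. That layout is an assumption and should be checked against a real export. The `nth_shard` helper is hypothetical.

```python
def nth_shard(manifest_text: str, index: int, export_dir: str) -> str:
    """Return the full path of the index-th (0-based) fragment in the manifest."""
    # assumed layout: one fragment file name per non-empty line
    shards = [line.strip() for line in manifest_text.splitlines() if line.strip()]
    if not 0 <= index < len(shards):
        raise IndexError(f'job index {index} out of range for {len(shards)} fragments')
    # fragment names are assumed to be relative to the export directory
    return f"{export_dir.rstrip('/')}/{shards[index]}"
```

Each annotation job would call this with its own index and then copy the returned path into the batch as its local input.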
