Current (re-)annotation process involves:
fragmenting the sites-only VCF to annotate with VEP in parallel
annotating with VEP
collating annotations into a single Hail Table
re-ingesting the VCF into an MT
applying the annotations HT to the MT
This has been done to emulate the annotation cycle in the CPG pipeline, which starts from a VCF, fragments it, annotates each fragment, and builds the complete dataset.
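To make that cycle concrete, here is a minimal, hedged sketch of what the steps look like in Hail; the bucket paths, fragment count, keying of the annotation tables, and the external VEP step are all assumptions rather than the pipeline's actual code.

```python
import hail as hl

N_FRAGMENTS = 50  # hypothetical fragment count

# fragments of the sites-only VCF are annotated with VEP outside Hail (not shown);
# assume each fragment's annotations were ingested into its own keyed Hail Table

# collate the per-fragment annotation tables into a single Hail Table
fragments = [hl.read_table(f'gs://bucket/vep/fragment_{i}.ht') for i in range(N_FRAGMENTS)]
anno_ht = fragments[0].union(*fragments[1:])

# re-ingest the original VCF into a MatrixTable
mt = hl.import_vcf('gs://bucket/dataset.vcf.bgz', reference_genome='GRCh38')

# apply the annotations HT onto the MT, joining on the (locus, alleles) row key
mt = mt.annotate_rows(vep=anno_ht[mt.row_key].vep)
mt.write('gs://bucket/annotated.mt', overwrite=True)
```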
Whoa whoa whoa, that's a terrible idea.... Possible improvements:
Don't generate a regular VCF at all; skip that step completely. Instead, generate a sites-only VCF and apply the annotations directly back onto the MT (see the sketch after this list)
Don't generate a whole-genome VCF and then fragment it by genome region. Instead, use Hail's parallel export to generate a set number of VCF fragments; this parallel export should be MUCH faster
Implement region filtering prior to export (this has been done)
Splitting a genome by region results in unbalanced partitions; an MT repartition -> parallel export would produce equal-sized fragments
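A hedged sketch of how these improvements could fit together: filter to regions, repartition, parallel-export a sites-only VCF, and later join the collated annotations straight back onto the MT instead of re-ingesting a VCF. The paths, shard count, and example intervals are assumptions.

```python
import hail as hl

mt = hl.read_matrix_table('gs://bucket/dataset.mt')  # hypothetical path

# region filtering prior to export (already implemented)
intervals = [hl.parse_locus_interval(region, reference_genome='GRCh38')
             for region in ['chr20', 'chr21']]  # example regions only
mt = hl.filter_intervals(mt, intervals)

# repartition, then parallel-export a sites-only VCF: one headered shard per
# partition, so the fragments are roughly equal-sized regardless of region
sites = mt.rows().select().repartition(50)  # hypothetical shard count
hl.export_vcf(sites, 'gs://bucket/sites_only.vcf.bgz', parallel='header_per_shard')

# ... VEP runs over the exported shards and the results are collated into a HT ...

# apply the annotations directly back onto the original MT - no VCF re-ingest
anno_ht = hl.read_table('gs://bucket/vep_annotations.ht')
mt = mt.annotate_rows(vep=anno_ht[mt.row_key].vep)
mt.write('gs://bucket/annotated.mt', overwrite=True)
```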
Complications:
Hail's parallel export uses unpredictable file names - there is a file written alongside the shards that serves as a manifest. The simplest version of this becomes a 2-step process (sketched after this list):
A small method/PythonJob to parse the manifest txt file, find the Nth entry (depending on the job), and copy that fragment into the batch as a local file
Annotate that fragment
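For completeness, a rough sketch of that 2-step process under Hail Batch, assuming the parallel export writes a plain-text manifest listing one shard name per line; the manifest filename, bucket layout, shard count, and the placeholder VEP command are all assumptions.

```python
import hailtop.batch as hb


def nth_shard(manifest_path: str, n: int) -> str:
    """Parse the shard manifest and return the path of the Nth exported fragment."""
    with open(manifest_path) as handle:
        shards = [line.strip() for line in handle if line.strip()]
    # hypothetical layout: shard names are relative to the export directory
    return f'gs://bucket/sites_only.vcf.bgz/{shards[n]}'


b = hb.Batch(name='annotate-fragments')
# manifest filename is an assumption - check what your Hail version actually writes
manifest = b.read_input('gs://bucket/sites_only.vcf.bgz/shard-manifest.txt')

for i in range(50):  # hypothetical shard count
    # step 1: a small PythonJob resolves the Nth manifest entry at runtime
    resolve = b.new_python_job(name=f'resolve-shard-{i}')
    shard_path = resolve.call(nth_shard, manifest, i)

    # step 2: annotate that fragment; a real job would localise the shard and
    # run VEP on it - an echo stands in for the VEP command here
    annotate = b.new_job(name=f'vep-{i}')
    annotate.command(f'echo "would copy and VEP-annotate $(cat {shard_path.as_str()})"')

b.run()
```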
It's a lot of re-writing for functionality we will rarely use; is it worth it?