No Really, Remove Hail Reliance?

Current workflow:

Start with/create a Hail MatrixTable of annotated variant data
- If we start with un-annotated data we run VEP on VCF-format data, and rebuild the annotations & variants into a MatrixTable
Reduce this annotated data by identifying plausibly interesting variants
Further review the interesting variants by assessing them against the mode of inheritance (MOI) relevant to each gene in the analysis

*Note: workflow orchestration is all done using Hail Batch

Hail's MT format provides massively parallelised analysis, allowing whole-genome analysis in reasonable time
Our pipeline already generates MT data and relies on Hail, so the burden for us is minimal

Very few other sites use Hail Batch/Query, so the hurdles to deploying in other settings are huge
Reliance on Hail-specific data structures means that without Hail & associated infrastructure, this is unusable for anyone else
Everything down to our cpg-utils function library is reliant on Hail

The Massive Re-Write Proposal 😓

Key decision drivers:

Tools have to be in common use within the industry, or be something that can be installed/deployed without too much fuss
Setup burden for non-CPG sites should be minimal
Results should be equivalent to the current implementation
Costs should not be astronomical (in absolute terms, or compared to the current version)

Workflow (assuming annotation is required)

Take VCF data and fragment into intervals (similar to current process)
Scatter VEP tasks, each converting a VCF -> annotated VCF for one interval
On each annotated VCF fragment run a labelling workflow (plain Python implementation of the current process, reading variant data in, filtering/labelling, and writing a new minimised VCF output)
Gather all labelled fragments into a single VCF
Continue on to existing downstream process, no modification required
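
The label and gather steps above could look something like the sketch below. This is a rough illustration only, not the current implementation: the `INTERESTING` set, the `LABEL` INFO tag, and the function names are all made up for the example, and it assumes the consequence term sits at index 1 of each pipe-delimited CSQ entry (the real position is defined by the CSQ header line VEP writes). A proper version would also add a header line describing the new INFO tag and make sure fragments are gathered in genomic order.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical set of consequence terms treated as "plausibly interesting";
# the real labelling criteria are not reproduced here.
INTERESTING = {"stop_gained", "frameshift_variant", "missense_variant"}


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value dict (flags map to '')."""
    fields: Dict[str, str] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def label_fragment(lines: Iterable[str], csq_index: int = 1) -> Iterator[str]:
    """Pass header lines through, and keep only variant lines that carry an
    interesting consequence, appending a LABEL tag to their INFO column."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        info = parse_info(cols[7])
        hits = set()
        # CSQ holds comma-separated transcript entries, each pipe-delimited
        for transcript in info.get("CSQ", "").split(","):
            parts = transcript.split("|")
            if len(parts) > csq_index:
                # multiple consequences on one transcript are joined with '&'
                hits.update(set(parts[csq_index].split("&")) & INTERESTING)
        if hits:
            cols[7] += ";LABEL=" + ",".join(sorted(hits))
            yield "\t".join(cols)


def gather(fragments: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate labelled fragments, keeping the header of the first only."""
    merged: List[str] = []
    for i, fragment in enumerate(fragments):
        for line in fragment:
            if line.startswith("#") and i > 0:
                continue
            merged.append(line)
    return merged
```

Each `label_fragment` call only needs one fragment's lines, so it fits the same per-interval scatter as the VEP step, and `gather` is the only part that sees everything.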

Hail-specific things that would need changing

De novo detection workflow, currently using a Hail built-in function. This wouldn't be too hard to emulate, and probably shouldn't be a decision driver
On-the-fly annotation using a custom version of ClinVar - no biggie, we can take the same ClinVar decisions and add them as an additional annotation source for VEP instead of applying them after annotation. This is likely to only affect us, and we probably won't be using this version of the AIP
PM5 detection - this can be done using a lookup dictionary, but it would be pretty resource intensive, so we may need to investigate how realistic this is. If this could also be provided as a VEP annotation source that would be awesome, but I suspect it won't be easy (VEP applying annotations based on the consequence annotations applied by VEP 🤯)
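
To give a feel for the size of the de novo and PM5 items, here is a minimal plain-Python sketch of both. Everything in it is an assumption for illustration: the `Call` class, the depth/GQ thresholds, the `(protein ID, residue position)` lookup key and the function names are hypothetical. The de novo check is a naive genotype-only filter for autosomal diploid calls, not a re-implementation of the Hail built-in, so results would need comparing before treating it as equivalent. The PM5 part takes the criterion to mean "a different pathogenic missense change is already known at this residue" and does nothing to address the resource concerns.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Call:
    """One sample's genotype at a site (field names are illustrative)."""
    alleles: Tuple[Optional[int], ...]  # (0, 1) = het; None = missing allele
    depth: int
    gq: int

    def is_called(self) -> bool:
        return all(a is not None for a in self.alleles)

    def is_hom_ref(self) -> bool:
        return self.is_called() and all(a == 0 for a in self.alleles)

    def is_het(self) -> bool:
        return self.is_called() and len(set(self.alleles)) > 1


def naive_de_novo(proband: Call, mother: Call, father: Call,
                  min_depth: int = 10, min_gq: int = 20) -> bool:
    """Genotype-only check: het proband, both parents confidently hom-ref.
    Thresholds are placeholder values, not tuned ones."""
    trio = (proband, mother, father)
    if any(c.depth < min_depth or c.gq < min_gq for c in trio):
        return False
    return proband.is_het() and mother.is_hom_ref() and father.is_hom_ref()


Residue = Tuple[str, int]  # (protein ID, amino acid position)


def build_pm5_lookup(
    pathogenic_missense: Iterable[Tuple[str, int, str]]
) -> Dict[Residue, Set[str]]:
    """Index known pathogenic missense changes by the residue they alter."""
    lookup: Dict[Residue, Set[str]] = defaultdict(set)
    for protein_id, position, alt_aa in pathogenic_missense:
        lookup[(protein_id, position)].add(alt_aa)
    return dict(lookup)


def pm5_hit(lookup: Dict[Residue, Set[str]], protein_id: str,
            position: int, alt_aa: str) -> bool:
    """True if a *different* pathogenic missense change is known here."""
    known = lookup.get((protein_id, position), set())
    return bool(known - {alt_aa})
```

The lookup would be built once from the same ClinVar decisions and shared across fragments; whether that fits in memory per task is exactly the "how realistic is this" question.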

So... we're catastrophically tangled with various CPG libraries, each of which has some reliance on Hail. We want to get away from Hail completely...

That will require altering everything, right down to how output paths are being generated. That will be a huge pain in the ass.