populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

No Really, Remove Hail Reliance? #312

Open MattWellie opened 9 months ago

MattWellie commented 9 months ago

Current workflow:

*Note: all workflow orchestration is done using Hail Batch.

Why it's good

Why it's bad

Can it be changed...


The Massive Re-Write Proposal 😓

Key decision drivers:


Workflow (assuming annotation is required)

  1. Take VCF data and fragment into intervals (similar to current process)
  2. Scatter VEP tasks, each turning one interval's VCF into an annotated VCF
  3. On each annotated VCF fragment, run a labelling workflow (a plain Python implementation of the current process: read variant data in, filter/label, and write a new minimised VCF output)
  4. Gather all labelled fragments into a single VCF
  5. Continue on to existing downstream process, no modification required
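
To make the shape of steps 1-4 concrete, here is a rough stdlib-only sketch of fragment, label, and gather. It is an illustration under assumptions, not the pipeline's actual code: the function names, the `AF` INFO key, the `max_af` threshold, and the `LABEL` field are all hypothetical, fragmenting is done by record count rather than by genomic interval, and the VEP step is stood in for by a pass-through `annotate` callable.

```python
from typing import Callable, Iterator

# Hypothetical header line declaring the label field written by label_fragment.
LABEL_HEADER = '##INFO=<ID=LABEL,Number=1,Type=String,Description="Hypothetical label">'


def split_header_and_records(vcf_lines: list[str]) -> tuple[list[str], list[str]]:
    """Separate '#'-prefixed header lines from variant records."""
    header = [line for line in vcf_lines if line.startswith("#")]
    records = [line for line in vcf_lines if line and not line.startswith("#")]
    return header, records


def fragment(records: list[str], chunk_size: int) -> Iterator[list[str]]:
    """Step 1 (simplified): chunk by record count instead of genomic interval."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def parse_info(info: str) -> dict[str, str]:
    """Parse a VCF INFO column ('K=V;FLAG;...') into a dict; flags map to ''."""
    parsed = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def label_fragment(records: list[str], max_af: float = 0.01) -> list[str]:
    """Step 3: drop common variants, then minimise INFO down to a single label."""
    labelled = []
    for line in records:
        fields = line.split("\t")
        info = parse_info(fields[7])
        try:
            af = float(info.get("AF", "0"))
        except ValueError:
            af = 0.0  # missing or non-numeric AF is kept (assumption)
        if af > max_af:
            continue
        fields[7] = "LABEL=rare"
        labelled.append("\t".join(fields))
    return labelled


def gather(header: list[str], fragments: list[list[str]]) -> list[str]:
    """Step 4: one VCF again; the new INFO line goes just before '#CHROM'."""
    new_header = header[:-1] + [LABEL_HEADER] + header[-1:]
    return new_header + [rec for frag in fragments for rec in frag]


def run(vcf_lines: list[str], chunk_size: int,
        annotate: Callable[[list[str]], list[str]] = lambda frag: frag) -> list[str]:
    """Scatter over fragments; 'annotate' is a placeholder for the VEP task."""
    header, records = split_header_and_records(vcf_lines)
    labelled = [label_fragment(annotate(frag)) for frag in fragment(records, chunk_size)]
    return gather(header, labelled)


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.001",
    "chr1\t200\t.\tC\tT\t.\tPASS\tAF=0.2",
    "chr2\t300\t.\tG\tA\t.\tPASS\tAF=0.005",
]
result = run(example, chunk_size=2)
```

Each fragment is processed independently, so the list comprehension in `run` is the part a workflow engine would scatter across jobs; nothing in the labelling step needs Hail.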

Hail-specific things that would need changing

ISSUES

So... we're catastrophically tangled with various CPG libraries, each of which has some reliance on Hail. We want to get away from Hail completely...

That will require altering everything, right down to how output paths are being generated. That will be a huge pain in the ass.