populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Make better use of Hail Features #333

Open MattWellie opened 6 months ago

MattWellie commented 6 months ago

Current (re-) annotation process involves

This has been done to emulate the annotation cycle in the CPG pipeline, which starts from a VCF, fragments it, annotates the fragments, and then builds the complete dataset.

Whoa whoa whoa, that's a terrible idea... Possible improvements:

  1. Don't generate a regular VCF at all. Instead, generate a sites-only VCF and then apply the annotations directly back into the MT
  2. Don't generate a whole-genome VCF and then fragment it by genome region. Instead, use Hail's parallel export to write a set number of VCFs. This parallel export should be MUCH faster
  3. Implement region filtering prior to export (this has been done)
  4. Splitting a genome by region results in unbalanced partitions. An MT repartition -> parallel export would result in equal-sized fragments
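
A rough sketch of how points 1, 2 and 4 could fit together in Hail, not the pipeline's actual code. The function names, paths, shard count, and the `vep` field are hypothetical placeholders; the Hail calls themselves (`read_matrix_table`, `repartition`, `rows`, `export_vcf` with `parallel=`, `annotate_rows`) are standard, but how well this behaves on a real cohort is untested here.

```python
def export_sites_only_shards(mt_path: str, out_path: str, n_shards: int) -> None:
    """Export a sites-only VCF as n_shards fragments in one parallel write."""
    # Imported lazily so this module can be loaded without Hail installed
    import hail as hl

    mt = hl.read_matrix_table(mt_path)

    # Rebalance first so each exported fragment holds a similar number of rows
    mt = mt.repartition(n_shards)

    # rows() drops the entry (genotype) fields; exporting a Table
    # rather than a MatrixTable produces a sites-only VCF
    sites = mt.rows()

    # One file per partition, each with its own header, written under out_path
    hl.export_vcf(sites, out_path, parallel='header_per_shard')


def apply_annotations(mt_path: str, annotations_path: str, out_path: str) -> None:
    """Join annotations back onto the original MT by row key."""
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # Assumption: annotations were loaded into a Table keyed by locus and alleles,
    # with a (hypothetical) row field called 'vep'
    annotations = hl.read_table(annotations_path)
    mt = mt.annotate_rows(vep=annotations[mt.row_key].vep)
    mt.write(out_path, overwrite=True)
```

With region filtering (point 3) already in place, the filter would run before the `repartition` call so that the rebalancing only sees the retained rows.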

Complications:

  1. Hail's parallel export uses unpredictable file names, but it also writes a file that serves as a manifest. The simplest version of this becomes a 2-step process:
    • A small method/PythonJob to parse the manifest txt file, find the Nth entry (depending on the job), and copy that fragment into the batch as a local file
    • Annotate that fragment
  2. It's a lot of re-writing for functionality we will rarely use. Is it worth it?
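
The manifest lookup in complication 1 could be as small as the sketch below. It assumes the manifest is plain text with one fragment file name per line, in export order, and that the names are relative to the export directory; neither assumption has been checked against Hail's actual output. `nth_fragment` and `fragment_path` are hypothetical helper names.

```python
def nth_fragment(manifest_text: str, index: int) -> str:
    """Return the index-th (0-based) fragment name listed in the manifest."""
    # Ignore blank lines and surrounding whitespace
    entries = [line.strip() for line in manifest_text.splitlines() if line.strip()]
    if not 0 <= index < len(entries):
        raise IndexError(f'manifest has {len(entries)} entries, asked for {index}')
    return entries[index]


def fragment_path(export_dir: str, manifest_text: str, index: int) -> str:
    """Build the full path of the fragment a given job should copy in."""
    return f"{export_dir.rstrip('/')}/{nth_fragment(manifest_text, index)}"
```

Each annotation job would call `fragment_path` with its own index, copy that file into the batch, and annotate it.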

MattWellie commented 2 months ago

Relevant: https://github.com/populationgenomics/production-pipelines/pull/629