harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

refactoring usage of intervals #55

Closed: tsackton closed this issue 2 years ago

tsackton commented 2 years ago

I think we need to reorganize how this workflow uses interval lists. At the moment we use them very suboptimally, which creates serious performance problems for large call sets and a great deal of brittleness when updating runs with new samples.

From spending some time looking at the GATK workflows for hg38, I think there are a few lessons we can take, though we probably need to spend a bit more time thinking about how to generalize them widely.

In general we have gotten away with a suboptimal setup because we have mostly processed small call sets and have not cared much about run times, but neither of these is likely to remain true once we release this into the wild.

So addressing this, I think, requires the following steps:

  1. Rewrite the create-intervals rule to create a single master interval list, not a series of .list files, and design it to be fail-proof, so that a list is created no matter what. This is probably a substantial simplification of our existing algorithm, since it is simply a matter of splitting the genome at Ns (see the sketch after this list).
  2. Write a new pre-bam2vcf step that creates interval lists specific to bam2vcf, and modify the bam2vcf rule to use these lists.
  3. Write a new post-bam2vcf step that merges the GVCFs per sample.
  4. Write a new pre-vcf2DB rule that creates new intervals for the vcf2DB and DB2vcf steps (and the filter/merge step).
  5. Rewrite the vcf2DB, DB2vcf, and filter/sort/gather rules to use the new intervals; in the process we can probably also treat the GenomicsDB directories as temporary, since they will never be reused.
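
To make step 1 concrete, here is a minimal sketch of the fail-proof splitting idea. Everything here is illustrative rather than taken from the workflow: plain FASTA input, BED-style output, and a hypothetical `min_n_run` cutoff for which N-runs count as gaps. The key property is that a contig with no called bases still yields one whole-contig interval, so a list is always produced.

```python
# Sketch only: build a master interval list by splitting each contig at
# long N-runs. Input/output formats and min_n_run are assumptions.
import re
import sys

def non_n_blocks(seq, min_n_run=500):
    """Yield 0-based half-open (start, end) blocks of sequence lying
    between N-runs of at least min_n_run bases."""
    prev = 0
    for gap in re.finditer("N{%d,}" % min_n_run, seq.upper()):
        if gap.start() > prev:
            yield prev, gap.start()
        prev = gap.end()
    if prev < len(seq):
        yield prev, len(seq)

def fasta_records(path):
    """Stream (name, sequence) records from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

def master_interval_list(path):
    for name, seq in fasta_records(path):
        blocks = list(non_n_blocks(seq))
        if not blocks:
            # Contig is entirely Ns: keep it whole rather than emit
            # nothing, so the master list is created no matter what.
            blocks = [(0, len(seq))]
        for start, end in blocks:
            yield f"{name}\t{start}\t{end}"

if __name__ == "__main__":
    for line in master_interval_list(sys.argv[1]):
        print(line)
```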

What I am not yet totally sure about is the optimal interval number and size for each step. It is clear that the interval number needs to scale with the number of samples to keep run times manageable, and that we need a much greater number of intervals for the joint genotyping steps than for the HaplotypeCaller step. I also believe that for the joint genotyping step there is no strong reason to split only on Ns/gaps, while for the bam2vcf step this matters more (as HaplotypeCaller needs the local context). We also need to balance scalability to large sample sets against performance at more typical numbers, and it may be reasonable to design for a maximum sample set in the low thousands (what would be called a 'small sample size' in human genetics). A toy scaling heuristic is sketched below.
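
Purely to illustrate the scaling question, here is a toy heuristic; none of the constants are benchmarked, and the function names and numbers are invented for this sketch:

```python
# Illustrative only: one way interval counts might scale with cohort size.
import math

def n_intervals(n_samples, base, per_sample, cap):
    """More samples -> more intervals, capped to keep job counts sane."""
    return min(cap, base + math.ceil(per_sample * n_samples))

def plan(n_samples):
    return {
        # HaplotypeCaller intervals must respect N/gap boundaries, so
        # their count is bounded by the number of non-N blocks.
        "bam2vcf": n_intervals(n_samples, base=20, per_sample=0.1, cap=200),
        # Joint genotyping can split anywhere, so scale it harder.
        "vcf2DB": n_intervals(n_samples, base=100, per_sample=1.0, cap=5000),
    }

print(plan(50))    # {'bam2vcf': 25, 'vcf2DB': 150}
print(plan(2000))  # {'bam2vcf': 200, 'vcf2DB': 2100}
```

The real values would have to come from benchmarking, which ties into the test dataset idea below.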

None of this should impact existing runs, as nothing is changing other than the interval steps. However, we will probably want to force-run the merge-GVCF job on finished datasets for cleanup purposes, force-run the master interval list creation step, and manually delete/clean up/move results files as needed.

I do think it would be much better to get this right for the version that accompanies the paper.

cademirch commented 2 years ago

Thank you for writing this up, Tim. I've had issues with the intervals when running our datasets but haven't had the time to document them like you have here. We ended up sidestepping the issue by using Sentieon (obviously not the solution here), which is probably why I haven't thought about this in a while.

I agree that the interval step should always work. I've written an algorithm that takes a set of intervals and puts them into N roughly equal (by bp) groups; a sketch of the idea is below. This is nice because the number of interval groups is then a known constant at the beginning of the workflow.
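
A minimal sketch of that grouping idea, assuming `(chrom, start, end)` tuples as input (the actual implementation may differ): sort intervals longest-first and always drop the next one into the currently lightest group, i.e. greedy LPT scheduling.

```python
# Sketch: bin intervals into n_groups groups of roughly equal total bp.
import heapq

def group_intervals(intervals, n_groups):
    """intervals: iterable of (chrom, start, end); returns n_groups lists
    whose total bp is approximately balanced."""
    # Min-heap of (current_bp, group_index), one entry per empty group.
    heap = [(0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Place the largest intervals first (greedy LPT scheduling).
    for chrom, start, end in sorted(intervals,
                                    key=lambda iv: iv[2] - iv[1],
                                    reverse=True):
        bp, idx = heapq.heappop(heap)  # lightest group so far
        groups[idx].append((chrom, start, end))
        heapq.heappush(heap, (bp + (end - start), idx))
    return groups

# Example: 5 intervals into 2 groups of roughly equal bp.
ivs = [("chr1", 0, 1000), ("chr1", 1500, 2000), ("chr2", 0, 800),
       ("chr2", 900, 1200), ("chr3", 0, 300)]
for i, g in enumerate(group_intervals(ivs, 2)):
    print(i, sum(e - s for _, s, e in g), g)
```

Because `n_groups` is fixed up front, the scatter width is known when the DAG is built, which is what makes the downstream rules easy to declare.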

I think it could also be worthwhile to make a test dataset that is somewhat representative of real data, which we could use for benchmarking.

I'll add more thoughts later, but I'm happy to help out with this where I can!

tsackton commented 2 years ago

Closed by #61