harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

refactoring usage of intervals #55

Closed: tsackton closed this issue 2 years ago

tsackton commented 2 years ago

I think we need to reorganize how this workflow uses interval lists. At the moment we use them very suboptimally, which creates serious performance problems for large call sets and a great deal of brittleness when updating runs with new samples.

From spending some time looking at the GATK workflows for hg38, I think there are a few lessons we can take, though we probably need to spend a bit more time thinking about how to generalize them widely.

In general we have gotten away with a suboptimal setup because we have mostly processed small call sets and have not cared much about run times, but neither of these is likely to remain true once we release this into the wild.

So addressing this, I think, requires the following steps:

  1. Rewrite the create-intervals rule to create a single master interval list, not a series of .list files, and design it to be fail-proof, so that a list is created no matter what. This is probably a substantial simplification of our existing algorithm, since it is simply a matter of splitting the genome at Ns (see the sketch after this list).
  2. Write a new pre-bam2vcf step that creates interval lists specific to bam2vcf, and modify the bam2vcf rule to use these lists.
  3. Write a new post-bam2vcf step that merges the GVCFs per sample.
  4. Write a new pre-vcf2DB rule that creates new intervals for the vcf2DB and DB2vcf steps (and the filter/merge step).
  5. Rewrite the vcf2DB, DB2vcf, and filter/sort/gather rules to use the new intervals; in the process we can probably also treat the GenomicsDB directories as temporary, since they will never be reused.
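
To make step 1 concrete, here is a minimal sketch of the fail-proof splitting idea. Everything here is illustrative rather than taken from the workflow: plain FASTA input, BED-style output, and a hypothetical `min_n_run` cutoff for which N-runs count as gaps. The key property is that a contig with no called bases still yields one whole-contig interval, so a list is always produced.

```python
# Sketch only: build a master interval list by splitting each contig at
# long N-runs. Input/output formats and min_n_run are assumptions.
import re
import sys

def non_n_blocks(seq, min_n_run=500):
    """Yield 0-based half-open (start, end) blocks of sequence lying
    between N-runs of at least min_n_run bases."""
    prev = 0
    for gap in re.finditer("N{%d,}" % min_n_run, seq.upper()):
        if gap.start() > prev:
            yield prev, gap.start()
        prev = gap.end()
    if prev < len(seq):
        yield prev, len(seq)

def fasta_records(path):
    """Stream (name, sequence) records from a FASTA file."""
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

def master_interval_list(path):
    for name, seq in fasta_records(path):
        blocks = list(non_n_blocks(seq))
        if not blocks:
            # Contig is entirely Ns: keep it whole rather than emit
            # nothing, so the master list is created no matter what.
            blocks = [(0, len(seq))]
        for start, end in blocks:
            yield f"{name}\t{start}\t{end}"

if __name__ == "__main__":
    for line in master_interval_list(sys.argv[1]):
        print(line)
```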

What I am not yet totally sure about is the optimal interval number and size for each step. It is clear that the interval number needs to scale with the number of samples to keep run times manageable, and that we need a much greater number of intervals for the joint genotyping steps than for the HaplotypeCaller step. I also believe that for the joint genotyping step there is no strong reason to split only on Ns/gaps, while for the bam2vcf step this matters more (as HaplotypeCaller needs the local context). We also need to balance scalability to large sample sets against performance at more typical numbers, and it may be reasonable to design for a maximum sample set in the low thousands (what would be called a 'small sample size' in human genetics). A toy scaling heuristic is sketched below.
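
Purely to illustrate the scaling question, here is a toy heuristic; none of the constants are benchmarked, and the function names and numbers are invented for this sketch:

```python
# Illustrative only: one way interval counts might scale with cohort size.
import math

def n_intervals(n_samples, base, per_sample, cap):
    """More samples -> more intervals, capped to keep job counts sane."""
    return min(cap, base + math.ceil(per_sample * n_samples))

def plan(n_samples):
    return {
        # HaplotypeCaller intervals must respect N/gap boundaries, so
        # their count is bounded by the number of non-N blocks.
        "bam2vcf": n_intervals(n_samples, base=20, per_sample=0.1, cap=200),
        # Joint genotyping can split anywhere, so scale it harder.
        "vcf2DB": n_intervals(n_samples, base=100, per_sample=1.0, cap=5000),
    }

print(plan(50))    # {'bam2vcf': 25, 'vcf2DB': 150}
print(plan(2000))  # {'bam2vcf': 200, 'vcf2DB': 2100}
```

The real values would have to come from benchmarking, which ties into the test dataset idea below.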

None of this should impact existing runs, as nothing is changing other than the interval steps. However, we will probably want to force-run the merge-GVCF job on finished datasets for cleanup purposes, force-run the master interval list creation step, and manually delete/clean up/move results files as needed.

I do think it would be much better to get this right for the version that accompanies the paper.

cademirch commented 2 years ago

Thank you for writing this up, Tim. I've had issues with the intervals when running our datasets but haven't had the time to document them like you have here. We ended up sidestepping the issue by using Sentieon (obviously not the solution here), which is probably why I haven't thought about this in a while.

I agree that the interval step should always work. I've written an algorithm that takes a set of intervals and puts them into N roughly equal (by bp) groups; a sketch of the idea is below. This is nice because the number of interval groups is then a known constant at the beginning of the workflow.
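
A minimal sketch of that grouping idea, assuming `(chrom, start, end)` tuples as input (the actual implementation may differ): sort intervals longest-first and always drop the next one into the currently lightest group, i.e. greedy LPT scheduling.

```python
# Sketch: bin intervals into n_groups groups of roughly equal total bp.
import heapq

def group_intervals(intervals, n_groups):
    """intervals: iterable of (chrom, start, end); returns n_groups lists
    whose total bp is approximately balanced."""
    # Min-heap of (current_bp, group_index), one entry per empty group.
    heap = [(0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Place the largest intervals first (greedy LPT scheduling).
    for chrom, start, end in sorted(intervals,
                                    key=lambda iv: iv[2] - iv[1],
                                    reverse=True):
        bp, idx = heapq.heappop(heap)  # lightest group so far
        groups[idx].append((chrom, start, end))
        heapq.heappush(heap, (bp + (end - start), idx))
    return groups

# Example: 5 intervals into 2 groups of roughly equal bp.
ivs = [("chr1", 0, 1000), ("chr1", 1500, 2000), ("chr2", 0, 800),
       ("chr2", 900, 1200), ("chr3", 0, 300)]
for i, g in enumerate(group_intervals(ivs, 2)):
    print(i, sum(e - s for _, s, e in g), g)
```

Because `n_groups` is fixed up front, the scatter width is known when the DAG is built, which is what makes the downstream rules easy to declare.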

I think it could also be worthwhile to make a test dataset that is somewhat representative of real data, which we could use for benchmarking.

I'll add more thoughts later, but I'm happy to help out with this where I can!

tsackton commented 2 years ago

Closed by #61