Closed cademirch closed 2 years ago
I am running some tests with this branch now. After putting it through its paces on a substantial real dataset, I will add comments/bugfixes as needed and merge.
Putting a few notes/ideas here as I do some test runs.
First comment is that we probably still want to use one subdirectory per biosample for the gVCF temporary output: if lots of these jobs are running at once you can still end up with some very large directories, even though they shrink as the pipeline runs. It also makes it easier to check progress, since only finished g.vcf files will be in the top-level (03_gvcf) directory.
Refactoring how we generate and use intervals. Currently, we use the same set of interval lists for gVCF generation and VCF generation. However, this is inefficient: VCF generation can become extremely slow with many samples on the large intervals that gVCF generation requires. This PR solves that by creating separate intervals for bam2gvcf and gvcf2vcf.
bam2gvcf: The output from picard splitbyNs is read into a Python script that creates interval lists of roughly equal size for bam2gvcf. The number of interval lists created is controlled by `config["maxNumIntervals"]`. These intervals are used to create per-interval gVCFs, which are then combined into a final per-sample gVCF.

gvcf2vcf: We calculate the number of DB interval lists to create using the following equation:

`config["maxNumIntervals"] * config["db_scatter_factor"] * num_samples`

The interval lists are created by gatk SplitIntervals, which subdivides the larger intervals to produce the desired number of interval lists. This should speed up DB and VCF build times.
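To make the interval bookkeeping concrete, here is a plain-Python sketch of the two steps above. This is a hypothetical illustration, not the actual pipeline script: `partition_intervals`, the toy intervals, and the example config values are all made up, though the DB list count follows the equation given above.

```python
def partition_intervals(intervals, num_lists):
    """Greedily assign intervals to num_lists bins of roughly equal total length.

    `intervals` is a list of (contig, start, end) tuples, e.g. parsed from the
    picard splitbyNs output. Returns a list of num_lists interval lists.
    """
    bins = [[] for _ in range(num_lists)]
    totals = [0] * num_lists
    # Largest-first greedy: drop each interval into the currently smallest bin.
    for contig, start, end in sorted(
        intervals, key=lambda iv: iv[2] - iv[1], reverse=True
    ):
        i = totals.index(min(totals))
        bins[i].append((contig, start, end))
        totals[i] += end - start
    return bins


# Hypothetical config values and intervals, for illustration only.
config = {"maxNumIntervals": 4, "db_scatter_factor": 2}
num_samples = 10

intervals = [
    ("chr1", 0, 1000),
    ("chr1", 2000, 9000),
    ("chr2", 0, 5000),
    ("chr2", 6000, 6500),
    ("chr3", 0, 3000),
]

# bam2gvcf: maxNumIntervals lists of roughly equal total length.
gvcf_lists = partition_intervals(intervals, config["maxNumIntervals"])

# gvcf2vcf: number of DB interval lists, per the equation above.
num_db_lists = config["maxNumIntervals"] * config["db_scatter_factor"] * num_samples
```

The greedy largest-first heuristic keeps the bins balanced without needing an exact partition, which is fine here since gVCF runtime only needs to be roughly even across jobs.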
To do:
- Rename `config["maxNumIntervals"]` to something more accurate, like `num_gvcf_interval_lists`.
- Make `create_db_intervals` not a checkpoint, since it should always output the specified number of lists. Though it could be safer to keep it as a checkpoint.
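On the checkpoint question: if `create_db_intervals` always produces a number of lists computable from the config, its outputs can be declared statically and it can be a plain rule. A hypothetical sketch of what that could look like — all paths, file names, and the `samples` variable here are illustrative, not the pipeline's actual layout:

```snakemake
# Scatter count is known up front from config, so outputs are static
# and downstream rules need no checkpoint re-evaluation.
NUM_DB_LISTS = config["maxNumIntervals"] * config["db_scatter_factor"] * len(samples)

rule create_db_intervals:
    input:
        ref="results/ref/genome.fasta",
        intervals="results/intervals/gvcf_intervals.interval_list",
    output:
        # SplitIntervals names its outputs 0000-scattered.interval_list, etc.
        [
            f"results/intervals/db/{n:04d}-scattered.interval_list"
            for n in range(NUM_DB_LISTS)
        ],
    params:
        outdir="results/intervals/db",
        n=NUM_DB_LISTS,
    shell:
        "gatk SplitIntervals -R {input.ref} -L {input.intervals} "
        "--scatter-count {params.n} -O {params.outdir}"
```

The trade-off is that if SplitIntervals ever emits fewer lists than requested (possible with some subdivision modes on very small interval sets), the static output declaration breaks, which is the argument for keeping the checkpoint.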