Closed cademirch closed 2 years ago
I am running some tests with this branch now. After putting it through its paces on a substantial real dataset, I will add comments/bugfixes as needed and merge.
Putting a few notes/ideas here as I do some test runs.
First comment is that we probably still want to use one subdirectory per biosample for the gVCF temporary output: if lots of these jobs are running at once you can still end up with some very large directories, even though they shrink as the pipeline runs. It also makes it easier to check progress, since only finished g.vcf files will be in the top-level (03_gvcf) directory.
Refactoring how we generate and use intervals. Currently, we use the same set of interval lists for gVCF generation and VCF generation. However, this is inefficient: VCF generation can become extremely slow with many samples on the large intervals that gVCF generation requires. This PR solves that by creating separate intervals for bam2gvcf and gvcf2vcf.
bam2gvcf: The output from picard splitbyNs is read into a Python script that creates interval lists of roughly equal size for bam2gvcf. The number of interval lists created is controlled by `config["maxNumIntervals"]`. These intervals are used to create per-interval gVCFs, which are then combined into a final per-sample gVCF.

gvcf2vcf: We calculate the number of DB interval lists to create using the following equation:

`config["maxNumIntervals"] * config["db_scatter_factor"] * num_samples`

The interval lists are created by gatk SplitIntervals, which subdivides the larger intervals to produce the desired number of interval lists. This should speed up DB and VCF build times.
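To make the interval bookkeeping concrete, here is a plain-Python sketch of the two steps above. This is a hypothetical illustration, not the actual pipeline script: `partition_intervals`, the toy intervals, and the example config values are all made up, though the DB list count follows the equation given above.

```python
def partition_intervals(intervals, num_lists):
    """Greedily assign intervals to num_lists bins of roughly equal total length.

    `intervals` is a list of (contig, start, end) tuples, e.g. parsed from the
    picard splitbyNs output. Returns a list of num_lists interval lists.
    """
    bins = [[] for _ in range(num_lists)]
    totals = [0] * num_lists
    # Largest-first greedy: drop each interval into the currently smallest bin.
    for contig, start, end in sorted(
        intervals, key=lambda iv: iv[2] - iv[1], reverse=True
    ):
        i = totals.index(min(totals))
        bins[i].append((contig, start, end))
        totals[i] += end - start
    return bins


# Hypothetical config values and intervals, for illustration only.
config = {"maxNumIntervals": 4, "db_scatter_factor": 2}
num_samples = 10

intervals = [
    ("chr1", 0, 1000),
    ("chr1", 2000, 9000),
    ("chr2", 0, 5000),
    ("chr2", 6000, 6500),
    ("chr3", 0, 3000),
]

# bam2gvcf: maxNumIntervals lists of roughly equal total length.
gvcf_lists = partition_intervals(intervals, config["maxNumIntervals"])

# gvcf2vcf: number of DB interval lists, per the equation above.
num_db_lists = config["maxNumIntervals"] * config["db_scatter_factor"] * num_samples
```

The greedy largest-first heuristic keeps the bins balanced without needing an exact partition, which is fine here since gVCF runtime only needs to be roughly even across jobs.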
To do:
- Rename `config["maxNumIntervals"]` to something more accurate, like `num_gvcf_interval_lists`.
- Make `create_db_intervals` not a checkpoint, since it should always output the specified number of lists. Though it could be safer to keep it as a checkpoint.
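On the checkpoint question: if `create_db_intervals` always produces a number of lists computable from the config, its outputs can be declared statically and it can be a plain rule. A hypothetical sketch of what that could look like — all paths, file names, and the `samples` variable here are illustrative, not the pipeline's actual layout:

```snakemake
# Scatter count is known up front from config, so outputs are static
# and downstream rules need no checkpoint re-evaluation.
NUM_DB_LISTS = config["maxNumIntervals"] * config["db_scatter_factor"] * len(samples)

rule create_db_intervals:
    input:
        ref="results/ref/genome.fasta",
        intervals="results/intervals/gvcf_intervals.interval_list",
    output:
        # SplitIntervals names its outputs 0000-scattered.interval_list, etc.
        [
            f"results/intervals/db/{n:04d}-scattered.interval_list"
            for n in range(NUM_DB_LISTS)
        ],
    params:
        outdir="results/intervals/db",
        n=NUM_DB_LISTS,
    shell:
        "gatk SplitIntervals -R {input.ref} -L {input.intervals} "
        "--scatter-count {params.n} -O {params.outdir}"
```

The trade-off is that if SplitIntervals ever emits fewer lists than requested (possible with some subdivision modes on very small interval sets), the static output declaration breaks, which is the argument for keeping the checkpoint.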