harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Intervals without Ns in reference #23

Closed cademirch closed 2 years ago

cademirch commented 2 years ago

Creating this issue to discuss and understand how intervals are generated for parallelization. I've been running the pipeline on genomes without Ns, which prompted this. To work around it, I wrote a function that splits the genome into a user-defined number of intervals to scatter over. I am curious about the reasoning behind splitting at Ns as opposed to arbitrary points along the chromosome/contig.

Appreciate any thoughts.
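
For reference, a minimal sketch of the kind of even split described above. This is not the function from the pipeline; the name `split_into_intervals` and the `contig_lengths` dict are hypothetical, and coordinates are assumed to be 1-based and inclusive. Intervals never span two contigs, so the result can contain slightly more than the requested number.

```python
def split_into_intervals(contig_lengths, n_intervals):
    """Split contigs into roughly n_intervals pieces of equal size.

    contig_lengths: dict mapping contig name -> length in bases (hypothetical input).
    Returns a list of (contig, start, end) tuples, 1-based inclusive.
    """
    total = sum(contig_lengths.values())
    # Ceiling division gives the target interval size in bases.
    size = max(1, -(-total // n_intervals))
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            # Cut at an arbitrary coordinate, clipped to the contig end.
            end = min(start + size - 1, length)
            intervals.append((contig, start, end))
            start = end + 1
    return intervals


if __name__ == "__main__":
    for interval in split_into_intervals({"chr1": 100, "chr2": 50}, 3):
        print(interval)
```

Note that the cut points here are purely arithmetic and ignore the sequence content entirely.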

tsackton commented 2 years ago

The logic is that splitting at arbitrary points introduces boundary issues with read mapping, which can then cause problems with variant calling.

One way to avoid this is to have some number of bases overlap between intervals, which becomes less efficient as the number of intervals increases and also makes the gather operations a lot more complicated.
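
To make the efficiency cost concrete, here is a rough sketch of padding intervals with an overlap and counting the bases that end up being processed more than once. The helper names `pad_intervals` and `redundant_bases` are made up for illustration, and it assumes 1-based inclusive, non-overlapping input intervals.

```python
def pad_intervals(intervals, contig_lengths, overlap):
    """Extend each (contig, start, end) interval by `overlap` bases on both sides,
    clipped to the contig boundaries."""
    padded = []
    for contig, start, end in intervals:
        new_start = max(1, start - overlap)
        new_end = min(contig_lengths[contig], end + overlap)
        padded.append((contig, new_start, new_end))
    return padded


def redundant_bases(intervals, padded):
    """Number of extra bases covered by the padded intervals compared with the
    original, non-overlapping ones."""
    core = sum(end - start + 1 for _, start, end in intervals)
    total = sum(end - start + 1 for _, start, end in padded)
    return total - core


if __name__ == "__main__":
    lengths = {"chr1": 100}
    two = [("chr1", 1, 50), ("chr1", 51, 100)]
    four = [("chr1", 1, 25), ("chr1", 26, 50), ("chr1", 51, 75), ("chr1", 76, 100)]
    print(redundant_bases(two, pad_intervals(two, lengths, 10)))
    print(redundant_bases(four, pad_intervals(four, lengths, 10)))
```

On a single contig the redundant work grows as roughly 2 × overlap × (number of intervals − 1). The gather step would additionally have to trim or deduplicate calls made in the overlapping regions, which this sketch does not attempt.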

Another possible solution may be to split on other regions that are problematic in some way, e.g. split on low-mappability segments, on the assumption that boundary issues won't matter there because these regions will be filtered out anyway.

cademirch commented 2 years ago

Based on our meeting, we decided that if the reference has no Ns, whole chromosomes will be used as the intervals. I will be working on this.
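
A minimal sketch of that behaviour, assuming the contig sequence is available as a plain string: split at runs of N, and fall back to the whole contig when there are none. This is not the pipeline's actual code; the function name `intervals_from_n_runs` and the `min_n` parameter are hypothetical, and coordinates are 1-based inclusive.

```python
import re


def intervals_from_n_runs(contig, seq, min_n=1):
    """Return (contig, start, end) intervals of sequence between runs of N.

    Runs of at least `min_n` N characters are treated as split points and are
    excluded from the output. If the sequence has no such run, the single
    interval returned is the whole contig.
    """
    n_run = re.compile(r"[Nn]{%d,}" % min_n)
    intervals = []
    pos = 0  # 0-based offset where the current non-N stretch begins
    for match in n_run.finditer(seq):
        if match.start() > pos:
            intervals.append((contig, pos + 1, match.start()))
        pos = match.end()
    if pos < len(seq):
        # Trailing stretch, or the entire contig when no N run was found.
        intervals.append((contig, pos + 1, len(seq)))
    return intervals


if __name__ == "__main__":
    print(intervals_from_n_runs("chr1", "ACGTNNNNACGT"))
    print(intervals_from_n_runs("chr2", "ACGTACGT"))
```

With this fallback, a genome without Ns is parallelized per chromosome/contig, so no interval boundary falls inside mappable sequence.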