harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

feature request: give a bcftools option #186

Open andrewkern opened 1 month ago

andrewkern commented 1 month ago

hey @tsackton and the snpArcher team-

our lab group recently came across this paper suggesting that bcftools might be better suited for non-model organism SNP calling.

Would you all have any bandwidth to implement a bcftools SNP calling option in addition to what the pipeline has now?

thanks for considering this! Andy

tsackton commented 1 month ago

Hey Andy,

Implementing options to use alternative SNP callers (e.g., bcftools, DeepVariant, FreeBayes, potentially others) is on our roadmap, as is the ability to run multiple calling options on the same samples to, e.g., bootstrap high-confidence SNP sets. bcftools is the easiest of these and likely the first one we'll implement, but we don't have a timeline for when we might have bandwidth for this (although contributions are always welcome!). The main challenge is figuring out an appropriate sharding strategy for parallelization that works well with bcftools, and figuring out how (and whether it is even possible) to scale it so you can still genotype relatively large sample sizes in a reasonable time.
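To make the sharding question concrete, here is a minimal sketch of one possible approach: cut each contig into fixed-size windows and build one `bcftools mpileup | bcftools call` command per window. This is an illustration only, not snpArcher's implementation or a planned design. The helper names (`make_shards`, `bcftools_command`), the file names, and the window size are all hypothetical. The commands are built as strings and not executed.

```python
from typing import Dict, List, Tuple

Shard = Tuple[str, int, int]  # (contig, start, end), 1-based and inclusive


def make_shards(contig_lengths: Dict[str, int], shard_size: int) -> List[Shard]:
    """Split each contig into windows of at most shard_size bp (hypothetical helper)."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    shards: List[Shard] = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, shard_size):
            end = min(start + shard_size - 1, length)
            shards.append((contig, start, end))
    return shards


def bcftools_command(ref: str, bam_list: str, shard: Shard, out_vcf: str) -> str:
    """Build (but do not run) a per-shard calling command (hypothetical helper)."""
    contig, start, end = shard
    region = f"{contig}:{start}-{end}"
    return (
        f"bcftools mpileup -Ou -f {ref} -b {bam_list} -r {region} "
        f"| bcftools call -mv -Oz -o {out_vcf}"
    )


if __name__ == "__main__":
    # Contig lengths would normally come from the reference .fai index;
    # these values are made up for illustration.
    lengths = {"chr1": 2500, "chr2": 900}
    for i, shard in enumerate(make_shards(lengths, shard_size=1000)):
        print(bcftools_command("ref.fa", "bams.txt", shard, f"shard_{i}.vcf.gz"))
```

Each command could then run as an independent job, with the per-shard VCFs combined afterwards (for example with `bcftools concat`). Note that this only parallelizes across the genome. It does not address the second concern above, scaling to large sample sizes, since every shard still piles up all samples at once.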

FWIW and as an aside, I'm pretty skeptical of the paper you linked - I think there is something wrong with their bowtie2 mappings, for one (a 75% mapping rate is very low, especially given their maximum divergence is 2%). They also don't simulate indels, which means one of the limitations of bcftools in real data (failure to do local reassembly around potentially misaligned regions) is erased, making it look better than it (probably) is.

In any case, feature request noted and this is definitely on our list albeit without a definite timeline. I'll keep this open to remind us to think about the implementation though!

Tim