harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.

restricting variant calls #208


TonyKess commented 1 week ago

We're trying to use snpArcher on some fairly large genomic datasets at the moment (2000+ individuals), and I am anticipating a lot of slow-down at the genotyping steps. I was wondering whether it's possible to specify variant lists, e.g. to genotype subsequent runs only at sites discovered in a smaller panel of individuals (~500)? Is something like this already implemented? If not, any ideas on how to implement it?

Thanks!

tsackton commented 1 week ago

Hi @TonyKess,

There are a few issues to consider here. Right now, by default we build interval lists that span the entire genome, in order to speed up genotyping through parallelization. Conceptually it would be possible to change this to support reduced interval sets, but my guess is that this is not the most efficient solution. For reasonable performance with GATK you would typically want to include a window of +/- 100 bp of context around each variable position, so unless your SNP density is quite low (i.e., SNPs are typically >>200 bp apart), you aren't actually going to see any advantage from using variant lists.
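To make the padding point concrete, here is a rough sketch (plain Python; the file names and the 100 bp padding are placeholders for this example, not anything snpArcher does today) of how one could turn a known-sites VCF into padded, merged GATK-style intervals:

```python
# Sketch: build padded, merged GATK-style intervals around known variant sites.
# Assumptions (not snpArcher behaviour): known_sites.vcf.gz is a coordinate-sorted
# VCF from the discovery panel; padding is +/- 100 bp; output is chrom:start-end.
import gzip

PAD = 100  # bp of context on each side of every variant

def read_sites(vcf_path):
    """Yield (chrom, pos) for every record in a (possibly gzipped) VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos = line.split("\t", 2)[:2]
            yield chrom, int(pos)

def padded_intervals(sites, pad=PAD):
    """Merge overlapping or adjacent +/- pad windows into single intervals."""
    current = None
    for chrom, pos in sites:
        start, end = max(1, pos - pad), pos + pad
        if current and current[0] == chrom and start <= current[2] + 1:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                yield current
            current = (chrom, start, end)
    if current:
        yield current

if __name__ == "__main__":
    with open("known_sites.intervals", "w") as out:
        for chrom, start, end in padded_intervals(read_sites("known_sites.vcf.gz")):
            out.write(f"{chrom}:{start}-{end}\n")  # usable with GATK -L
```

Running something like this on your discovery VCF is also a quick way to check the density argument: if the merged intervals end up covering most of the genome, restricting to known sites buys you very little.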

A better option is likely to use a tool that explicitly supports genotyping at known sites, such as freebayes or potentially something like varlociraptor. These options are not currently implemented in snpArcher, but this is something we would like to support in the future (although a specific timeline is hard to predict).
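For reference, known-sites genotyping with freebayes would look roughly like the sketch below (a plain Python wrapper around the command line; the file names are placeholders, and the flags should be double-checked against the freebayes version you have installed):

```python
# Sketch: genotype a cohort only at previously discovered sites with freebayes.
# File names are placeholders; flag names are from the freebayes documentation
# and should be verified against your installed version.
import subprocess

cmd = [
    "freebayes",
    "--fasta-reference", "reference.fa",
    "--bam-list", "cohort_bams.txt",          # one BAM path per line
    "--variant-input", "known_sites.vcf.gz",  # alleles from the discovery panel
    "--only-use-input-alleles",               # restrict calls to those alleles
]

# freebayes writes VCF to stdout, so redirect it to a file.
with open("cohort_known_sites.vcf", "w") as vcf_out:
    subprocess.run(cmd, stdout=vcf_out, check=True)
```

At 2000+ samples you would still want to split this by region (e.g. with freebayes' --region option or the freebayes-parallel wrapper) rather than run it as a single job.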

On the other hand, I've successfully run snpArcher on ~1500 codfish individuals in a couple of weeks, so it is plausible that you could just genotype your whole dataset directly. The main limitation is that we still have some scalability issues with the final merge step, where all of the per-interval VCFs are gathered into the final VCF, but we could easily prioritize solving this if you go that route.
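For context, that gather step is essentially one serial concatenation of every per-interval VCF back into a single genome-wide file, roughly like the sketch below (the paths and interval count are invented for illustration; this is not snpArcher's actual rule), which is why it becomes a bottleneck as the cohort grows:

```python
# Sketch: gather per-interval VCFs into one genome-wide VCF with bcftools.
# Paths and the interval count are hypothetical placeholders.
import subprocess

interval_vcfs = [f"results/intervals/interval_{i:04d}.vcf.gz" for i in range(200)]

subprocess.run(
    ["bcftools", "concat", "-Oz", "-o", "final.vcf.gz", *interval_vcfs],
    check=True,
)
subprocess.run(["bcftools", "index", "-t", "final.vcf.gz"], check=True)
```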

In any case, because the most computationally costly step is generating the gVCF files with HaplotypeCaller, it certainly makes sense to start with a smaller dataset. If you then decide to scale up and run snpArcher on the full dataset, you can reuse the gVCFs from the first run (snpArcher will do this automatically if you keep everything in the results directory); if you decide to pursue an alternative genotyping approach instead, you will already have the VCF from the first run to use.

Happy to help troubleshoot as you proceed, so please keep posting issues if you run into any problems.

Tim