harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

restricting variant calls #208

Open TonyKess opened 4 months ago

TonyKess commented 4 months ago

We're trying to use snpArcher on some fairly large genomic datasets at the moment (2000+ individuals), and I am anticipating a lot of slow-down at the genotyping steps. I was wondering if it's possible to specify variant lists (e.g., to genotype subsequent runs against SNPs discovered in a smaller panel of ~500 individuals)? Is something like this already implemented? If not, any ideas on how to implement it?

Thanks!

tsackton commented 4 months ago

Hi @TonyKess,

There are a few issues to consider here. Right now, by default we build interval lists that encompass the entire genome, in order to speed up genotyping through parallelization. Conceptually it would be possible to change this to support reduced interval sets, but my guess is that this is not the most efficient solution. For reasonable performance with GATK you would typically want to include a window of +/- 100 bp around each variable position (for context), so unless your SNP density is quite low (i.e., SNPs are typically >>200 bp apart), you aren't actually going to see any advantage from using variant lists.
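To make that concrete, here is a rough sketch of how a padded interval list could be derived from a known-sites VCF with bcftools and bedtools (file names like `known_sites.vcf.gz` and `genome.chrom.sizes` are placeholders, not anything snpArcher produces):

```bash
# Pad each known site by +/- 100 bp and merge overlapping windows.
bcftools query -f '%CHROM\t%POS0\t%END\n' known_sites.vcf.gz \
  | bedtools slop -i - -g genome.chrom.sizes -b 100 \
  | sort -k1,1 -k2,2n \
  | bedtools merge -i - > padded_intervals.bed
```

Comparing the total span of `padded_intervals.bed` to the genome size is a quick way to check whether restriction would help at all: if the median spacing between SNPs is well under 200 bp, the merged intervals converge on the whole genome.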

A likely better option is to use a tool that specifically supports explicit genotyping of known sites, such as freebayes or potentially something like varlociraptor. While these options are not currently implemented in snpArcher, this is something we would like to support in the future (although a specific timeline is hard to predict).
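For illustration only (this is not a tested snpArcher integration), freebayes exposes this mode directly through its input-alleles flags; file names here are placeholders:

```bash
# Genotype all samples only at the sites and alleles present in the
# known-sites VCF, rather than doing de novo discovery.
freebayes -f reference.fa \
  --variant-input known_sites.vcf.gz \
  --only-use-input-alleles \
  sample1.bam sample2.bam sample3.bam > genotyped.vcf
```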

On the other hand, I've successfully run snpArcher on ~1500 codfish individuals in a couple of weeks, so it is plausible that you could just genotype your whole dataset directly. The main limitation is that we still have some scalability issues with the final merge step, where the individual intervals are gathered into the final VCF, but we could prioritize solving that issue if you go this route.

In any case, because the most computationally costly step is generating the gVCF files with HaplotypeCaller, it certainly makes sense to start with a small dataset. If you later scale up and run snpArcher on the full dataset, you can reuse the gVCFs from the first run (snpArcher will do this automatically if you keep everything in the results directory); if you instead pursue an alternative genotyping approach, you'll already have the VCF from the first run to work with.
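For reference, this reuse falls out of the standard GATK pattern (paths below are illustrative, not snpArcher's actual rule definitions): the expensive per-sample step produces a gVCF once, and only the cheaper joint steps need rerunning when samples are added.

```bash
# Expensive, run once per sample: call a gVCF with HaplotypeCaller.
gatk HaplotypeCaller -R reference.fa -I sampleN.bam \
    -O results/gvcfs/sampleN.g.vcf.gz -ERC GVCF

# Cheaper, redone for each joint-genotyping run over the sample set.
gatk GenomicsDBImport --genomicsdb-workspace-path genomicsdb \
    -L intervals.list -V results/gvcfs/sampleN.g.vcf.gz
gatk GenotypeGVCFs -R reference.fa -V gendb://genomicsdb \
    -O joint_calls.vcf.gz
```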

Happy to help troubleshoot as you proceed, so please keep posting issues if you run into any problems.

Tim

TonyKess commented 4 months ago

Hi Tim,

Thanks for this! For some projects we are now getting to the point where my ad-hoc bash scripts are no longer "safe" for large sample sizes, weird genomes, or a combination of both. I will try to go in the direction of joint genotyping the gVCFs as data from different projects get integrated. A follow-up question on speed: any idea whether it's possible (or useful) to integrate GLnexus for gVCF genotyping?

tsackton commented 4 months ago

Integrating GLnexus would be a very interesting addition. It is not currently on our roadmap; our current focus for speedups has been Sentieon, but as that requires a commercial license it will not be accessible to all users. I haven't looked in detail at benchmarking comparisons of GLnexus vs. GATK GenotypeGVCFs, to be honest, but based on the paper it seems reasonable to assume it would be an improvement.

Conceptually, I don't think this would be that difficult to implement. It would basically replace the GATK GenomicsDB and GenotypeGVCFs rules with a single GLnexus rule (roughly the shape sketched below). It might require a tiny bit of refactoring to simplify rule selection via the config, but nothing too difficult.
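As a sketch only (the `gatk` preset and the paths are assumptions, not a tested rule), the shell command such a rule would wrap might look like:

```bash
# Joint-genotype all gVCFs in one shot with GLnexus, then convert the
# BCF it writes to stdout into a compressed VCF.
glnexus_cli --config gatk --bed intervals.bed \
    results/gvcfs/*.g.vcf.gz > joint_calls.bcf
bcftools view joint_calls.bcf -Oz -o joint_calls.vcf.gz
```

Since GLnexus manages its own embedded database internally, this one rule would stand in for both the import and the genotyping steps.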

We don't currently have a lot of spare developer time, but I'll keep this issue open. And of course we're always happy to welcome contributions and pull requests if you have any interest.