harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Automatically determine correct coverage parameters #15

Open tsackton opened 2 years ago

tsackton commented 2 years ago

At the moment, the coverage-dependent parameters are user-settable in the config file, and the default is the low-coverage option.

In principle, the default could be 'auto', with the correct values determined from the BAM coverage stats. This would be a useful enhancement to add.

This would require calculating mean or median coverage for all BAMs, and then developing a heuristic rule to choose the correct parameter values. A simple first version would apply one set of parameter values to every sample in a run. A more sophisticated alternative would set parameters per sample, but that would first require determining whether joint genotyping with GenotypeGVCFs causes issues when samples were called with different parameter values.
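A rough sketch of what the run-level heuristic might look like, in case it helps the discussion. The 10x cutoff, the `"low"`/`"high"` labels, and the function name are all placeholders I made up for illustration; nothing here is taken from the workflow, and the real cutoff would need to be chosen empirically.

```python
from statistics import median

# Hypothetical cutoff (in x coverage); the actual value is still to be decided.
LOW_COVERAGE_CUTOFF = 10.0


def choose_coverage_profile(sample_coverage, cutoff=LOW_COVERAGE_CUTOFF):
    """Pick one coverage profile for the whole run.

    sample_coverage maps sample name -> mean coverage of that sample's BAM.
    Returns "low" if the median across samples is below the cutoff,
    otherwise "high". The median is used so that a single very deep
    sample does not flip the whole run to the high-coverage settings.
    """
    if not sample_coverage:
        raise ValueError("no coverage values supplied")
    run_median = median(sample_coverage.values())
    return "low" if run_median < cutoff else "high"
```

For example, `choose_coverage_profile({"s1": 4.0, "s2": 6.0, "s3": 30.0})` has a median of 6.0 and so picks the low-coverage profile. A per-sample version would apply the same comparison to each value instead of to the median.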

tsackton commented 1 year ago

As an update to this long-standing enhancement: we do now calculate overall mean coverage as well as the sample-to-sample standard deviation, so this should be a feasible upgrade to implement. We would need to figure out how to update the relevant parameters on the fly after the summary file is generated.
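One possible shape for resolving an 'auto' setting once the summary file exists is sketched below. This is only an illustration under several assumptions: the summary is treated as a tab-delimited table with a `mean_coverage` column, and the column name, the 10x cutoff, the `"low"`/`"high"` labels, and the function name are all hypothetical rather than what the workflow actually writes or uses.

```python
import csv
import io


def resolve_coverage_setting(config_value, summary_handle, cutoff=10.0):
    """Return the coverage profile to use for this run.

    An explicit config value is passed through unchanged. Only "auto"
    triggers a read of the summary table, which is assumed here to be
    tab-delimited with one row per sample and a "mean_coverage" column
    (a hypothetical layout).
    """
    if config_value != "auto":
        return config_value
    rows = list(csv.DictReader(summary_handle, delimiter="\t"))
    if not rows:
        raise ValueError("summary file contains no samples")
    mean_cov = sum(float(row["mean_coverage"]) for row in rows) / len(rows)
    return "low" if mean_cov < cutoff else "high"


# Usage with an in-memory stand-in for the summary file.
example = io.StringIO("sample\tmean_coverage\ns1\t5.0\ns2\t7.0\n")
print(resolve_coverage_setting("auto", example))
```

Within Snakemake, something like this would presumably have to be called from a function that is evaluated only after the summary has been written, for instance behind a checkpoint, so that downstream rules see the resolved values. Whether that fits the current rule layout is an open question.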