VQSR not fully implemented

snakemake-workflows / dna-seq-gatk-variant-calling

This Snakemake pipeline implements the GATK best-practices workflow

MIT License

242 stars 147 forks source link

VQSR not fully implemented #35

Open lczech opened 2 years ago

lczech commented 2 years ago

It seems that the VQSR (GATK VariantRecalibrator) step here is not fully implemented. That rule only produces the model built with the VariantRecalibrator, but does not use ApplyVQSR to actually perform the filtering.

That means, the output files produced by that rule, all.{vartype}.recalibrated.vcf.gz, are in fact the recal files used by ApplyRecalibration, but not actually the filtered VCFs that we are looking for. Hence, an additional rule needs to be implemented that runs the gatk/applyvqsr wrapper.

If that helps, I've implemented this in my pipeline, which also offers to set all the resource files needed for the VariantRecalibrator to be provided via the config file, so that users don't have to edit the snakemake files.

dlaehnemann commented 2 years ago

Well spotted and thanks for documenting this in so much detail, with good suggestions. You and anybody using this pipeline is welcome to implement this via a PR, and we will look into implementing this if we can find some time. But we are not actively using this pipeline ourselves, so this might be a while.

For further inspiration, the base quality score recalibration (BQSR) setup might also be a good place to look, especially as this uses some of the respective files available in this workflow:

https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/cffa77a9634801152971448bd41d0687cf765723/workflow/rules/mapping.smk#L65-L101

lczech commented 2 years ago

Thanks! I am also not actively using this pipeline here, as I'm developing my own that already fixes the issue, as stated above, so I won't have much time to work in this either here :-(

I also noted the BQSR setup has some similarity in GATK - although it seems that the wrappers for that changed. In previous versions, both steps of BQSR were combined into one wrapper, and only split into two later on. That might have been the reason why this was missed in the pipeline here.

dlaehnemann commented 2 years ago

Sure, that could be a root cause here. Or simply overlooking some details...

Does your workflow above follow the GATK best practices and do you actively maintain it? It might be worth deprecating this workflow in favor of yours, if yours ticks the right boxes...

Also, I realized yours isn't yet listed in the Snakemake Workflow Catalog. That catalog is basically a nightly GitHub crawler that aggregates all snakemake workflows that adhere either to its inclusion criteria. And if you walk the couple of extra meters to enable standardized usage, you even get a quick and easy deployment help for you workflow that you can for example cite in its README.md. As a good example, see the usage info for our dna-seq-varlociraptor workflow.

lczech commented 2 years ago

Does your workflow above follow the GATK best practices and do you actively maintain it?

As far as I am aware, it does indeed implement the best practices. And yes, at least for a while I plan to maintain it. We are currently working on a publication describing the pipeline as well.

Also, I realized yours isn't yet listed in the Snakemake Workflow Catalog.

Indeed, I'd love to have it listed in the catalog! I'll hopefully find some time to walk these extra meters soon ;-)