lcdb / lcdb-wf

Robust, tested workflows for RNA-seq, ChIP-seq and other high-throughput sequencing analysis
https://lcdb.github.io/lcdb-wf

variant calling workflow + testing #371

Open fridells51 opened 1 year ago

fridells51 commented 1 year ago

This pull request adds an end-to-end Snakemake variant calling workflow to lcdb-wf. The Snakefile handles references, read mapping, and QC, and includes a GATK best practices pipeline for germline and somatic variant calling. The workflow supports whole genome sequencing (WGS) and targeted sequencing inputs and returns analysis-ready, annotated VCFs.
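As a rough illustration of the germline/somatic split, here is a minimal Python sketch that picks a GATK caller by mode and builds its command line. It assumes the usual GATK best practices pairing (HaplotypeCaller for germline, Mutect2 for somatic); the function name `calling_command` and the file names are hypothetical, and the actual rules in the Snakefile may pass different or additional arguments.

```python
def calling_command(mode: str, reference: str, bam: str, out_vcf: str) -> list[str]:
    """Return an argv list for the GATK caller matching the requested mode.

    Hypothetical helper for illustration; not taken from the workflow itself.
    """
    # Assumed mapping, following the standard GATK best practices pairing.
    callers = {"germline": "HaplotypeCaller", "somatic": "Mutect2"}
    if mode not in callers:
        raise ValueError(f"mode must be one of {sorted(callers)}, got {mode!r}")
    # -R (reference), -I (input BAM) and -O (output VCF) are the core arguments
    # shared by both callers.
    return ["gatk", callers[mode], "-R", reference, "-I", bam, "-O", out_vcf]


if __name__ == "__main__":
    # Placeholder file names; prints the command rather than running it.
    print(" ".join(calling_command("germline", "ref.fasta", "sample.bam", "sample.vcf.gz")))
```

Returning an argv list rather than a shell string keeps the sketch easy to inspect and avoids quoting problems if it were ever passed to `subprocess.run`.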

This PR also updates the conda environment to include the packages needed for variant calling. The lcdb-wf docs are updated with a comprehensive overview of the workflow and a description of the configuration options users can adjust to tailor the workflow to their analysis. The workflow is not organism-specific, and the docs explain how to call variants on non-human organisms. References can be provided to the workflow externally, but this PR also expands the existing lcdb-wf references workflow to automatically include the new reference types needed for variant calling.

The VCF annotation portion of the workflow supports attaching annotations from databases like dbNSFP using SnpEff.
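To give a sense of what the annotated output looks like downstream, the sketch below parses the `ANN` INFO field that SnpEff writes, using the 16 pipe-separated subfields of the standard `ANN` format. The helper name `parse_ann`, the shortened key names, and the example record are made up for illustration; database annotations such as those from dbNSFP would appear as additional INFO keys and are not handled here.

```python
# Subfield order of the SnpEff ANN format (key names shortened for this sketch).
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_pos", "cds_pos", "aa_pos",
    "distance", "messages",
]


def parse_ann(info: str) -> list[dict[str, str]]:
    """Return one dict per ANN entry found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            # Multiple annotations for one variant are comma-separated.
            records = entry[len("ANN="):].split(",")
            return [dict(zip(ANN_FIELDS, rec.split("|"))) for rec in records]
    return []  # no ANN key present


if __name__ == "__main__":
    # Made-up INFO string, not taken from the test data.
    info = (
        "DP=35;ANN=A|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|"
        "protein_coding|2/5|c.100G>A|p.Val34Ile|100/900|100/600|34/199||"
    )
    for ann in parse_ann(info):
        print(ann["gene_name"], ann["annotation"], ann["impact"])
```

For real analyses a dedicated VCF library is likely a better choice than string splitting; the point here is only to show the structure of the field.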

The workflow also runs MultiQC to aggregate QC checks on the input FASTQ data, variant calling metrics, and annotation summary files.

Test data for variant calling have been generated and are hosted at https://github.com/lcdb/lcdb-wf-variant-calling-test-data. CircleCI runs the workflow on these test data to check the conda environments and workflow execution whenever the workflow changes. This guards against deprecations and against bugs introduced by future updates.
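One kind of check such a test run could make on its outputs is a structural sanity check of the resulting VCF. The sketch below is an assumption about what that might look like, not a description of the CI configuration in this PR; `check_vcf_text` is a hypothetical helper that only verifies the `##fileformat` line, the `#CHROM` header, and the eight fixed columns required by the VCF specification.

```python
def check_vcf_text(text: str) -> int:
    """Validate minimal VCF structure and return the number of variant records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The VCF specification requires ##fileformat as the first line.
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        raise ValueError("missing ##fileformat header line")
    header_idx = next(
        (i for i, ln in enumerate(lines) if ln.startswith("#CHROM")), None
    )
    if header_idx is None:
        raise ValueError("missing #CHROM column header line")
    records = lines[header_idx + 1:]
    for rec in records:
        # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO are mandatory columns.
        if len(rec.split("\t")) < 8:
            raise ValueError(f"record has fewer than 8 columns: {rec!r}")
    return len(records)


if __name__ == "__main__":
    # Tiny made-up VCF, not taken from the hosted test data.
    example = (
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        "chr1\t100\t.\tG\tA\t50\tPASS\tDP=35\n"
    )
    print(check_vcf_text(example))
```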