griffithlab / epics

The project management repo of the Griffith Lab
Create end-to-end WGS pipeline #11

Open ahwagner opened 6 years ago

ahwagner commented 6 years ago

We will develop a comprehensive, ‘end-to-end’ WGS analysis pipeline that accepts raw sequence data and generates aligned sequence reads, variant call files, and text summaries of sequence data and analysis QC metrics. This workflow will be completely automated using publicly available workflow management tools (https://www.commonwl.org/ and https://software.broadinstitute.org/wdl/), and will perform the following steps:

1) Sequence data alignment and QC. Raw sequence data will be aligned to the GRCh38 human reference assembly and processed to mark duplicates (Picard), perform base quality recalibration (GATK), and screen for contamination (VerifyBamID2). This workflow will conform to the “functional equivalency” standard established by MGI and the NHGRI Genome Sequencing Program Data Working Group34.

2) Germline variant identification and QC. Germline variants will be called with a suite of dedicated tools. Specifically, SNVs and small indels will be detected with GATK (version 4) using documented best practices. Structural variation (>50 bp) will be identified using multiple approaches, including both discordant read-pair and split-read analysis. SNV calls will be assessed for transition/transversion ratio and screened for cross-contamination.

3) Variant annotation. Identified variants will be annotated with gene and genomic context, population allele frequency, clinical annotations included in ClinVar, and clinical relevance using the CIViC API.

4) Result reporting. WGS results and variant calls will be assembled into a report that contains QC information and variant calls in a tiered format for review and further analysis by clinical investigators.

All analysis pipelines will be specified in a formal workflow language (CWL or WDL), and required software will be assembled into analysis containers (e.g., Docker). The specifications, descriptions, and source code for these will be provided under open-source licenses via public code repositories (on GitHub). Importantly, this means that other centers will be able to rapidly establish identical versions of our validated pipelines in their own clinical sequencing laboratories.
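As an illustration of the SNV QC check in step 2, here is a minimal sketch of how a transition/transversion (Ts/Tv) ratio could be computed from a list of (REF, ALT) allele pairs. It is not the pipeline's actual implementation: the function names `is_transition` and `ts_tv_ratio` are hypothetical, and the input is assumed to have already been extracted from a VCF. A transition is a purine-to-purine or pyrimidine-to-pyrimidine change; any other single-base substitution is a transversion.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Compute Ts/Tv over (REF, ALT) pairs (hypothetical helper).

    Records that are not single-base substitutions between A, C, G, T
    (for example indels) are skipped rather than counted.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-base substitution; ignore
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ts/Tv is undefined")
    return ts / tv


# Toy input: three transitions, one transversion, one deletion (skipped).
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
ratio = ts_tv_ratio(calls)
```

On real human whole-genome call sets the ratio is typically near 2, so a value far from that can flag a problem with the SNV calls; the exact acceptance threshold would be a pipeline-specific choice that the text above does not specify.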