human-pangenomics / hpp_production_workflows

WDLs and Dockerfiles for the assembly QC process
MIT License

HPP Production Workflows

This repository holds the WDL workflows and Docker build scripts that the Human Pangenome Reference Consortium uses in production for data QC, assembly generation, and assembly QC.

All WDLs and containers created in this repository are licensed under the MIT license. The underlying tools (that the WDLs and containers run) are likely covered by one or more Free and Open Source Software licenses, but we cannot guarantee this.


Repository Organization

Workflows are split across the data_processing, assembly, and (assembly) QC folders, each with the following folder structure:

 ── docker/
    └── toolName/
        ├── Dockerfile
        ├── Makefile
        └── scripts/
            └── toolName/
                └── scriptName.py
 ── wdl/
    ├── tasks/
    │   └── taskName.wdl
    └── workflows/
        └── workFlowName.wdl

The root level of each of the data_processing, assembly, and (assembly) QC folders contains a README that provides details about the workflows and how to use them. Summaries of the workflows in each area are below.
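Given the layout above, the workflow WDLs in a local checkout can be found by path alone. The following is a minimal sketch, not part of the repository: the function name `list_workflows` is hypothetical, and it assumes only that workflows live under a `wdl/workflows/` directory as shown in the folder structure.

```shell
# Hypothetical helper (not shipped with the repository).
# Lists every workflow WDL under a checkout, relying only on the
# <area>/wdl/workflows/ layout described in this README.
list_workflows() {
    repo_root="${1:-.}"   # default to the current directory
    find "$repo_root" -type f -path '*/wdl/workflows/*.wdl' | sort
}

# Example: list_workflows /path/to/hpp_production_workflows
```

Task files under `wdl/tasks/` are deliberately excluded, since they are imported by workflows rather than run on their own.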


Workflow Types

Data Processing

The HPRC produces HiFi, ONT, and Illumina Hi-C data. Each data type has a workflow to check data files to ensure they pass QC.

Assembly

Assemblies are produced with one of two Hifiasm workflows. Both use HiFi and ONT ultralong reads; the Hi-C workflow phases with Illumina Hi-C data, and the trio workflow phases with parental Illumina data. The major steps included in the assembly workflows are:

In addition to the Hifiasm workflows there is an assembly cleanup workflow which:

Polishing

Assemblies are polished using a custom pipeline built around DeepPolisher. The polishing workflow WDL can be found at polishing/wdl/workflows/hprc_DeepPolisher.wdl. The major steps in the HPRC assembly polishing pipeline are:

QC

Automated Assembly QC

Assembly QC is broken down into two types, standard_qc and alignment_based_qc:

The following tools are included in the standard_qc pipeline:

The following tools are included in the alignment_based_qc pipeline:


Running WDLs

If you haven't run a WDL before, there are good resources online to get started. You first need to choose a way to run WDLs. Below are a few options:

Running with Cromwell

Before starting, read the Cromwell 5 minute intro.

Once you've done that, download the latest version of Cromwell and make it executable (replace XY with the newest version number):

wget https://github.com/broadinstitute/cromwell/releases/download/XY/cromwell-XY.jar
chmod +x cromwell-XY.jar
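
To avoid editing the version number in two places, the download URL can be built from a single value. This is a rough sketch: the function name `cromwell_release_url` is hypothetical, and it assumes the release tag and the jar file name carry the same version number, which is what the URL pattern above suggests.

```shell
# Hypothetical helper: build the release download URL for a Cromwell jar.
# $1: release number (for example 86)
# $2: artifact name, "cromwell" (default) or "womtool"
cromwell_release_url() {
    version="$1"
    artifact="${2:-cromwell}"
    printf 'https://github.com/broadinstitute/cromwell/releases/download/%s/%s-%s.jar\n' \
        "$version" "$artifact" "$version"
}

# Example (performs a network download, so it is left commented out):
#   wget "$(cromwell_release_url 86)"
```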

And run your WDL:

java -jar cromwell-XY.jar run \
   /path/to/my_workflow.wdl \
   -i my_workflow_inputs.json \
   > run_log.txt

Input files

Each workflow requires an input JSON. You can create a template using womtool, which is released alongside Cromwell:

java -jar womtool-XY.jar \
    inputs \
    /path/to/my_workflow.wdl \
    > my_workflow_inputs.json
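
Before launching a long run, it can help to confirm that the WDL and the filled-in inputs file are in place. The check below is a sketch under stated assumptions rather than part of the repository: `preflight` is a hypothetical name, and the JSON syntax check runs only if `python3` happens to be installed (it uses the standard `json.tool` module).

```shell
# Hypothetical pre-flight check before calling Cromwell.
# $1: path to the workflow WDL, $2: path to the inputs JSON
preflight() {
    wdl="$1"
    inputs="$2"
    # Both files must exist and be non-empty.
    [ -s "$wdl" ] || { echo "missing or empty WDL: $wdl" >&2; return 1; }
    [ -s "$inputs" ] || { echo "missing or empty inputs JSON: $inputs" >&2; return 1; }
    # Optional syntax check; skipped when python3 is not available.
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$inputs" >/dev/null 2>&1 \
            || { echo "inputs file is not valid JSON: $inputs" >&2; return 1; }
    fi
    return 0
}

# Example:
#   preflight /path/to/my_workflow.wdl my_workflow_inputs.json \
#       && java -jar cromwell-XY.jar run /path/to/my_workflow.wdl -i my_workflow_inputs.json
```

This only catches missing files and malformed JSON; it does not check that the inputs match what the workflow declares.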