microbiomedata / metaT

Metatranscriptomics workflow

metaT: The Metatranscriptome Workflow

Summary

This workflow is designed to analyze metatranscriptomes.


Version

0.0.3

Third party tools and packages

To run this workflow you will need a Docker instance (Docker ≥ v2.1.0.3) and Cromwell. All third-party tools are pulled from Docker Hub.

cromwell ≥ 54
bbtools ≥ v38.94
Python ≥ v3.7.6
featureCounts ≥ v2.0.2
R ≥ v3.6.0
edgeR ≥ v3.28.1 (R package)
pandas ≥ v1.0.5 (python package)
gffutils ≥ v0.10.1 (python package)

Databases

metaT uses the same databases that are used for metagenome annotation. See the README here for the required databases. For QC databases, see here.

Running workflow

On a server with Shifter

The submit script requests a node and launches Cromwell. Cromwell manages the workflow and uses Shifter to run the applications.

java -Dconfig.file=wdls/shifter.conf -jar /full/path/to/cromwell-XX.jar run -i input.json /full/path/to/wdls/metaT.wdl
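
When the launch is scripted, the command above can be assembled from its parts rather than typed by hand. The following is a minimal Python sketch, not part of the workflow itself; the helper name `cromwell_command` is hypothetical, and the Cromwell jar path is left to the caller because the version in the command above is elided.

```python
from pathlib import Path


def cromwell_command(cromwell_jar, wdl_dir, inputs_json):
    """Build the argument list for the Cromwell launch shown above.

    cromwell_jar : path to the Cromwell jar (version not assumed here)
    wdl_dir      : folder holding shifter.conf and metaT.wdl
    inputs_json  : path to the JSON inputs file
    """
    wdl_dir = Path(wdl_dir)
    return [
        "java",
        f"-Dconfig.file={wdl_dir / 'shifter.conf'}",
        "-jar", str(cromwell_jar),
        "run",
        "-i", str(inputs_json),
        str(wdl_dir / "metaT.wdl"),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with long paths.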

Docker images

Inputs

{
    "nmdc_metat.proj": "gold:Ga0370541",
    "nmdc_metat.input_file": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT/test_data/small_test/test_smaller_interleave.fastq.gz",
    "nmdc_metat.git_url": "https://github.com/microbiomedata/mg_annotation/releases/tag/0.1",
    "nmdc_metat.url_base": "https://data.microbiomedata.org/data/",
    "nmdc_metat.outdir": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT/test_data/test_small_out",
    "nmdc_metat.resource": "NERSC - Cori",
    "nmdc_metat.url_root": "https://data.microbiomedata.org/data/",
    "nmdc_metat.database": "/global/cfs/cdirs/m3408/aim2/database/",
    "nmdc_metat.activity_id": "test-activity-id",
    "nmdc_metat.threads": 64,
    "nmdc_metat.metat_folder": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT"
}
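
Before submitting a run, it can help to sanity-check the inputs file. The sketch below is illustrative only: the helper names `check_inputs` and `check_inputs_file` are hypothetical, and treating every key from the example above as required is an assumption, since the workflow may declare defaults for some of them.

```python
import json

# Keys copied from the example inputs; assuming all are required is a guess.
EXPECTED_KEYS = {
    "nmdc_metat.proj",
    "nmdc_metat.input_file",
    "nmdc_metat.git_url",
    "nmdc_metat.url_base",
    "nmdc_metat.outdir",
    "nmdc_metat.resource",
    "nmdc_metat.url_root",
    "nmdc_metat.database",
    "nmdc_metat.activity_id",
    "nmdc_metat.threads",
    "nmdc_metat.metat_folder",
}


def check_inputs(inputs):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - set(inputs))]

    threads = inputs.get("nmdc_metat.threads")
    if threads is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
            problems.append("nmdc_metat.threads should be a positive integer")

    # A URL value containing whitespace is almost certainly a typo.
    for key, value in inputs.items():
        if isinstance(value, str) and value.startswith("http"):
            if any(ch.isspace() for ch in value):
                problems.append(f"whitespace in URL for {key}")
    return problems


def check_inputs_file(path):
    """Load a JSON inputs file and run check_inputs on it."""
    with open(path) as handle:
        return check_inputs(json.load(handle))
```

Calling `check_inputs_file("input.json")` returns the list of flagged problems, so a wrapper script can stop before requesting a node.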

Input option descriptions:

Outputs

All outputs are written to the outdir folder, which contains the following subfolders:

Output JSON

The output is a JSON-formatted file called out.json whose records contain RPKMs, read counts, and annotation information. An example JSON record:

        {
            "read_count": 2,
            "rpkm": 750750.751,
            "featuretype": "CDS",
            "seqid": "contig_3",
            "id": "contig_3_126_347",
            "source": "GeneMark.hmm_2 v1.05",
            "start": 126,
            "end": 347,
            "length": 222,
            "strand": "+",
            "frame": "0",
            "extra": [],
            "product": "hypothetical protein"
        }
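
The fields in this record can be cross-checked against each other. The sketch below uses the standard RPKM definition (reads per kilobase of feature per million mapped reads); this README does not state the exact formula the workflow applies, so that is an assumption. The total of 12 mapped reads is also not given anywhere in the record: it is simply the value that makes the example numbers agree, which would fit a very small test dataset.

```python
def feature_length(start, end):
    """Length in bp of a feature with 1-based, inclusive coordinates."""
    return end - start + 1


def rpkm(read_count, length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Values taken from the example record above.
record = {"read_count": 2, "start": 126, "end": 347, "length": 222}

length = feature_length(record["start"], record["end"])  # 347 - 126 + 1 = 222
assert length == record["length"]

# total_mapped_reads=12 is inferred, not reported in the record.
value = rpkm(record["read_count"], length, total_mapped_reads=12)
print(round(value, 3))  # 750750.751
```

Because RPKM divides by the total number of mapped reads, very small test inputs produce very large values like the one shown.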

Test

To test the workflow, we provide a small test dataset and a step-by-step guide. See the test_data folder.