# FMAP

Functional Mapping and Analysis Pipeline for metagenomics and metatranscriptomics studies

Some example results are available at the homepage: https://qbrc.swmed.edu/FMAP/.

## Features

## Requirements

## Command usages

Options: `-h` display this help message; `-s` switch database; `-r` redownload data


* **FMAP_prepare.pl**

  * Usage: `perl FMAP_prepare.pl [options]`
  * Options
    * `-h`: display this help message
    * `-r`: redownload data
    * `-m FILE`: executable file path of mapping program, "diamond" or "usearch" [diamond]
    * `-k`: download prebuilt KEGG files


* **FMAP_assembly.pl**
  * Process
    ![](FMAP_assembly.process.png)
    * Read mapping: nucleotide sequence alignment using [BWA](http://bio-bwa.sourceforge.net)
    * ORF mapping: protein sequence alignment using [DIAMOND](http://ab.inf.uni-tuebingen.de/software/diamond/)
  * Input
    1. Prefix of output files
    2. De novo assembled sequences in FASTA format
       * A FASTA file can be generated by metagenome assemblers such as [SPAdes](http://cab.spbu.ru/software/spades/) and [MetaVelvet](http://metavelvet.dna.bio.keio.ac.jp).
       * A FASTA file containing target genome sequences can be input instead.
    3. Whole metagenomic/metatranscriptomic shotgun sequencing reads in FASTQ or FASTA format
       * Multiple read files can be specified.
       * Paired-end read files must be given as a comma-separated pair, e.g. "input.R1.fastq,input.R2.fastq".
       * Read files may be gzip-compressed.
  * Output
    1. Prefix.region.abundance.txt (abundances of ORF regions mapping to KEGG orthologies)
    2. Prefix.abundance.txt (abundances of KEGG orthologies)

  * Usage: `perl FMAP_assembly.pl [options] output.prefix assembly.fasta [input.fastq|input.R1.fastq,input.R2.fastq [...]] > summary.txt`
  * Options
    * `-h`: display this help message
    * `-A STR`: prepared assembly prefix
    * `-B`: input indexed sorted BAM file instead of FASTQ file
    * `-m FILE`: executable file path of mapping program, "diamond" or "usearch" [diamond]
    * `-p INT`: number of threads [1]
    * `-e FLOAT`: maximum e-value to report alignments [10]
    * `-t DIR`: directory for temporary files [$TMPDIR or /tmp]
    * `-a FLOAT`: search acceleration for ublast [0.5]
    * `-C STR`: codon and translation e.g. ATG=M [NCBI genetic code 11 (Bacterial, Archaeal and Plant Plastid)]
    * `-S STR`: comma-separated start codons [GTG,ATG,CTG,TTG,ATA,ATC,ATT]
    * `-T STR`: comma-separated termination codons [TAG,TAA,TGA]
    * `-l INT`: minimum translation length [10]
    * `-c FLOAT`: minimum coverage [0.8]
    * `-q INT`: minimum mapping quality [0]
    * `-s STR`: strand specificity, "f" or "r"
    * `-P STR`: contig prefix used for abundance estimation
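
The input rules for FMAP_assembly.pl (several read files, paired-end files joined by a comma) can be sketched as a small command builder. This is a hypothetical Python helper, not part of FMAP: the function name `build_assembly_command` and the file names are illustrative assumptions, and it uses only the `-p` option and the positional arguments shown in the usage line.

```python
from typing import List, Sequence, Tuple, Union

# A read input is either one file (single-end) or an (R1, R2) pair (paired-end).
ReadInput = Union[str, Tuple[str, str]]


def build_assembly_command(prefix: str, assembly: str,
                           reads: Sequence[ReadInput],
                           threads: int = 1) -> List[str]:
    """Build an argument list for FMAP_assembly.pl (hypothetical helper)."""
    cmd = ["perl", "FMAP_assembly.pl", "-p", str(threads), prefix, assembly]
    for item in reads:
        if isinstance(item, str):
            cmd.append(item)            # single-end file, passed as-is
        else:
            cmd.append(",".join(item))  # paired-end files, comma-separated
    return cmd


# Illustrative file names; gzip-compressed reads are allowed per the README.
example = build_assembly_command(
    "sample1", "assembly.fasta",
    [("input.R1.fastq.gz", "input.R2.fastq.gz"), "extra.fastq"],
    threads=4,
)
print(" ".join(example))
```

The returned list can be passed to `subprocess.run` with standard output redirected to a summary file, which mirrors the `> summary.txt` part of the usage line.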


* **FMAP_assembly_centrifuge.pl**
  * Requires [Centrifuge](https://ccb.jhu.edu/software/centrifuge/).
  * Input
    1. FMAP_assembly.region.txt (ORF regions mapping to KEGG orthologies generated by FMAP_assembly)
    2. De novo assembled sequences in FASTA format
    3. Centrifuge index filename prefix (minus trailing .X.cf)
  * Output: FMAP_assembly.region.taxon.txt (FMAP_assembly.region.txt with an added column of integer [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) IDs)

  * Usage: `perl FMAP_assembly_taxon.pl [options] FMAP_assembly.region.txt assembly.fasta centrifuge.index`
  * Options
    * `-h`: display this help message
    * `-p INT`: number of threads [1]


* **FMAP_assembly_heatmap.pl**
  * Requires [Bio::DB::Taxonomy](http://search.cpan.org/dist/BioPerl/Bio/DB/Taxonomy.pm).
  * Input: FMAP_assembly.abundance.txt (abundances generated by FMAP_assembly)
  * Output: abundance heatmap table in HTML format

  * Usage: `perl FMAP_assembly_heatmap.pl [options] [name=]FMAP_assembly.abundance.txt [...] > FMAP_assembly_heatmap.html`
  * Options
    * `-h`: display this help message
    * `-c FILE`: comparison output file including orthology and filter columns
    * `-f INT`: HTML font size
    * `-w INT`: HTML table cell width


* **FMAP_assembly_operon.pl**
  * Input: FMAP_assembly.region.txt (ORF regions mapping to KEGG orthologies generated by FMAP_assembly)
  * Output: FMAP_assembly_operon.txt ([ODB (v3)](http://operondb.jp) known operons consisting of orthologies located together on an assembled contig/scaffold/transcript)

  * Usage: `perl FMAP_assembly_operon.pl [options] FMAP_assembly.region.txt > FMAP_assembly_operon.txt`
  * Options
    * `-h`: display this help message
    * `-a`: print single-gene operons as well


* **FMAP_download_genome.pl**
  * Input: [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) IDs (integer)
  * Output: FASTA file containing genome sequences
  * Requires [XML::LibXML](http://search.cpan.org/dist/XML-LibXML/LibXML.pod).

  * Usage: `perl FMAP_download_genome.pl [options] NCBI_TaxID [...] > genome.fasta`
  * Options
    * `-h`: display this help message
    * `-a`: assembly instead of genome


* **FMAP_download.pl**

  * Usage: `perl FMAP_download.pl [options]`
  * Options
    * `-h`: display this help message
    * `-m FILE`: executable file path of mapping program, "diamond" or "usearch" [diamond]
    * `-k`: download prebuilt KEGG files
    * `-x`: download only KEGG files


* **FMAP_mapping.pl**
  * Input: whole metagenomic (or metatranscriptomic) shotgun sequencing reads in FASTQ or FASTA format
  * Output: best-match hits in NCBI BLAST -m8 (= NCBI BLAST+ -outfmt 6) format

  * Usage: `perl FMAP_mapping.pl [options] input1.fastq|input1.fasta [input2.fastq|input2.fasta [...]] > blastx_hits.txt`
  * Options
    * `-h`: display this help message
    * `-m FILE`: executable file path of mapping program, "diamond" or "usearch" [diamond]
    * `-p INT`: number of threads [1]
    * `-e FLOAT`: maximum e-value to report alignments [10]
    * `-t DIR`: directory for temporary files [$TMPDIR or /tmp]
    * `-a FLOAT`: search acceleration for ublast [0.5]
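
Because FMAP_mapping.pl writes hits in BLAST tabular format, its output can be inspected with a few lines of code. The sketch below is not part of FMAP. It assumes the standard 12 tab-separated columns of BLAST+ `-outfmt 6` (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The names `Hit`, `parse_blast_tabular` and `filter_by_identity`, and the sample rows, are illustrative.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity (column 3)
    length: int      # alignment length (column 4)
    evalue: float    # column 11
    bitscore: float  # column 12


def parse_blast_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per non-empty, non-comment line of 12-column output."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))


def filter_by_identity(hits: Iterable[Hit], minimum: float = 80.0) -> List[Hit]:
    """Keep hits whose percent identity is at least `minimum`."""
    return [h for h in hits if h.identity >= minimum]


# Made-up rows for illustration only.
rows = [
    "read1\tprotA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t100.0\n",
    "read2\tprotB\t70.0\t40\t12\t0\t1\t120\t5\t44\t1e-5\t50.0\n",
]
for hit in filter_by_identity(parse_blast_tabular(rows)):
    print(hit.query, hit.subject, hit.identity)
```

The default of 80 mirrors the `-i` minimum percent identity default listed for FMAP_quantification.pl; whether FMAP applies the threshold in exactly this way is an assumption.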


* **FMAP_quantification.pl**
  * Input: output of "FMAP_mapping.pl"
  * Output: abundances (RPKM) of KEGG orthologies
  * Output columns: [KEGG Orthology](http://www.genome.jp/kegg/ko.html) ID, orthology definition, abundance (RPKM)

  * Usage: `perl FMAP_quantification.pl [options] blast_hits1.txt [blast_hits2.txt [...]] > abundance.txt`
  * Options
    * `-h`: display this help message
    * `-c`: use CPM values instead of RPKM values
    * `-i FLOAT`: minimum percent identity [80]
    * `-l FILE`: tab-delimited text file with the first column having protein names and the second column having the sequence lengths
    * `-o FILE`: tab-delimited text file with the first column having protein names and the second column having the orthology names
    * `-d FILE`: tab-delimited text file with the first column having orthology names and the second column having the definitions
    * `-w FILE`: tab-delimited text file with the first column having read names and the second column having the weights
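
For readers unfamiliar with the two abundance units, the following sketch shows the textbook definitions of RPKM and CPM. It is an illustration, not FMAP's code. The README does not say how FMAP converts protein lengths to the length term or how read weights enter the count, so the function names and the length-in-bases convention here are assumptions.

```python
def rpkm(read_count: float, length_bases: float, total_reads: float) -> float:
    """Reads per kilobase of feature per million mapped reads.

    Assumes `length_bases` is the feature length in nucleotides.
    """
    return read_count * 1e9 / (length_bases * total_reads)


def cpm(read_count: float, total_reads: float) -> float:
    """Counts per million mapped reads (no length normalization)."""
    return read_count * 1e6 / total_reads


# Illustrative numbers: 10 reads on a 1,000-base feature, 1,000,000 mapped reads.
print(rpkm(10, 1000, 1_000_000))
print(cpm(10, 1_000_000))
```

RPKM divides by feature length, so long and short orthologies become comparable within a sample. CPM (the `-c` option) skips that step.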


* **FMAP_table.pl**
  * Input: outputs of "FMAP_quantification.pl"
  * Output: abundance table
  * Output columns: [KEGG Orthology](http://www.genome.jp/kegg/ko.html) ID, orthology definition, abundance of sample1, abundance of sample2, ...

  * Usage: `perl FMAP_table.pl [options] [name1=]abundance1.txt [[name2=]abundance2.txt [...]] > abundance_table.txt`
  * Options
    * `-h`: display this help message
    * `-c`: use raw read counts (readCount|count) instead of RPKM values
    * `-d`: use normalized mean depths (meanDepth/genome) instead of RPKM values
    * `-f`: use fractions
    * `-n`: do not print definitions
    * `-r`: print ORF regions
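
Conceptually, the table step joins per-sample abundances on the orthology ID. A minimal sketch of that join follows, assuming that an orthology absent from a sample is reported as 0 and leaving out the definition column (as `-n` would). The function name `merge_abundances` and the values are illustrative, and FMAP's actual handling of missing entries may differ.

```python
from typing import Dict, List


def merge_abundances(samples: Dict[str, Dict[str, float]]) -> List[List[str]]:
    """Merge {sample: {orthology: abundance}} into header + rows.

    Assumption: an orthology missing from a sample gets abundance 0.0.
    """
    names = list(samples)
    orthologies = sorted(set().union(*samples.values()))
    table = [["orthology"] + names]
    for ko in orthologies:
        table.append([ko] + [str(samples[n].get(ko, 0.0)) for n in names])
    return table


# Made-up abundances for two samples.
table = merge_abundances({
    "sample1": {"K00001": 1.5},
    "sample2": {"K00001": 2.0, "K00002": 3.0},
})
for row in table:
    print("\t".join(row))
```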


* **FMAP_comparison.pl**
  * Input: output of "FMAP_table.pl", sample group information
  * Output: comparison test statistics for orthologies
  * Output columns: [KEGG Orthology](http://www.genome.jp/kegg/ko.html) ID, orthology definition, log2 fold change, p-value, FDR-adjusted p-value, filter (pass or fail)

  * Usage: `perl FMAP_comparison.pl [options] abundance_table.txt control1[,control2[...]] case1[,case2[...]] [...] > orthology_test_stat.txt`
  * Options
    * `-h`: display this help message
    * `-t STR`: statistical test for comparing sample groups, "kruskal", "anova", "poisson", "quasipoisson", "metagenomeSeq" [kruskal]
    * `-f FLOAT`: fold change cutoff [2]
    * `-p FLOAT`: p-value cutoff [0.05]
    * `-a FLOAT`: FDR-adjusted p-value cutoff [1]
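
The output columns combine a log2 fold change, a p-value, an FDR-adjusted p-value and a pass/fail filter. The sketch below illustrates how such values relate, using the Benjamini-Hochberg procedure for FDR adjustment and the default cutoffs listed above. It is not FMAP's implementation. The pseudocount, the choice of Benjamini-Hochberg, and the exact filter rule are assumptions, and the p-values themselves would come from the selected statistical test.

```python
import math
from typing import List, Sequence


def log2_fold_change(control: Sequence[float], case: Sequence[float],
                     pseudocount: float = 1.0) -> float:
    """log2(mean(case) / mean(control)); the pseudocount (an assumption) avoids log(0)."""
    mean_control = sum(control) / len(control)
    mean_case = sum(case) / len(case)
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def passes_filter(log2fc: float, p: float, fdr: float,
                  fold_cutoff: float = 2.0, p_cutoff: float = 0.05,
                  fdr_cutoff: float = 1.0) -> bool:
    """Hypothetical filter rule combining the three cutoffs."""
    return (abs(log2fc) >= math.log2(fold_cutoff)
            and p <= p_cutoff and fdr <= fdr_cutoff)


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
print(benjamini_hochberg(pvals))
print(passes_filter(log2_fold_change([1, 1], [3, 3]), 0.01, 0.02))
```

With the default FDR cutoff of 1, the adjusted p-value never excludes an orthology, so the fold change and raw p-value cutoffs decide the filter unless `-a` is lowered.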


* **FMAP_pathway.pl**
  * Input: output of "FMAP_comparison.pl"
  * Output: pathways enriched in filter-passed orthologies
  * Output columns: [KEGG Pathway](http://www.genome.jp/kegg/pathway.html) ID, pathway definition, orthology count, coverage, p-value, [KEGG Orthology](http://www.genome.jp/kegg/ko.html) IDs with colors
  * [KEGG Orthology](http://www.genome.jp/kegg/ko.html) IDs with colors: input of [KEGG Pathway](http://www.genome.jp/kegg/pathway.html) mapping (http://www.kegg.jp/kegg/tool/map_pathway2.html)

  * Usage: `perl FMAP_pathway.pl [options] orthology_test_stat.txt > pathway.txt`
  * Options
    * `-h`: display this help message
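
The README does not state which enrichment test FMAP_pathway.pl uses. As one common way to test whether a pathway is enriched in a selected set of orthologies, the sketch below computes a one-sided hypergeometric p-value. The function name and the numbers are illustrative, and this should be read as a generic example of enrichment testing rather than a description of FMAP's method.

```python
from math import comb


def hypergeom_enrichment_pvalue(total: int, pathway_size: int,
                                selected: int, overlap: int) -> float:
    """P(X >= overlap) when `selected` orthologies are drawn without
    replacement from `total`, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    numerator = sum(
        comb(pathway_size, k) * comb(total - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / comb(total, selected)


# Made-up numbers: 1,000 orthologies in total, 40 in the pathway,
# 50 filter-passed orthologies, 8 of them in the pathway.
print(hypergeom_enrichment_pvalue(1000, 40, 50, 8))
```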


* **FMAP_module.pl**
  * Input: output of "FMAP_comparison.pl"
  * Output: modules enriched in filter-passed orthologies
  * Output columns: [KEGG Module](http://www.genome.jp/kegg/module.html) ID, module definition, orthology count, coverage, p-value, [KEGG Orthology](http://www.genome.jp/kegg/ko.html) IDs with colors
  * [KEGG Orthology](http://www.genome.jp/kegg/ko.html) IDs with colors: input of [KEGG Pathway](http://www.genome.jp/kegg/pathway.html) mapping (http://www.kegg.jp/kegg/tool/map_pathway2.html)

  * Usage: `perl FMAP_module.pl [options] orthology_test_stat.txt > module.txt`
  * Options
    * `-h`: display this help message


* **FMAP_operon.pl**
  * Input: output of "FMAP_comparison.pl"
  * Output: operons consisting of filter-passed orthologies
  * Output columns: [ODB (v3)](http://operondb.jp) known operon IDs, operon definition, log2 fold change, [KEGG Orthology](http://www.genome.jp/kegg/ko.html) IDs, [KEGG Pathway](http://www.genome.jp/kegg/pathway.html) IDs

  * Usage: `perl FMAP_operon.pl [options] orthology_test_stat.txt > operon.txt`
  * Options
    * `-h`: display this help message
    * `-a`: print single-gene operons as well


* **FMAP_plot.pl**
  * Input: output of "FMAP_pathway.pl", "FMAP_module.pl", or "FMAP_operon.pl"
  * Output: PNG format image file of p-value plot

  * Usage: `perl FMAP_plot.pl [options] pathway.txt|module.txt|operon.txt plot.pdf`
  * Options
    * `-h`: display this help message
    * `-w INT`: plot width [12]
    * `-h INT`: plot height [8]
    * `-l FLOAT`: plot left margin [20]
    * `-p FLOAT`: p-value cutoff [0.05]
    * `-c FLOAT`: coverage cutoff [0 for pathway, 1 for module and operons]
    * `-d`: do not print definition


* **FMAP_all.pl**
  * Input: configuration table file
  * Input columns: group (control, ...), sample name, input file of "FMAP_mapping.pl"
  * Output: script file including all FMAP commands, all FMAP outputs

  * Usage: `perl FMAP_all.pl [options] input.config [output_prefix]`
  * Options
    * `-h`: display this help message
    * `-s`: generate a script, but not execute it
    * `-m FILE`: mapping: executable file path of mapping program, "diamond" or "usearch" [diamond]
    * `-t INT`: mapping: number of threads [1]
    * `-c STR`: comparison: statistical test for comparing sample groups, "kruskal", "anova", "poisson", "quasipoisson", "metagenomeSeq" [kruskal]
    * `-f FLOAT`: comparison: fold change cutoff [2]
    * `-p FLOAT`: comparison: p-value cutoff [0.05]
    * `-a FLOAT`: comparison: FDR-adjusted p-value cutoff [1]
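
The configuration table for FMAP_all.pl has three columns: group, sample name and input file. The sketch below writes such a table. It assumes a tab-delimited file without a header line and uses "control" and "case" as group labels; the README specifies only the column order, so the delimiter, the labels and the file names are assumptions.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_config(rows: Iterable[Tuple[str, str, str]], handle: TextIO) -> None:
    """Write (group, sample name, input file) rows as tab-delimited text.

    Assumption: FMAP_all.pl reads a tab-delimited table without a header.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for group, sample, path in rows:
        writer.writerow([group, sample, path])


# Illustrative samples and file names.
buffer = io.StringIO()
write_config(
    [
        ("control", "c1", "c1.fastq.gz"),
        ("control", "c2", "c2.fastq.gz"),
        ("case", "t1", "t1.fastq.gz"),
    ],
    buffer,
)
print(buffer.getvalue(), end="")
```

Writing to a real file instead of `io.StringIO` gives the `input.config` argument in the usage line. Running FMAP_all.pl with `-s` first generates the script without executing it, which is a cheap way to check that the table was read as intended.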



## Command orders

* Use the prebuilt database (UniRef90 and bacteria/archaea/fungi)
  1. FMAP_download.pl
  2. FMAP_mapping.pl
  3. FMAP_quantification.pl
  4. FMAP_table.pl
  5. FMAP_comparison.pl
  6. FMAP_pathway.pl
  7. FMAP_module.pl
  8. FMAP_operon.pl

* Use a custom database (you can define UniRef and taxonomy)
  1. FMAP_database.pl
  2. FMAP_prepare.pl
  3. FMAP_mapping.pl
  4. FMAP_quantification.pl
  5. FMAP_table.pl
  6. FMAP_comparison.pl
  7. FMAP_pathway.pl
  8. FMAP_module.pl
  9. FMAP_operon.pl
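
The prebuilt-database order above can be written out as concrete command lines. The sketch below only builds the command strings (a dry run) from the usage lines in this README. The function name `prebuilt_workflow`, the intermediate file names, and the assumption that the sample names given to FMAP_table.pl are the ones passed to FMAP_comparison.pl are all illustrative.

```python
import shlex
from typing import Dict, List, Sequence


def prebuilt_workflow(samples: Dict[str, str],
                      controls: Sequence[str],
                      cases: Sequence[str]) -> List[str]:
    """Return shell command strings for the prebuilt-database workflow.

    `samples` maps a sample name to its read file; nothing is executed.
    """
    q = shlex.quote
    commands = ["perl FMAP_download.pl"]
    table_args = []
    for name, reads in samples.items():
        hits = f"{name}.mapping.txt"         # hypothetical intermediate name
        abundance = f"{name}.abundance.txt"  # hypothetical intermediate name
        commands.append(f"perl FMAP_mapping.pl {q(reads)} > {q(hits)}")
        commands.append(f"perl FMAP_quantification.pl {q(hits)} > {q(abundance)}")
        table_args.append(q(f"{name}={abundance}"))
    commands.append("perl FMAP_table.pl " + " ".join(table_args)
                    + " > abundance_table.txt")
    commands.append("perl FMAP_comparison.pl abundance_table.txt "
                    + q(",".join(controls)) + " " + q(",".join(cases))
                    + " > orthology_test_stat.txt")
    for step in ("pathway", "module", "operon"):
        commands.append(f"perl FMAP_{step}.pl orthology_test_stat.txt > {step}.txt")
    return commands


for command in prebuilt_workflow(
        {"c1": "c1.fastq", "t1": "t1.fastq"}, ["c1"], ["t1"]):
    print(command)
```

For the custom-database order, the first command would be replaced by FMAP_database.pl followed by FMAP_prepare.pl; the remaining steps are the same.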

## Citation

Kim J, Kim MS, Koh AY, Xie Y, Zhan X.
"FMAP: Functional Mapping and Analysis Pipeline for metagenomics and metatranscriptomics studies"
BMC Bioinformatics. 2016 Oct 10;17(1):420.
PMID: [27724866](https://www.ncbi.nlm.nih.gov/pubmed/27724866)