qiyunzhu / woltka

Woltka: a versatile meta'omic data classifier
BSD 3-Clause "New" or "Revised" License
71 stars 25 forks source link
bioinformatics metagenomics microbiome

Woltka

License CI Status Coverage Status PyPI Conda (channel only)

Woltka is a versatile program for determining the structure and functional capacity of microbiomes. It mainly works with shotgun metagenomic data. It bridges first-pass sequence aligners with advanced analytical platforms (such as QIIME 2). It takes full advantage of, and is not limited by, the WoL reference database. Its scope and highlights are:

Woltka scope

Woltka ships with a QIIME 2 plugin. See here for instructions.

Contents

Overview

Where does Woltka fit in a pipeline

Woltka is a classifier. It fits in between sequence alignment and microbiome analyses.

What does Woltka do

Woltka processes alignments -- the mappings of microbiome sequencing data against reference sequences (such as genomes or genes), and infers the best placement of the queries in a hierarchical classification system. One query could have simultaneous matches in multiple references. Woltka finds the most suitable classification unit(s) to describe the query accordingly the criteria specified by the user. Woltka generates profiles (feature tables) -- the abundances of classification units which describe the structure or function of microbial communities.

What else does Woltka do

Woltka provides several utilities for handling feature tables, including normalizing data, collapsing a table to higher-level features, calculating feature group coverage, filtering features based on per-sample abundance, and merging tables.

What does Woltka not do

Woltka does NOT align sequences. You need to align your sequencing data (FastQ, etc.) against a reference database (we recommend WoL) using an aligner of your choice (e.g., Bowtie2). The resulting alignment files can be fed into Woltka.

Woltka does NOT analyze profiles. We recommend using QIIME 2 for robust downstream analyses of the profiles to decode the relationships among microbial communities and with their environments.

Flowchart of Woltka's main classification workflow:

Woltka process

Installation

Requirement: Python 3.6 or above.

pip install woltka

See more details about installation.

Example usage

Woltka provides several small test datasets under woltka/tests/data. To access them, download this GitHub repo, unzip, and navigate to this directory.

One can execute the following commands to make sure that Woltka functions correctly, and to get an impression of the basic usage of Woltka.

(Note: a more complete list of commands is provided here. Alternatively, you can skip this test dataset and check out the instruction for working with WoL.)

1. OGU (operational genomic unit) table generation (details):

woltka classify -i align/bowtie2 -o ogu.biom

The input path, align/bowtie2, is a directory containing five Bowtie2 alignment files (S01.sam.xz, S02.sam.xz,... S05.sam.xz) (SAM format, xzipped), each representing the mapping of metagenomic sequencing reads per sample against a reference genome database (here are guidlines for performing alignment).

The output file, table.biom, is a feature table in BIOM format, which can then be analyzed using various bioformatics programs such as QIIME 2.

2. Taxonomic profiling at the ranks of phylum, genus and species (details):

woltka classify \
  --input  align/bowtie2 \
  --map    taxonomy/taxid.map \
  --nodes  taxonomy/nodes.dmp \
  --names  taxonomy/names.dmp \
  --rank   phylum,genus,species \
  --output output_dir

The mapping file (taxid.map) translates genome IDs to taxonomy IDs, which then allow Woltka to classify query sequences based on the NCBI taxonomy (nodes.dmp and names.dmp).

The output directory (output_dir) will contain three feature tables: phylum.biom, genus.biom and species.biom, each representing a taxonomic profile at one of the three ranks.

3. Functional profiling by UniRef entries, then by GO molecular processes (details):

woltka classify \
  --input  align/bowtie2 \
  --coords function/coords.txt.xz \
  --map    function/uniref/uniref.map.xz \
  --map    function/go/process.tsv.xz \
  --rank   uniref,process \
  --output output_dir

Here, the input files are still read-to-genome alignments, rather than read-to-gene ones. Woltka matches reads with genes based on their coordinates on genomes using an efficient algorithm ("coord-match"). The gene coordinates are given by the database file coords.txt (see details). The read coordinates are extracted from the alignment files. This ensures consistency between structural and functional analyses.

Subsequently, Woltka is able to assign query sequences to functional units, as defined in mapping files (uniref.map and process.tsv). As you can see, compressed files are supported and auto-detected.

Similarly, the output files are two functional profiles: uniref.biom and process.biom.

4. Combined taxonomic/functional profiling by GO molecular processes of individual genera of organisms (details).

Two steps. First, perform taxonomic classification. The --outmap parameter writes a read-to-genus mapping file per sample to the directory genus_map/. The --name-as-id flag replaces NCBI TaxIDs with real taxon names in the output.

woltka classify \
  --input  align/bowtie2 \
  --map    taxonomy/taxid.map \
  --nodes  taxonomy/nodes.dmp \
  --names  taxonomy/names.dmp \
  --name-as-id \
  --rank   genus \
  --output genus.biom \
  --outmap genus_map

Second, perform functional classification. The --stratify parameter imports the genus mappings from the last analysis, and groups functional units (GO processes) by the genus of the source genome.

woltka classify \
  --input    align/bowtie2 \
  --stratify genus_map \
  --coords   function/coords.txt.xz \
  --map      function/uniref/uniref.map.xz \
  --map      function/go/process.tsv.xz \
  --rank     process \
  --output   genus_by_process.biom

In the output profile (see below), each feature is a combination of taxonomy and function. This "stratified" profile lets the researcher explore the functional capacities of individual microbial components.

Feature ID S01 S02 S03 S04 S05
Aeromonas|GO:0000917 4 20 3 0 7
Aeromonas|GO:0005975 0 12 5 2 0
Bacteroides|GO:0006260 105 0 0 0 0
Bacteroides|GO:0006281 10 6 2 0 3
Lactobacillus|GO:0045454 2 0 0 34 3
Lactobacillus|GO:0055085 0 0 7 0 0
...

Citation

The first paper describing Woltka was published at:

Note: This paper focuses on the OGU analysis. Although it does not discuss other functions of Woltka, it is so far the only citable paper if you use Woltka in your studies.

Contact

Please forward any questions to the project leader: Dr. Qiyun Zhu (qiyunzhu@gmail.com).