Woltka is a versatile program for determining the structure and functional capacity of microbiomes. It mainly works with shotgun metagenomic data. It bridges first-pass sequence aligners with advanced analytical platforms (such as QIIME 2). It takes full advantage of, and is not limited by, the WoL reference database. Its scope and highlights are:
Woltka ships with a QIIME 2 plugin. See here for instructions.
Woltka is a classifier. It fits in between sequence alignment and microbiome analyses.
Woltka processes alignments -- the mappings of microbiome sequencing data against reference sequences (such as genomes or genes) -- and infers the best placement of the queries in a hierarchical classification system. One query could have simultaneous matches in multiple references. Woltka finds the most suitable classification unit(s) to describe the query according to the criteria specified by the user. Woltka generates profiles (feature tables) -- the abundances of classification units which describe the structure or function of microbial communities.
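As a rough illustration of the idea above, the following Python sketch assigns each query to a classification unit at a chosen rank and tallies a profile. It is not Woltka's implementation: the toy lineage table, the function name `classify_reads`, and the rule used here (assign a query only when all of its matched references agree at the rank, otherwise count it as unassigned) are assumptions chosen for brevity.

```python
from collections import Counter

# Hypothetical toy hierarchy: reference genome -> {rank: classification unit}.
LINEAGES = {
    "G1": {"genus": "Bacteroides", "phylum": "Bacteroidota"},
    "G2": {"genus": "Bacteroides", "phylum": "Bacteroidota"},
    "G3": {"genus": "Prevotella", "phylum": "Bacteroidota"},
}


def classify_reads(matches, rank):
    """Tally queries per classification unit at the given rank.

    matches: dict mapping each query ID to the set of references it hit.
    A query is assigned only if all of its references share one unit at
    that rank (an assumed criterion); otherwise it is "Unassigned".
    """
    profile = Counter()
    for query, refs in matches.items():
        units = {LINEAGES[ref].get(rank) for ref in refs}
        unit = units.pop() if len(units) == 1 else None
        profile[unit or "Unassigned"] += 1
    return dict(profile)


matches = {
    "read1": {"G1"},        # unique hit
    "read2": {"G1", "G2"},  # two genomes, same genus
    "read3": {"G2", "G3"},  # two genera, same phylum
}
genus_profile = classify_reads(matches, "genus")
phylum_profile = classify_reads(matches, "phylum")
print(genus_profile)
print(phylum_profile)
```

In this toy case, the read that hits two genera is left unassigned at the genus rank but is still counted at the phylum rank, where both references agree.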
Woltka provides several utilities for handling feature tables, including normalizing data, collapsing a table to higher-level features, calculating feature group coverage, filtering features based on per-sample abundance, and merging tables.
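To make these table operations concrete, here is a small Python sketch of collapsing, normalizing, and filtering on a feature table held as nested dictionaries. It does not use Woltka's own code or file formats; the table layout (sample -> feature -> count), the function names, and the threshold semantics are all assumptions made for illustration.

```python
def collapse(table, mapping):
    """Sum features into higher-level groups; unmapped features are dropped."""
    out = {}
    for sample, counts in table.items():
        grouped = {}
        for feature, count in counts.items():
            group = mapping.get(feature)
            if group is not None:
                grouped[group] = grouped.get(group, 0) + count
        out[sample] = grouped
    return out


def normalize(table):
    """Convert counts to per-sample fractions (relative abundances)."""
    out = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        out[sample] = {f: (c / total if total else 0.0) for f, c in counts.items()}
    return out


def filter_min_fraction(table, threshold):
    """Within each sample, keep features whose share of the total is >= threshold."""
    out = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        out[sample] = {
            f: c for f, c in counts.items() if total and c / total >= threshold
        }
    return out


# Made-up counts for two samples and three genome-level features.
table = {
    "S01": {"G1": 6, "G2": 2, "G3": 0},
    "S02": {"G1": 1, "G2": 0, "G3": 3},
}
genus_of = {"G1": "Bacteroides", "G2": "Bacteroides", "G3": "Prevotella"}

collapsed = collapse(table, genus_of)
relative = normalize(collapsed)
filtered = filter_min_fraction(collapsed, 0.5)
print(collapsed)
print(relative)
print(filtered)
```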
Woltka does NOT align sequences. You need to align your sequencing data (FastQ, etc.) against a reference database (we recommend WoL) using an aligner of your choice (e.g., Bowtie2). The resulting alignment files can be fed into Woltka.
Woltka does NOT analyze profiles. We recommend using QIIME 2 for robust downstream analyses of the profiles to decode the relationships among microbial communities and with their environments.
Flowchart of Woltka's main classification workflow:
Requirement: Python 3.6 or above.
pip install woltka
See more details about installation.
Woltka provides several small test datasets under woltka/tests/data. To access them, download this GitHub repo, unzip, and navigate to this directory.
One can execute the following commands to make sure that Woltka functions correctly, and to get an impression of the basic usage of Woltka.
(Note: a more complete list of commands is provided here. Alternatively, you can skip this test dataset and check out the instructions for working with WoL.)
woltka classify -i align/bowtie2 -o ogu.biom
The input path, align/bowtie2, is a directory containing five Bowtie2 alignment files (S01.sam.xz, S02.sam.xz, ... S05.sam.xz) (SAM format, xzipped), each representing the mapping of metagenomic sequencing reads per sample against a reference genome database (here are guidelines for performing alignment).
The output file, ogu.biom, is a feature table in BIOM format, which can then be analyzed using various bioinformatics programs such as QIIME 2.
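For readers curious about what the xz-compressed SAM input contains, the following Python sketch reads such a file with the standard-library `lzma` module and extracts the query-to-reference pairs. It is only an illustration and not how Woltka parses alignments internally; the helper name `iter_sam_hits` and the in-memory records are made up so that the example runs without the test dataset.

```python
import io
import lzma


def iter_sam_hits(handle):
    """Yield (query, reference) pairs from SAM-formatted text lines."""
    for line in handle:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":  # read is unmapped
            continue
        yield qname, rname


# Build a tiny xz-compressed SAM file in memory (made-up records).
sam_text = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:G000001\tLN:5000\n"
    "read1\t0\tG000001\t101\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    "read2\t16\tG000002\t250\t1\t10M\t*\t0\t0\tTTGCATGCAA\tIIIIIIIIII\n"
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCCAA\tIIIIIIIIII\n"
)
compressed = io.BytesIO(lzma.compress(sam_text.encode()))

# lzma.open accepts a path or a file object; "rt" decodes to text.
with lzma.open(compressed, "rt") as handle:
    hits = list(iter_sam_hits(handle))
print(hits)
```

To try this on the test data, replace the in-memory buffer with a path such as align/bowtie2/S01.sam.xz.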
woltka classify \
--input align/bowtie2 \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--rank phylum,genus,species \
--output output_dir
The mapping file (taxid.map) translates genome IDs to taxonomy IDs, which then allow Woltka to classify query sequences based on the NCBI taxonomy (nodes.dmp and names.dmp).
The output directory (output_dir) will contain three feature tables: phylum.biom, genus.biom and species.biom, each representing a taxonomic profile at one of the three ranks.
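The role of the three taxonomy files can be sketched in Python as follows. This is a simplified illustration, not Woltka's code: the records are made-up placeholders (not real NCBI TaxIDs), only the leading columns of the NCBI dump format are included, and the helper names `parse_dmp` and `taxon_at_rank` are hypothetical.

```python
# Made-up records in the NCBI dump layout (fields separated by "\t|\t").
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tphylum\t|\n"
    "20\t|\t10\t|\tgenus\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)
NAMES_DMP = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "10\t|\tPhylum_A\t|\t\t|\tscientific name\t|\n"
    "20\t|\tGenus_B\t|\t\t|\tscientific name\t|\n"
    "30\t|\tSpecies_C\t|\t\t|\tscientific name\t|\n"
)
TAXID_MAP = "G1\t30\nG2\t20\n"  # genome ID -> taxonomy ID


def parse_dmp(text):
    """Yield the list of fields for each non-empty line of a .dmp file."""
    for line in text.splitlines():
        if line:
            yield line.rstrip("\t|").split("\t|\t")


def taxon_at_rank(taxid, rank, parents, ranks):
    """Climb toward the root; return the ancestor at `rank`, or None."""
    while True:
        if ranks.get(taxid) == rank:
            return taxid
        parent = parents.get(taxid)
        if parent is None or parent == taxid:  # missing node or root
            return None
        taxid = parent


parents, ranks = {}, {}
for taxid, parent, rank in (fields[:3] for fields in parse_dmp(NODES_DMP)):
    parents[taxid], ranks[taxid] = parent, rank

names = {f[0]: f[1] for f in parse_dmp(NAMES_DMP) if f[3] == "scientific name"}
genome_to_taxid = dict(line.split("\t") for line in TAXID_MAP.splitlines())

genus_of = {
    genome: names.get(taxon_at_rank(taxid, "genus", parents, ranks))
    for genome, taxid in genome_to_taxid.items()
}
species_of = {
    genome: names.get(taxon_at_rank(taxid, "species", parents, ranks))
    for genome, taxid in genome_to_taxid.items()
}
print(genus_of)
print(species_of)
```

Here the genome mapped only to a genus-level TaxID has no species-level ancestor, so it receives no label at the species rank.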
woltka classify \
--input align/bowtie2 \
--coords function/coords.txt.xz \
--map function/uniref/uniref.map.xz \
--map function/go/process.tsv.xz \
--rank uniref,process \
--output output_dir
Here, the input files are still read-to-genome alignments, rather than read-to-gene ones. Woltka matches reads with genes based on their coordinates on genomes using an efficient algorithm ("coord-match"). The gene coordinates are given by the database file coords.txt (see details). The read coordinates are extracted from the alignment files. This ensures consistency between structural and functional analyses.
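The general idea of matching reads to genes by coordinates can be sketched in Python as below. This is a deliberately simple version and not the "coord-match" algorithm itself: the function name, the 1-based inclusive coordinates, and the `min_overlap` threshold (the fraction of the read that must fall inside a gene) are assumptions for illustration.

```python
import bisect


def match_reads_to_genes(reads, genes, min_overlap=0.8):
    """Match reads to genes on one genome by coordinate overlap.

    reads: list of (read_id, start, end); genes: list of (gene_id, start, end).
    Coordinates are assumed 1-based and inclusive. A read matches a gene when
    at least `min_overlap` of the read's length lies within the gene.
    """
    genes = sorted(genes, key=lambda g: g[1])
    starts = [g[1] for g in genes]
    result = {}
    for read_id, r_start, r_end in reads:
        read_len = r_end - r_start + 1
        # Genes starting after the read ends cannot overlap it.
        upper = bisect.bisect_right(starts, r_end)
        hits = []
        for gene_id, g_start, g_end in genes[:upper]:
            overlap = min(r_end, g_end) - max(r_start, g_start) + 1
            if overlap >= min_overlap * read_len:
                hits.append(gene_id)
        result[read_id] = hits
    return result


# Made-up gene and read coordinates on a single genome.
genes = [("geneA", 1, 100), ("geneB", 90, 300), ("geneC", 500, 700)]
reads = [("r1", 10, 59), ("r2", 85, 134), ("r3", 400, 449)]

read_to_genes = match_reads_to_genes(reads, genes)
print(read_to_genes)
```

The binary search only prunes genes that start after the read ends; an implementation meant for large datasets would also avoid rescanning genes that end before the read starts.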
Subsequently, Woltka is able to assign query sequences to functional units, as defined in mapping files (uniref.map and process.tsv). As you can see, compressed files are supported and auto-detected.
Similarly, the output files are two functional profiles: uniref.biom and process.biom.
Two steps. First, perform taxonomic classification. The --outmap parameter writes a read-to-genus mapping file per sample to the directory genus_map/. The --name-as-id flag replaces NCBI TaxIDs with real taxon names in the output.
woltka classify \
--input align/bowtie2 \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--name-as-id \
--rank genus \
--output genus.biom \
--outmap genus_map
Second, perform functional classification. The --stratify parameter imports the genus mappings from the last analysis, and groups functional units (GO processes) by the genus of the source genome.
woltka classify \
--input align/bowtie2 \
--stratify genus_map \
--coords function/coords.txt.xz \
--map function/uniref/uniref.map.xz \
--map function/go/process.tsv.xz \
--rank process \
--output genus_by_process.biom
In the output profile (see below), each feature is a combination of taxonomy and function. This "stratified" profile lets the researcher explore the functional capacities of individual microbial components.
Feature ID | S01 | S02 | S03 | S04 | S05 |
---|---|---|---|---|---|
Aeromonas\|GO:0000917 | 4 | 20 | 3 | 0 | 7 |
Aeromonas\|GO:0005975 | 0 | 12 | 5 | 2 | 0 |
Bacteroides\|GO:0006260 | 105 | 0 | 0 | 0 | 0 |
Bacteroides\|GO:0006281 | 10 | 6 | 2 | 0 | 3 |
Lactobacillus\|GO:0045454 | 2 | 0 | 0 | 34 | 3 |
Lactobacillus\|GO:0055085 | 0 | 0 | 7 | 0 | 0 |
... |
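The way a stratified feature combines the two classifications can be sketched in Python as follows. This is an illustration of the concept rather than Woltka's implementation: the function name `stratify`, the in-memory dictionaries, and the choices to skip reads without a genus and to count a read once per function are assumptions.

```python
from collections import Counter


def stratify(read_to_genus, read_to_functions):
    """Count reads per "genus|function" feature for one sample."""
    profile = Counter()
    for read, functions in read_to_functions.items():
        genus = read_to_genus.get(read)
        if genus is None:
            continue  # assumption: reads without a genus are skipped
        for function in functions:
            profile[f"{genus}|{function}"] += 1
    return dict(profile)


# Made-up per-read assignments for a single sample.
read_to_genus = {
    "read1": "Aeromonas",
    "read2": "Aeromonas",
    "read3": "Bacteroides",
}
read_to_functions = {
    "read1": ["GO:0000917"],
    "read2": ["GO:0000917", "GO:0005975"],
    "read3": ["GO:0006260"],
    "read4": ["GO:0006281"],  # no genus assignment available
}

stratified = stratify(read_to_genus, read_to_functions)
print(stratified)
```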
The first paper describing Woltka was published at:
Note: This paper focuses on the OGU analysis. Although it does not discuss other functions of Woltka, it is so far the only citable paper if you use Woltka in your studies.
Please forward any questions to the project leader: Dr. Qiyun Zhu (qiyunzhu@gmail.com).