jsevo / taxumap

MIT License

TaxUMAP

Visualize structure in large microbiome datasets. Implements a microbiome-research-focused extension of the Uniform Manifold Approximation and Projection (UMAP) by calculating microbiome sample-by-sample distances at different taxonomic aggregations of taxon abundances, and allowing for custom weighting of aggregates.
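The idea of combining distances across taxonomic aggregations can be sketched in a few lines of plain Python. This is a conceptual illustration only, not TaxUMAP's implementation: the Manhattan metric, the additive combination, the helper names (aggregate, manhattan, combined_distance), and the toy values are all assumptions made for the example.

```python
from collections import defaultdict
from math import fsum


def aggregate(sample, taxonomy, level):
    """Sum one sample's ASV abundances within each taxonomic group at `level`."""
    totals = defaultdict(float)
    for asv, abundance in sample.items():
        totals[taxonomy[asv][level]] += abundance
    return dict(totals)


def manhattan(a, b):
    """Manhattan distance between two abundance profiles stored as dicts."""
    return fsum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def combined_distance(s1, s2, taxonomy, levels, weights):
    """ASV-level distance plus one weighted distance per aggregation level."""
    total = manhattan(s1, s2)
    for level, weight in zip(levels, weights):
        total += weight * manhattan(
            aggregate(s1, taxonomy, level), aggregate(s2, taxonomy, level)
        )
    return total


# Toy data (hypothetical): three ASVs spread over two phyla.
taxonomy = {
    "ASV1": {"Phylum": "Firmicutes"},
    "ASV2": {"Phylum": "Firmicutes"},
    "ASV3": {"Phylum": "Verrucomicrobia"},
}
sample1 = {"ASV1": 0.5, "ASV2": 0.4, "ASV3": 0.1}
sample2 = {"ASV1": 0.2, "ASV2": 0.6, "ASV3": 0.2}

d = combined_distance(sample1, sample2, taxonomy, levels=["Phylum"], weights=[1.0])
print(round(d, 6))
```

In this toy case the two samples differ more at the ASV level than at the phylum level, so raising the phylum weight would pull samples with similar phylum composition closer together relative to the ASV-level signal.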

Installation

Notice: TaxUMAP will be made available on both PyPI and Bioconda for installation via pip and conda. Until then, please install it in developer mode with pip install -e . as described below.

git clone https://github.com/jsevo/taxumap.git
cd taxumap
pip install -e .

Quickstart: Notebook Example

Data required

Two tables are required: the microbiota data and a taxonomy table.

The microbiota data file (e.g., examples/example_data/microbiota_table.csv) must have a column with sample indices labeled 'index_column'. The remaining columns are expected to be amplicon sequence variant (ASV) labels or operational taxonomic unit (OTU) labels, i.e., the lowest level of taxonomy:

index_column ASV1 ASV2 ... ASV500
'sample1' 0.5 0.4 ... 0.1
'sample2' 0.2 0.6 ... 0.2
... ... ... ... ...
'sample3' 0.1 0.4 ... 0.5
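Before running TaxUMAP, it can help to confirm that a file follows this layout. The check below is a minimal sketch using only the standard library; the function name check_microbiota_table and the in-memory demo table are made up for illustration and are not part of TaxUMAP.

```python
import csv
import io


def check_microbiota_table(handle):
    """Return (sample_ids, taxon_columns); raise ValueError on a layout problem."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    if "index_column" not in columns:
        raise ValueError("missing required 'index_column' column")
    taxon_columns = [c for c in columns if c != "index_column"]
    sample_ids = []
    for row in reader:
        sample_ids.append(row["index_column"])
        for column in taxon_columns:
            try:
                float(row[column])  # every abundance must parse as a number
            except (TypeError, ValueError):
                raise ValueError(
                    f"non-numeric value for sample {row['index_column']!r}, "
                    f"column {column!r}"
                ) from None
    return sample_ids, taxon_columns


# Hypothetical two-sample table in the expected layout.
demo = io.StringIO(
    "index_column,ASV1,ASV2,ASV3\n"
    "sample1,0.5,0.4,0.1\n"
    "sample2,0.2,0.6,0.2\n"
)
sample_ids, taxon_columns = check_microbiota_table(demo)
print(sample_ids, taxon_columns)
```

For a real file, pass an open file object instead of the StringIO demo, for example open('examples/example_data/microbiota_table.csv', newline='').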

The taxonomy table (e.g., examples/example_data/taxonomy.csv) is expected to resolve higher taxonomic groups for each ASV/OTU in the microbiota table. The index of the taxonomy table should be ASV/OTU labels, while the columns of the taxonomy table should be higher taxonomic categories (e.g., kingdom, phylum, etc.). The columns must be ordered from the broadest rank on the left to the most specific rank on the right, e.g.:

ASV Kingdom Phylum ... Genus Species
'ASV1' 'Bacteria' 'Firmicutes' ... 'Staphylococcus' 'aureus'
'ASV2' 'Bacteria' 'Bacillota' ... '[Ruminococcus]' 'gnavus'
... ... ... ... ... ...
'ASV500' 'Bacteria' 'Verrucomicrobia' ... 'Akkermansia' 'muciniphila'

In the above tables, single quotes ('') designate strings. Any unknown taxonomic levels (e.g., 'unknown species') should be set to np.nan or the string 'nan'. For more information on how to properly resolve unknown taxonomic levels for TaxUMAP, please see the notebook examples/cleaning_taxonomy_table.ipynb. Finally, the taxonomy table should be monophyletic.
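As a rough illustration of the two requirements above (unknown levels marked as 'nan', and every ASV/OTU column resolved by the taxonomy table), here is a small standard-library sketch. The prefixes treated as "unknown", the helper names, and the toy taxonomy are assumptions for the example; the cleaning notebook remains the reference for how to handle real data.

```python
# Hypothetical placeholder prefixes; adapt to the labels used in your data.
UNKNOWN_PREFIXES = ("unknown", "unclassified")


def is_unknown(label):
    """True for empty labels and labels starting with a placeholder prefix."""
    text = (label or "").strip().lower()
    return text == "" or text.startswith(UNKNOWN_PREFIXES)


def mark_unknown(taxonomy):
    """Return a copy of the taxonomy with unknown labels replaced by 'nan'."""
    return {
        asv: {rank: "nan" if is_unknown(label) else label
              for rank, label in ranks.items()}
        for asv, ranks in taxonomy.items()
    }


def missing_from_taxonomy(taxon_columns, taxonomy):
    """List microbiota-table columns that have no row in the taxonomy table."""
    return [asv for asv in taxon_columns if asv not in taxonomy]


# Toy taxonomy (hypothetical), keyed by ASV label.
taxonomy = {
    "ASV1": {"Genus": "Staphylococcus", "Species": "aureus"},
    "ASV2": {"Genus": "Akkermansia", "Species": "unknown species"},
}
cleaned = mark_unknown(taxonomy)
print(cleaned["ASV2"]["Species"])
print(missing_from_taxonomy(["ASV1", "ASV2", "ASV3"], cleaned))
```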


Usage

Command line:

run_taxumap.py -t examples/example_data/taxonomy.csv -m examples/example_data/microbiota_table.csv

The embedding will be saved in the current working folder by default, or in the folder given with the -o path/to/folder/ flag. For best results, the flag -n should be followed by the number of unique patients in your dataset (see the optional flag information below for more details).
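When the CLI is driven from a script, the command can be assembled programmatically. The sketch below uses only the flags mentioned above (-t, -m, -o, -n); the helper names, the output folder, and the patient count of 25 are hypothetical, and running it assumes run_taxumap.py is executable and on your PATH.

```python
import shlex
import subprocess


def build_taxumap_command(taxonomy, microbiota, outdir=None, n_patients=None):
    """Build the argument list for run_taxumap.py from the documented flags."""
    command = ["run_taxumap.py", "-t", str(taxonomy), "-m", str(microbiota)]
    if outdir is not None:
        command += ["-o", str(outdir)]
    if n_patients is not None:
        command += ["-n", str(n_patients)]
    return command


def run_taxumap(*args, **kwargs):
    """Run the CLI; assumes run_taxumap.py can be found on PATH."""
    return subprocess.run(build_taxumap_command(*args, **kwargs), check=True)


command = build_taxumap_command(
    "examples/example_data/taxonomy.csv",
    "examples/example_data/microbiota_table.csv",
    outdir="results/",   # hypothetical output folder
    n_patients=25,       # hypothetical number of unique patients
)
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.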

Python:

from taxumap.taxumap_base import Taxumap

##### Initialize Taxumap object #####

# From file
tu = Taxumap(taxonomy='path/to/taxonomy.csv',
            microbiota_data='path/to/microbiota_table.csv')

##### Run the transformation and look at the results #####

# Transform the data (an inplace function)
tu.transform_self()

# Raw embedding dataframe
tu.df_embedding

# "Which taxon dominates each sample?" dataframe
tu.df_dominant_taxon

# Visualize the embedding (will save to present working directory as "taxumap_scatterplot.pdf")
tu.scatter(save=True)

# Save the embedding (will save to present working directory as "taxumap_embedding.csv" if no parameter passed)
tu.save_embedding('path/to/embedding.csv')

Flags for Command-line Interface (CLI)

Required

Optional, but recommended

Optional, change default behavior


Example notebook (with example data)

A dataset provided by Olin et al. can be used to try out features of TaxUMAP and how to format new data properly.

Publication

Olin A, Henckel E, Chen Y, et al. Stereotypic Immune System Development in Newborn Children. Cell. 2018;174(5):1277-1292.e14. doi:10.1016/j.cell.2018.06.045

Dataset

Olin, Axel (2018), “Stereotypic Immune System Development in Newborn Children”, Mendeley Data, v1

For convenience, we provide a pre-cleaned version of this dataset in the taxumap/examples/example_data directory, as allowed under the CC BY 4.0 license. An accompanying Jupyter Notebook demonstrates how the data were cleaned.

generated summary

This code defines a class called Taxumap, which is used for running the TaxUMAP algorithm. The class constructor takes several arguments, including agg_levels, weights, microbiota_data, taxonomy, name, and random_state. These arguments are used to initialize attributes of the Taxumap object.

The Taxumap class provides the methods transform_self(), scatter(), and save_embedding(), which perform the TaxUMAP transformation, generate a scatter plot of the transformed data, and save the embedding to a file, respectively. It also exposes df_dominant_taxon, accessed as an attribute in the usage example above, which gives the dominant taxonomic group in each sample based on maximum abundance.
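To make "dominant taxonomic group based on maximum abundance" concrete, here is a small standalone sketch. It is not the code behind df_dominant_taxon: the function name, the choice of aggregation level, and the toy values are assumptions for illustration.

```python
def dominant_taxon(sample, taxonomy, level):
    """Return the group at `level` with the largest summed abundance."""
    totals = {}
    for asv, abundance in sample.items():
        group = taxonomy[asv][level]
        totals[group] = totals.get(group, 0.0) + abundance
    return max(totals, key=totals.get)


# Toy data (hypothetical): the most abundant single ASV is not in the
# most abundant phylum.
taxonomy = {
    "ASV1": {"Phylum": "Firmicutes"},
    "ASV2": {"Phylum": "Firmicutes"},
    "ASV3": {"Phylum": "Verrucomicrobia"},
}
sample = {"ASV1": 0.3, "ASV2": 0.3, "ASV3": 0.4}

top_asv = max(sample, key=sample.get)
top_phylum = dominant_taxon(sample, taxonomy, "Phylum")
print(top_asv, top_phylum)
```

The example shows why the aggregation level matters: the answer to "which taxon dominates" can change between the ASV level and a higher rank.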

Overall, this code provides a framework for running the TaxUMAP algorithm and visualizing the results.

License

MIT