Open nick-youngblut opened 6 years ago
Rarefaction in Phyloseq vs rarefaction in qiime 2?
The OTU table was already rarefied before beta-diversity calculation with either phyloseq or qiime2. Qiime2 rarefies without replacement, so using a rarefying sampling depth set to the same value as the sample sums for the pre-rarefied OTU table is equivalent to not rarefying (at least, it should be).
Hi,
I would guess that you have been using an un-rooted phylogenetic tree when calculating the Unifrac distances. To calculate the UniFrac distances you need a rooted tree, and I'm 95% sure that Qiime2 and phyloseq are using different ways to root the phylogenetic tree.
Best, Martin
I believe that all of the trees that I've been using for testing are with rooted phylogenies. I've posted a Jupyter notebook on the qiime2 forum if you'd like to see an example with the GlobalPatterns
dataset.
Dear Nick
Did you hear anything from the Phyloseq guys or the Qiime2 team? We are having a similar issue.
I think that Greg Caporaso was going to look into it, but I haven't heard anything about it. I just ended up using qiime2 instead of phyloseq, because it generated more realistic results for my dataset.
I have exactly the same issue. Moreover, there are two different ways in phyloseq to get wunifrac distance matrix, and both give different results (UniFrac(X, weighted=TRUE, normalized=TRUE) vs. distance(X, method="wunifrac")).
@siberianhigh how do the two phyloseq methods compare to the weighted Unifrac that QIIME2 is producing?
@cjfields they both are different from qiime (and much more similar to each other than to qiime). I still prefer QIIME1, but in my experience there is no difference with QIIME2.
Moreover, "Depending on the method argument, distance() wraps one of UniFrac, DPCoA, JSD, vegdist, betadiver, designdist, or dist." So, I don't understand the origin of these differences..
My lab has been dealing with this as well. I tried to make a reproducible test based off skbio's python Unifrac (which is what QIIME runs under the hood, I think), and I actually can't get it to run in phyloseq.
skbio documentation: http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html
Code adapted to phyloseq
# Reproducible example taken from skbio's Unifract function
# http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html
library(phyloseq)
library(ape)
# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
table=data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep=""))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")
mydata = phyloseq(otu_table(table, taxa_are_rows=TRUE), phy_tree(tree))
# Unifrac; unweighted should be 0.37
UniFrac(mydata, weighted=FALSE)
The result of this is Error in node.desc[i - ntip, ] : subscript out of bounds
. This appears to be from line 616 in distance-methods.R (starting at 614 for context):
# Loop over internal nodes, summing their descendants to get that nodes count
for(i in ord.node){
edge_array[i,] <- colSums(edge_array[node.desc[i-ntip,], , drop=FALSE], na.rm = TRUE)
}
I don't know the guts of how UniFrac is calculated, but it seems that the code is making an assumption that isn't always satisfied. (That there are fewer tips than internal nodes?)
Given that I also ran into an independent error (#1215) using data phyloseq had filtered, I think it's fair to say that the phyloseq UniFrac implementation still has some bugs in it. Unfortunately, I don't know phylogenetic trees and UniFrac well enough to suggest fixes or I'd try to do so. The QIIME developers claims to benchmark their UniFrac against the original PyCogent implementation every build, so it's probably the safer bet to use QIIME for UniFrac until this is resolved. (If need be, you can import the resulting distance matrix into R for downstream analysis.)
@wallacelab have you tried other UniFrac implementations within R, such as unifrac
from the picante
package? I'm wondering whether the simplest short-term fix would be to switch to this if the results are comparable to QIIME2's (and if QIIME2's implementation has been tested).
@cjfields I finally got time to test this, comparing unifrac from phyloseq
, rbiom
, picante
, and GUniFrac
# ##########
# Install needed libraries
# ##########
install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom'))
devtools::install_github("cmmr/rbiom")
# ############
# Benchmark dataset (very simple)
# ############
# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
counts=as.matrix(data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep="")))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")
# ########
# Generate UniFrac distances of everything
# ########
# Phyloseq
library(phyloseq)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)
# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)
# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]
# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)
# Weighted values
rbiom_weighted
guni_weighted
# Unweighted values
rbiom_unweighted
guni_unweigthed
picante_unweighted
Short version is that phyloseq still fails on this, but the other three give identical unweighted unifrac distances (0.3692308). The weighted value is different from rbiom (1.543434) to GUniFrac (0.3275034), which I suspect is due to the latter normalizing it, but I haven't had a chance to confirm.
Still working to check with QIIME, but I think your idea of just using a different library is probably a good one. (Note: picante
only does unweighted, I think, and in my tests was significantly slower than the others, so I'd recommend rbiom
or GUniFrac
. )
Okay, I did some more digging to compare to QIIME.
First, bash script to get sample data and run UniFrac (data is QIIME's own Moving Pictures tutorial):
# Download data
wget https://docs.qiime2.org/2018.11/data/tutorials/moving-pictures/table-deblur.qza # OTU Table
wget https://docs.qiime2.org/2018.11/data/tutorials/moving-pictures/rooted-tree.qza # Tree
# Variables
orig_table=table-deblur.qza
filtered_table=table-filtered.qza
tree=rooted-tree.qza
# Remove OTUs not present in phylogeny
qiime phylogeny filter-table \
--i-table $orig_table \
--i-tree $tree \
--o-filtered-table $filtered_table
# Calculate both weighted and unweighted unifrac
for metric in weighted_unifrac unweighted_unifrac; do
qiime diversity beta-phylogenetic \
--i-table $filtered_table \
--i-phylogeny $tree \
--p-metric $metric \
--o-distance-matrix $metric.qza \
--verbose
# Export to text
qiime tools export \
--input-path $metric.qza \
--output-path $metric
# Move distance matrix file to more sensible location
mv $metric/distance-matrix.tsv $metric.tsv
rmdir $metric
done
# Export table and tree for comparison with R functions
qiime tools export --input-path $filtered_table --output-path ./
qiime tools export --input-path $tree --output-path ./
# Convert biom table to text (less likely to have import issues)
biom convert -i feature-table.biom -o feature-table.biom.txt --to-tsv
Now, R script to compare UniFrac measures:
# Install needed packages
install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
devtools::install_github("cmmr/rbiom")
# Load QIIME output files
counts=as.matrix(read.delim("feature-table.biom.txt", header=T, row.names=1, skip=1))
tree=ape::read.tree("tree.nwk")
ape::is.rooted(tree) # Confirm tree is already rooted
# ########
# Generate UniFrac distances of everything
# ########
# QIIME
qiime_weighted = as.dist(read.delim("weighted_unifrac.tsv", row.names=1))
qiime_unweighted = as.dist(read.delim("unweighted_unifrac.tsv", row.names=1))
# Phyloseq
library(phyloseq)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)
# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)
# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]
# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)
# ###########
# Compare methods
# ###########
compare_dists = function(mydists){
comparison = matrix(NA, nrow=length(mydists), ncol=length(mydists), dimnames=list(names(mydists), names(mydists)))
for(i in 1:length(mydists)){
for(j in i:length(mydists)){
comparison[i,j] = cor(mydists[[i]], mydists[[j]])
}
}
return(comparison)
}
weighted = list(qiime=qiime_weighted, phyloseq = phyloseq_weighted, rbiom = rbiom_weighted, gunifrac = as.dist(guni_weighted))
unweighted = list(qiime=qiime_unweighted, phyloseq = phyloseq_unweighted, rbiom = rbiom_unweighted, gunifrac = as.dist(guni_unweighted), picante=picante_unweighted)
compare_dists(weighted)
compare_dists(unweighted)
Of note, phyloseq
gives gives the following warning:
Warning message:
In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE, :
data length [895] is not a sub-multiple or multiple of the number of rows [448]
The final matrix of comparisons (= correlations among the distance matrices) is:
> compare_dists(weighted)
qiime phyloseq rbiom gunifrac
qiime 1 0.3009552 1.0000000 0.9757110
phyloseq NA 1.0000000 0.3009552 0.4489365
rbiom NA NA 1.0000000 0.9757110
gunifrac NA NA NA 1.0000000
> compare_dists(unweighted)
qiime phyloseq rbiom gunifrac picante
qiime 1 0.9791214 0.9998755 0.9998755 1.0000000
phyloseq NA 1.0000000 0.9799686 0.9799686 0.9791214
rbiom NA NA 1.0000000 1.0000000 0.9998755
gunifrac NA NA NA 1.0000000 0.9998755
picante NA NA NA NA 1.0000000
So in short, rbiom
has a perfect correlation with QIIME
for weighted unifrac and an almost perfect one for unweighted. (Not sure why the difference.) picante
has a perfect correlation for unweighted.
So it seems if you have to pick one, pick rbiom
, and if you can pick two, rbiom
for weighted and picante
for unweighted. (Of course, this is all aside from the fact that 4-5 different packages provide nearly as many different answers for what is supposed to be a deterministic calculation, but that's another issue.)
Update: I realized I hadn't rarefied the data before doing UniFrac. Rarefaction makes the Unweighted Unifac equal across everything (except phyloseq), but weighted still only matches for rbiom:
qiime phyloseq rbiom gunifrac
qiime 1 0.5098069 1.0000000 0.9906098
phyloseq NA 1.0000000 0.5098069 0.5729023
rbiom NA NA 1.0000000 0.9906098
gunifrac NA NA NA 1.0000000
> compare_dists(unweighted)
qiime phyloseq rbiom gunifrac picante
qiime 1 0.975154 1.000000 1.000000 1.000000
phyloseq NA 1.000000 0.975154 0.975154 0.975154
rbiom NA NA 1.000000 1.000000 1.000000
gunifrac NA NA NA 1.000000 1.000000
picante NA NA NA NA 1.000000
So at least as far as matching QIIME goes, rbiom seems the most reliable. (Incidentally, I also tried this with the benchmark data skbio uses for unit testing, and in that case everything but phyloseq matches 100% for both metrics. Looks like there's some complication in real data that the benchmark isn't capturing.)
@wallacelab I'm able to reproduce this, great walkthrough! I suspect the fast UniFrac method implemented in phyloseq has issues with weighting.
It does look like UniFrac distance is supposedly tested?
Yes, and that testing seems to work. (I plugged the data into my comparison and it came out fine; code at end):
> compare_dists(weighted)
gp500_wufu gp500_wuf phyloseq rbiom gunifrac
gp500_wufu 1 0.9765762 0.9765762 1.0000000 0.9765762
gp500_wuf NA 1.0000000 1.0000000 0.9765762 1.0000000
phyloseq NA NA 1.0000000 0.9765762 1.0000000
rbiom NA NA NA 1.0000000 0.9765762
gunifrac NA NA NA NA 1.0000000
> compare_dists(unweighted)
gp500_uuf phyloseq rbiom gunifrac picante
gp500_uuf 1 1 1 1 1
phyloseq NA 1 1 1 1
rbiom NA NA 1 1 1
gunifrac NA NA NA 1 1
picante NA NA NA NA 1
It looks like some programs have different defaults for normalization, but otherwise everything matches. So it looks like the test data lacks whatever feature is causing issues for phyloseq. (I suspect it's with the phylogenetic tree having an odd number of total edges, given the error it spits out.) Notably, the Global Patterns tree used in the unit test has 998 edges (even) and works, but the Moving Pictures one has 1515 (odd) and fails.
Code for above comparison:
# # Install needed packages
# install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
# devtools::install_github("cmmr/rbiom")
library(phyloseq)
# Sample Global Patterns subset data from the phyloseq unit tests
treeFile = system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
GP500File = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
GP500 = import_qiime(GP500File, treefilename = treeFile)
# Now import the results with read.table()
gp500_uuf = read.csv(system.file("extdata", "gp500-uuf.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
gp500_wuf = read.csv(system.file("extdata", "gp500-wuf.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
gp500_wufu = read.csv(system.file("extdata", "gp500-wufu.csv", package = "phyloseq"), header = FALSE, fill = TRUE)
# Add the sample names
colnames(gp500_uuf) <- rownames(gp500_uuf) <- colnames(gp500_wuf) <- rownames(gp500_wuf) <- colnames(gp500_wufu) <- rownames(gp500_wufu) <- sample_names(GP500)
# Coerce to Distance Matrices for comparison `"dist"` class
gp500_wufu <- as.dist(gp500_wufu)
gp500_wuf <- as.dist(gp500_wuf)
gp500_uuf <- as.dist(gp500_uuf)
# ########
# Generate UniFrac distances of everything
# ########
counts = otu_table(GP500)
tree = phy_tree(GP500)
# Phyloseq
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE)
phyloseq_unweighted = phyloseq::UniFrac(phyloseq_data, weighted=FALSE)
# rbiom
rbiom_weighted = rbiom::unifrac(counts, weighted=TRUE, tree=tree)
rbiom_unweighted = rbiom::unifrac(counts, weighted=FALSE, tree=tree)
# GUniFrac
gunifracs = GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)$unifracs
guni_weighted = gunifracs[, , "d_1"]
guni_unweighted = gunifracs[, , "d_UW"]
# picante (unweighted unifrac only, and definitely the slowest)
picante_unweighted = picante::unifrac(t(counts), tree)
# ###########
# Compare methods
# ###########
compare_dists = function(mydists){
comparison = matrix(NA, nrow=length(mydists), ncol=length(mydists), dimnames=list(names(mydists), names(mydists)))
for(i in 1:length(mydists)){
for(j in i:length(mydists)){
comparison[i,j] = cor(mydists[[i]], mydists[[j]])
}
}
return(comparison)
}
weighted = list(gp500_wufu=gp500_wufu, gp500_wuf=gp500_wuf, phyloseq = phyloseq_weighted, rbiom = rbiom_weighted, gunifrac = as.dist(guni_weighted))
unweighted = list(gp500_uuf=gp500_uuf, phyloseq = phyloseq_unweighted, rbiom = rbiom_unweighted, gunifrac = as.dist(guni_unweighted), picante=picante_unweighted)
compare_dists(weighted)
compare_dists(unweighted)
@wallacelab odd, this almost sounds like a fence-post error but only for weighted UniFrac. I see how it fell through the cracks; definitely worth adding a simple test case for catching this.
Hi @wallacelab and @cjfields, if you want to compare phyloseq's weighted unifrac calculation to rbiom's, you should set normalized = FALSE
. I.e.,
phyloseq_weighted = phyloseq::UniFrac(phyloseq_data, weighted=TRUE, normalized = FALSE)
Then running @wallacelab 's code gives near perfect similarity between phyloseq and rbiom's weighted calculations.
Rbiom is computing the quantity u
in DOI.org/10.1128/AEM.01996-06, whereas I believe phyloseq with the normalized = TRUE
option (the default) is returning u / D
and with normalized = FALSE
is returning u
.
(Please let me know if I've missed something here)
Ping. Interested in any updates here. Is this just a normalization difference?
Ping. Interested in any updates here. Is this just a normalization difference?
No, it's not. Different normalization defaults are a minor part of it, but the bigger issue seems to be that Phyloseq is making assumptions about the data without checking if those assumptions are true.
The two outstanding problems I see are:
2) phyloseq still throws a mysterious error while trying to run very basic SKBio test data (reproduced below for convenience). (See my above post on 19 Sept 2019 for the line of code that seems to be causing it.)
# Reproducible example taken from skbio's Unifract function
# http://scikit-bio.org/docs/0.4.2/generated/generated/skbio.diversity.beta.unweighted_unifrac.html
library(phyloseq)
library(ape)
# Make OTU table
u_counts = c(1, 0, 0, 4, 1, 2, 3, 0)
v_counts = c(0, 1, 1, 6, 0, 1, 0, 0)
table=data.frame(u=u_counts, v=v_counts, row.names=paste("OTU", 1:8, sep=""))
tree = read.tree(text="(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,(OTU5:0.5,((OTU6:0.33,OTU7:0.62):0.5,OTU8:0.5):0.5):0.5):1.25):0.0)root;")
mydata = phyloseq(otu_table(table, taxa_are_rows=TRUE), phy_tree(tree))
# Unifrac; unweighted should be 0.37
UniFrac(mydata, weighted=FALSE)
> Error in node.desc[i - ntip, ] : subscript out of bounds
So no, not just normalization. (Though it would be nice to include a message to the user when UniFrac is run so they know what the default is.)
@wallacelab do you have a comparison of the speeds for each implementation of unifrac that you list above? I'm processing a large OTU table right now with GUniFrac, and it's been running for ~72 hours, so I'm looking for a faster implementation.
@nick-youngblut I didn't, but it's not too hard to test. (see https://www.alexejgossmann.com/benchmarking_r/ for benchmarking tutorial)
# # Install needed packages
# install.packages(c('devtools', 'picante', 'phyloseq', 'rbiom', 'ape'))
# devtools::install_github("cmmr/rbiom")
# devtools::install_github("eddelbuettel/rbenchmark")
library(phyloseq)
# Sample Global Patterns subset data from the phyloseq unit tests
treeFile = system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
GP500File = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package = "phyloseq")
GP500 = import_qiime(GP500File, treefilename = treeFile)
# ########
# Benchmark different methods
# ########
counts = otu_table(GP500)
tree = phy_tree(GP500)
phyloseq_data = phyloseq(otu_table(counts, taxa_are_rows=TRUE), tree)
# benchmark
rbenchmark::benchmark("phyloseq"={phyloseq::UniFrac(phyloseq_data, weighted=TRUE)},
"rbiom" = {rbiom::unifrac(counts, weighted=TRUE, tree=tree)},
"gunifrac" = {GUniFrac::GUniFrac(t(counts), tree=tree, alpha=1)},
replications=10,
columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
Which results (on my computer) in
test replications elapsed relative user.self sys.self
3 gunifrac 10 98.359 322.489 98.292 0.036
1 phyloseq 10 0.755 2.475 0.755 0.000
2 rbiom 10 0.305 1.000 0.105 0.080
So looks like rbiom is fastest (~2.5x faster than phyloseq and ~300x faster than gunifrac)
I can back up those results. With an OTU table of approx. 6000 x 3500, GUniFrac didn't finish after 72 hours, while rbiom (with 4 threads) finished in < 1 hour. I didn't try phyloseq due to the questionable results that it can generate.
Any updates on this?
For phyloseq unifrac I wonder if this will solve any of the discrepancies?
I was having issues with edges in the tree and doing this resolved them. But, never validated answers against other programs calculating unifrac
tree = read.tree("FinalRFiles/SalAMPtree.nwk")
new_tre <- ape::multi2di(tree)
Which is mentioned above, but sounds like this may be a bad fix?
I cannot add much to this thread but all I can say is I have a massive dataset of vertebrate gut microbiomes and the phyloseq unifrac ordination leads to a drastically different biological conclusion from the qiime2 ordination. I made to sure to rarefy in phyloseq as well. Based off this thread I will be sticking with the qiime2 output but I hope this gets fixed!
all I can say is I have a massive dataset of vertebrate gut microbiomes and the phyloseq unifrac ordination leads to a drastically different biological conclusion from the qiime2 ordination
It's getting close to 4 years since my original post of this issue, and it seems that qiime (qiime2 now) is still the way to go for calculating (phylogenetic) beta diversity.
It might be best to include large warnings in the documentation that qiime2 and phyloseq can generate very different beta diversity values
@nick-youngblut I agree re: warnings. Unfortunately (on the surface at least) it appears there has been very little additional maintenance or development on phyloseq, though hopefully that will change!
Hi @nick-youngblut - could you comment if this is happening with unweighted unifrac also, or only weighted unifrac. I have noticed that weighted unifrac in phyloseq seems very different in biological conclusion from other measures such as bray-curtis.
@CarlyMuletzWolz see the above posts on unweighted unifrac
A couple of members of my lab and I have been getting very different results when using qiime2 versus phyloseq for calculating weighted unifrac values. It seems that the default parameters used by these tools (qiime diversity core-metrics-phylogenetic versus phyloseq::distance) generate very different weighted unifrac values, even when pre-rarefying the dataset (since qiime2 automatically rarefies).
I've already posted this issue with a reproducible example with the
GlobalPatterns
dataset on the qiime2 forum.Any idea why there's such a big difference between the qiime2 and phyloseq defaults for calculating weighted unifrac values? Moreover, which method is generating "correct" weighted unifrac values? At least from analyzing my own microbiome dataset, the conclusions are completely different depending on if I use the qiime2- versus phyloseq-generated weighted unifrac values.