lmc297 / bactaxR

Bacterial taxonomy construction and evaluation in R

ANI.dendrogram hclust error #2

Closed jessih26 closed 3 years ago

jessih26 commented 3 years ago

Hello,

Thank you so much for developing and sharing this tool! I am trying to make a tree from fastANI pairwise data. After successfully importing the txt file, I received the following error when I ran dend <- ANI.dendrogram(bactaxRObject = ani, ANI_threshold = 95) on my dataset:

Error in hclust(d = d.dist, method = "average") : NaN dissimilarity value.

I double-checked that my data does not contain any NAs by running

is.na(ani@ANI) %>% sum()
[1] 0

I am wondering if you have any insight into how to fix this; it seems that the hclust step is introducing some NAs. Thank you.

Best, Jessi

lmc297 commented 3 years ago

Hi Jessi,

It seems there might be an issue when your ANI values are being converted into a dissimilarity matrix (which is almost certainly not your fault!). It's hard for me to know exactly what's going on without seeing your data, but I would try the following (where ani is your bactaxR object, per your code above) and reply to this comment with the output you get:

1) Just to double-check everything loaded correctly: summary(ani)

2) Just to get a quick summary of your ANI values to make sure there's nothing out of the ordinary/weird: summary(ani@ANI)

3) Another quick test, i.e., a histogram, just to be safe: ANI.histogram(bactaxRObject = ani)

4) Try copying and pasting the code chunk below into RStudio and running it line-by-line; besides showing if/where any errors occur, it includes some sanity checks to catch any step that is not going as planned:

# convert your bactaxR object to a data frame
fastani <- data.frame(ani@query, ani@reference, ani@ANI)

# add column names to the data frame
colnames(fastani) <- c("query", "reference", "ANI")

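# convert to a square matrix, add rownames, drop query column
# (dcast() is assumed here to be the reshape2 version; load it if it is not already attached)
library(reshape2)
s <- dcast(fastani, formula = query ~ reference, value.var = "ANI")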
rownames(s) <- s$query
s <- as.matrix(s[ , !(colnames(s) == "query")])

# as a sanity check, let's look at the dimensions of our matrix
dim(fastani)
dim(s)
# s should be square, and nrow(s) * ncol(s) should equal nrow(fastani)

# another sanity check: check that the values of s make sense, i.e., no NAs
summary(as.vector(s))

# create a square matrix of 100s
j <- matrix(data = 100, nrow = nrow(s), ncol = ncol(s))

# convert s to a matrix of dissimilarities d by subtracting from j
d <- j - s

# sanity check: dimensions of d should equal dimensions of s
dim(s)
dim(d)

# sanity check: dissimilarities make sense and there are no NAs
summary(as.vector(d))

# make the dissimilarities symmetric
d.sym <- 0.5 * (d + t(d))

# again, a sanity check: dimensions of d.sym should equal those of d and s
dim(d.sym)

# another sanity check: d.sym values make sense, are not NA
summary(as.vector(d.sym))

# convert d.sym to a distance object
d.dist <- as.dist(d.sym)

# sanity check: make sure d.dist doesn't have NAs
summary(as.vector(d.dist))

# run hclust on d.dist and plot, see if any errors appear
h <- hclust(d = d.dist, method = "average")
plot(h)
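
If the dimension check above fails (nrow(s) * ncol(s) does not equal nrow(fastani)), it can help to list exactly which query/reference pairs are absent. The following is a minimal base-R sketch of that check, not part of bactaxR; the toy `fastani` data frame and its genome names are made up for illustration and stand in for the data frame built at the top of the chunk.

```r
# Toy stand-in for the `fastani` data frame built above (hypothetical genomes A, B, C).
# The pair query = B, reference = C is deliberately left out.
fastani <- data.frame(
  query     = c("A", "A", "A", "B", "B", "C", "C", "C"),
  reference = c("A", "B", "C", "A", "B", "A", "B", "C"),
  ANI       = c(100, 97.1, 82.3, 97.0, 100, 82.5, 81.9, 100),
  stringsAsFactors = FALSE
)

# every genome that appears as a query or as a reference
genomes <- sort(unique(c(fastani$query, fastani$reference)))

# all ordered pairs that a complete all-vs-all comparison would contain
expected <- expand.grid(query = genomes, reference = genomes,
                        stringsAsFactors = FALSE)

# keep the expected pairs that have no matching row in fastani
observed_key <- paste(fastani$query, fastani$reference, sep = "\t")
expected_key <- paste(expected$query, expected$reference, sep = "\t")
missing_pairs <- expected[!(expected_key %in% observed_key), ]

print(missing_pairs)
```

An empty `missing_pairs` means every pair is present; any rows it does contain are the comparisons that will turn into NAs once the table is reshaped into a matrix.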

Hopefully this might help identify where the error(s) occur! I'd also be happy to look at your raw data, if you prefer.

Let me know if this helps,

Laura

jessih26 commented 3 years ago

Hello Laura,

Thank you so much for your thorough explanation! I found that fastANI somehow did not perform pairwise comparisons across all possible pairs, so my s matrix is not a perfect square. I switched to my AAI results generated using other tools and successfully got the dendrogram. Thanks a lot for your help!

Best, Jessi
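
For readers who hit the same error, the sketch below illustrates how an incomplete set of pairs can pass the is.na(ani@ANI) check and still put NAs into the distance object handed to hclust. It is a toy reconstruction under stated assumptions: the genome names and ANI values are invented, and base-R tapply() stands in for the dcast() step so the example runs without extra packages.

```r
# Toy long-format table (hypothetical genomes); the pair query = B, reference = C is absent
fastani <- data.frame(
  query     = c("A", "A", "A", "B", "B", "C", "C", "C"),
  reference = c("A", "B", "C", "A", "B", "A", "B", "C"),
  ANI       = c(100, 97.1, 82.3, 97.0, 100, 82.5, 81.9, 100)
)

# the long-format values contain no NA, as in the original report
sum(is.na(fastani$ANI))

# stand-in for the dcast() step: a cell with no matching row becomes NA
s <- tapply(fastani$ANI, list(fastani$query, fastani$reference), mean)

# same conversion as in the chunk above: dissimilarity, symmetrize, distance object
d <- 100 - s
d.sym <- 0.5 * (d + t(d))
d.dist <- as.dist(d.sym)

# the missing pair has propagated into the wide matrix and the distance object
anyNA(s)
anyNA(d.dist)
```

In this toy case s is still square because every genome appears at least once as both query and reference. If a genome is missing entirely from one of the two columns, s is no longer square and the d + t(d) step fails before hclust is reached.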