microbiome / miaViz

Microbiome Analysis Plotting and Visualization
https://microbiome.github.io/miaViz
Artistic License 2.0

plotRDA, plot loadings #92

Open TuomasBorman opened 10 months ago

TuomasBorman commented 10 months ago

In OMA, there is an example of how to plot the loadings of features in dbRDA. We could consider generalizing this into a plotting function if that is a common way to analyze which features drive the differences.
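
For reference, a rough sketch of how those loadings can be retrieved, roughly following the OMA example as I recall it (the formula, the distance, and the use of sppscores below are assumptions):

library(mia)
library(vegan)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")
# dbRDA constrained by sample type; feature ("species") scores are not
# computed by default, so they are added from the transformed abundances
rda <- dbrda(t(assay(tse, "relabundance")) ~ SampleType,
             data = as.data.frame(colData(tse)), distance = "bray")
sppscores(rda) <- t(assay(tse, "relabundance"))
loadings_matrix <- scores(rda, display = "species")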

TuomasBorman commented 10 months ago

https://github.com/microbiome/miaViz/pull/87

ElySeraidarian commented 1 day ago

I would like to provide some examples on how to plot these feature loadings after talking with @antagomir about this.

There would be two ways of plotting:

This requires the latest GitHub version of mia.
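
For completeness, the development version can be installed e.g. with remotes (assuming remotes is available):

# Install the development version of mia from GitHub
remotes::install_github("microbiome/mia")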

1) With a tree. This option means that the tree needs to be agglomerated to the same features as the loadings matrix obtained after the reduction method (this is now possible for the LDA and NMF methods, as agglomerateByPrevalence has been fixed so that it can also agglomerate the tree). Here are quick examples for each of these reduction methods:

# This fragment assumes a loadings matrix and the tree objects (phylo, circ,
# df, color) prepared as in the full examples below
df2 <- data.frame(abs(loadings_matrix))
rownames(df2) <- phylo$tip.label

# Plot the tree with the first inner circle (Classes)
p <- gheatmap(
    p = circ, data = df, offset = -.1, width = .1,
    colnames_angle = 95, colnames_offset_y = .5,
    font.size = 4, color = "black") +
    ggplot2::scale_fill_manual(values = color, name = "Class")

# Plot the feature loadings in different circles
for (i in 1:2) {
    if (i == 1) {
        p <- p + ggnewscale::new_scale_fill()
    }
    df3 <- dplyr::select(df2, all_of(i))
    p <- gheatmap(
        p, df3, offset = i * .065, width = .1,
        colnames_angle = 90, font.size = 4,
        high = "darkslateblue", low = "gray98",
        color = "black", legend_title = expression(beta[k]))
}
p

- NMF

library(NMF)
library(mia)
library(ggtree)
data(GlobalPatterns)
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")

# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank = "Phylum", prevalence = 0.99, update.tree = TRUE)

# Perform the NMF reduction method
x <- t(assay(tse, "counts"))
nmf2 <- nmf(x, 2)
H <- nmf2@fit@H
W <- nmf2@fit@W

# Get the feature loadings
feature_loadings <- t(H)
phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(length(unique(df$Phylum)))
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label

df2 <- data.frame(feature_loadings)
rownames(df2) <- phylo$tip.label

# Plot the tree and the first inner circle (Classes)
p <- gheatmap(
    p = circ, data = df, offset = -.1, width = .1,
    colnames_angle = 95, colnames_offset_y = .5,
    font.size = 4, color = "black") +
    ggplot2::scale_fill_manual(values = color, name = "Class")

# Plot the feature loadings in different circles
for (i in 1:2) {
    if (i == 1) {
        p <- p + ggnewscale::new_scale_fill()
    }
    df3 <- dplyr::select(df2, all_of(i))
    p <- gheatmap(
        p, df3, offset = i * .065, width = .1,
        colnames_angle = 90, font.size = 4,
        high = "darkslateblue", low = "gray98",
        color = "black", legend_title = expression(beta[k]))
}
p

- LDA

library(mia)
library(topicmodels)
library(ggtree)

data(GlobalPatterns)
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")

# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank = "Class", update.tree = TRUE, prevalence = 0.99)

# Perform LDA reduction method
lda_model <- LDA(t(assay(tse, "counts")), k = 2)
df <- as.data.frame(t(assay(tse, "counts")))
posteriors <- topicmodels::posterior(lda_model, df)

# Get the feature loadings
feature_loadings <- t(as.data.frame(posteriors$terms))
phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(length(unique(df$Phylum)))
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label

df2 <- data.frame(feature_loadings)
rownames(df2) <- phylo$tip.label

# Plot the tree and first inner circle (Classes)
p <- gheatmap(
    p = circ, data = df, offset = -.1, width = .1,
    colnames_angle = 95, colnames_offset_y = .5,
    font.size = 4, color = "black") +
    ggplot2::scale_fill_manual(values = color, name = "Class")

# Plot the feature loadings in different circles
for (i in 1:2) {
    if (i == 1) {
        p <- p + ggnewscale::new_scale_fill()
    }
    df3 <- dplyr::select(df2, all_of(i))
    p <- gheatmap(
        p, df3, offset = i * .065, width = .1,
        colnames_angle = 90, font.size = 4,
        high = "darkslateblue", low = "gray98",
        color = "black", legend_title = expression(beta[k]))
}
p


The process is exactly the same for every reduction method; the only thing that changes is how the loadings matrix is retrieved and stored, which is why the wrappers discussed in [microbiome/mia#596](https://github.com/microbiome/mia/issues/596) could be helpful (a rough sketch of such a wrapper is included after the PCA example below). As also discussed, it might be possible to recognize the type of reduction method so that the user does not have to specify it when calling the function.
Note that we agglomerate with a high prevalence threshold in order to keep only a few features, so that the plot stays readable (and the legend does not run off the screen).

2) Without a tree. I have made an example for PCA; there are still things that could be changed (e.g. should we add the coefficients inside the circles? should we consider only absolute values?).

library(mia)
library(ggtree)
library(scater)
library(ggcorrplot)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)

# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank = "Phylum", prevalence = 0.99, update.tree = TRUE)

# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)

# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")

# Plot feature loadings
ggcorrplot(abs(loadings_matrix), type = "lower", method = "circle",
           title = "Feature loadings", ggtheme = theme_bw) +
    scale_fill_gradient2(breaks = c(0, 1), limit = c(0, 1))
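
A rough sketch of such a loadings-retrieval wrapper (the function name and the attribute names it checks are assumptions, not an existing mia/miaViz API):

# Hypothetical helper: fetch a feature loadings matrix from a TreeSE.
getLoadingsMatrix <- function(tse, dimred = "PCA") {
    red <- reducedDim(tse, dimred)
    for (att in c("rotation", "loadings")) {
        # e.g. scater::runPCA stores the loadings in the "rotation" attribute;
        # other methods could use a "loadings" attribute
        m <- attr(red, att)
        if (!is.null(m)) {
            return(as.matrix(m))
        }
    }
    stop("No loadings attribute found for reducedDim '", dimred, "'.")
}

# e.g. after the PCA example above:
# loadings_matrix <- getLoadingsMatrix(tse, "PCA")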

antagomir commented 1 day ago

Thanks @ElySeraidarian

Could we formulate a miaViz-style function for one of these, so that the plotting can be done with a single line of code, given that the user provides a TreeSE object that has a) a feature loading matrix (in some position that we can change later, e.g. in rowData, metadata, or an attribute) and b) optionally a rowTree?
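
Something along these lines could work as the interface (plotLoadings and its arguments below are only a hypothetical sketch, not an existing miaViz function):

# Hypothetical interface sketch; names, arguments, and defaults are placeholders
plotLoadings <- function(tse, dimred = "PCA", ncomponents = 2,
                         layout = c("barplot", "heatmap", "tree")) {
    layout <- match.arg(layout)
    # Loadings could live in an attribute of the reducedDim, in rowData,
    # or in metadata; where exactly is still open
    loadings <- attr(reducedDim(tse, dimred), "rotation")
    # ... dispatch to the barplot / heatmap / tree + gheatmap code above,
    # using rowTree(tse) when layout == "tree" ...
    invisible(loadings)
}

# The goal would then be a single call, e.g.:
# plotLoadings(tse, dimred = "PCA", layout = "heatmap")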

In the first round, at least, we can assume that the user has already done the necessary agglomerations etc. There are so many considerations there that we cannot automate all possible scenarios for different data sizes anyway.

TuomasBorman commented 1 day ago

Looks very nice and interesting. However, I am wondering whether that is the most common way to plot loadings (I don't know). At least for eigenvalues, the most common choice might be a scree plot (bar plot). Perhaps we could have a function to plot both loadings and eigenvalues, since both are needed to describe the ordination results fully.

library(mia)
library(ggplot2)
library(patchwork)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", assay.type = "clr")
# Get the feature loadings
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
# Add taxonomy labels to column
df[["feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
# Get top 50 loadings
df <- df[ order(abs(df[["PC1"]]), decreasing = TRUE)[1:50], ]
# Order loadings and add factor to keep the order
df <- df[ order(df[["PC1"]]), ]
df[["feature"]] <- factor(df[["feature"]], levels = df[["feature"]])

# Loadings plot
p1 <- ggplot(df, aes(x = .data[["PC1"]], .data[["feature"]])) +
    geom_bar(stat="identity")

# Get eigenvalues and the proportion of variance explained
eig <- attr(reducedDim(tse, "PCA"), "varExplained")
eig <- eig / sum(eig)
df <- data.frame(eig_vals = eig)
# Label the principal components and keep their order
df[["PC"]] <- paste0("PC", seq_len(nrow(df)))
df[["PC"]] <- factor(df[["PC"]], levels = df[["PC"]])
# Scree plot
p2 <- ggplot(df, aes(x = .data[["PC"]], .data[["eig_vals"]])) +
    geom_bar(stat="identity")

p1 + p2

[image: barplot of the top PC1 loadings (p1) alongside a scree plot of the variance explained (p2)]

antagomir commented 1 day ago

Only some ordination methods provide eigenvalues (e.g. PCA, MDS), and I think that only in PCA (among the commonly used ordinations) do the eigenvalues have a clear and commonly used interpretation. Thus a general function for this might not be feasible.
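
That said, a generic function could simply skip the scree panel when no such information is attached; continuing from the code above (the attribute name follows scater::runPCA and is otherwise an assumption):

# Build the scree panel only when variance-explained information is
# attached to the reduced dimension (as scater::runPCA does via attributes)
red <- reducedDim(tse, "PCA")
var_exp <- attr(red, "varExplained")
if (!is.null(var_exp)) {
    df <- data.frame(
        PC = factor(paste0("PC", seq_along(var_exp)),
                    levels = paste0("PC", seq_along(var_exp))),
        prop = var_exp / sum(var_exp))
    p2 <- ggplot(df, aes(x = PC, y = prop)) +
        geom_col()
}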

It would be interesting to experiment with different ways to visualize loading matrices. The barplot/scree plot is indeed very common. If we have a generic function to visualize loadings, it could offer different options, such as barplot/scree plot, heatmap, etc.

However, the heatmap (option 2) seems to me like a good alternative. There, one should also show the sign of the loading (e.g. with different colors for negative and positive values); also, do not use two different indicators (such as size and color) to denote the same thing. If you could use native ggplot2 instead of ggcorrplot, that would be useful (reducing package dependencies is generally a good thing). I would flip the figure axes for better readability of the species names.
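
A minimal native-ggplot2 sketch along those lines, reusing loadings_matrix from the PCA example above (no ggcorrplot; the diverging scale keeps the sign visible and only colour encodes the value):

library(ggplot2)
# Long format built with base R: one row per feature x component
df_long <- data.frame(
    Feature = rep(rownames(loadings_matrix), times = ncol(loadings_matrix)),
    PC = rep(colnames(loadings_matrix), each = nrow(loadings_matrix)),
    Loading = as.vector(loadings_matrix))
# Heatmap of the signed loadings; colour alone encodes the value
ggplot(df_long, aes(x = PC, y = Feature, fill = Loading)) +
    geom_tile(color = "grey70") +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         midpoint = 0, name = "Loading") +
    theme_bw()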

The tree + heatmap is very nice; loadings plots usually do not show the tree, but it can be useful. The idea has been borrowed from Susan Holmes' phyloseq work.

ElySeraidarian commented 20 hours ago

Thanks for your feedback. Indeed, it will be possible to create a one-line function call that does the plotting. We can add different options for visualizing loadings so that users can find what works best for them.

I have not been able to reproduce the exact same plot using ggplot2, but here is an example I have built based on your feedback:

library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
df[["Feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
colours <- c("red", "white", "blue")

dfgroup <- data.frame(
    PC = c(df$PC1, df$PC2),
    Feature = c(df$Feature, df$Feature),
    group = c(rep("PC1", length(df$PC1)), rep("PC2", length(df$PC2))))

# Plot feature loadings
ggplot(dfgroup, aes(x = .data[["PC"]], y = .data[["Feature"]], group = as.factor(group), col = as.factor(group)))  +
  geom_point(size=6) +
  labs(title="Feature Loadings Plot") +
  scale_colour_manual(name = "Principal Components",
   values = c("red","blue")) +
  xlim(-1, 1)

[image: feature loadings drawn as coloured points, with PC1 and PC2 shown on the same axis]

It is a bit different, as the loading values are not shown on the y-axis, and there is no colour change for positive/negative values.

So here is also an improved version using ggcorrplot, in case we want to use it:

library(ggcorrplot)
library(mia)
library(ggtree)
library(scater)   # provides logNormCounts()
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
colours <- c("red", "white", "blue")
# Plot feature loadings
ggcorrplot(loadings_matrix,
           type = "lower",
           method="circle",
           title="Feature loadings",
           colors = colours,
           ggtheme=theme_bw) + coord_flip()

[image: ggcorrplot heatmap of the feature loadings with a diverging red-white-blue colour scale and flipped axes]

antagomir commented 18 hours ago

I would not combine the two axes like in the first plot; in most cases we would not compare PC1 and PC2 directly. The original ggcorrplot figure looked better, in fact.

A related idea: how about plotting just one axis at a time (as a heatmap), sorting the values in increasing order, and then using colours to illustrate the increasing values (instead of bars)?
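
For example, something like this (again reusing loadings_matrix from the PCA example above):

library(ggplot2)
# One component at a time: sort the PC1 loadings and draw them as a
# single coloured strip instead of bars
df <- data.frame(Feature = rownames(loadings_matrix),
                 Loading = loadings_matrix[, "PC1"])
df <- df[order(df$Loading), ]
df$Feature <- factor(df$Feature, levels = df$Feature)
ggplot(df, aes(x = "PC1", y = Feature, fill = Loading)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         midpoint = 0) +
    labs(x = NULL) +
    theme_bw()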

The standard barplot/scree plot would probably be useful to include as an option as well.

Note: often the user is interested in just a few top features (e.g. 5-10). This might help in finding additional complementary options, too.
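
Selecting the top features could be as simple as:

# Keep e.g. the 10 features with the largest absolute PC1 loadings
top <- head(order(abs(loadings_matrix[, "PC1"]), decreasing = TRUE), 10)
loadings_top <- loadings_matrix[top, , drop = FALSE]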

To move forward, can we create one function that offers the more standard barplot and a heatmap version as options?