Open TuomasBorman opened 10 months ago
I would like to provide some examples on how to plot these feature loadings after talking with @antagomir about this.
There would be 2 ways of plotting :
This requires the latest github version of mia.
1) With tree, this option means that the tree need to be agglomerated to the same features as the loadings matrix we get after performing the reduction method (this is now possible for LDA & NMF methods as agglomerateByPrevalence has been fixed to be possible to agglomerate tree). Here are quick examples for each of these reduction methods :
library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
phylo <- rowTree(tse)
circ <- ggtree(phylo, layout = "circular")
df <- rowData(tse)
color <- randomcoloR::distinctColorPalette(
length(
unique(
df$Phylum
)
)
)
df <- data.frame(Class = df$Phylum)
rownames(df) <- phylo$tip.label
df2 <- data.frame(abs(loadings_matrix)) rownames(df2) <- phylo$tip.label
p <- gheatmap( p = circ, data = df, offset = -.1, width = .1, colnames_angle = 95, colnames_offset_y = .5, font.size = 4, color = "black") + ggplot2::scale_fill_manual( values = color, name = "Class" )
for(i in 1:2){ if(i == 1){ p <- p + ggnewscale::new_scale_fill() } df3 <- dplyr::select( df2, (all_of(i)) )
p <- gheatmap( p, df3, offset = i*.065, width = .1, colnames_angle = 90, font.size = 4, high = "darkslateblue", low = "gray98", color = "black", legend_title = expression(beta[k])) }
p
- NMF
library(NMF) library(mia) library(ggtree) data(GlobalPatterns) tse <- GlobalPatterns tse <- transformAssay(tse, method = "relabundance")
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
x <- t(assay(tse, "counts")) nmf2 <- nmf(x, 2) H <- nmf2@fit@H W <- nmf2@fit@W
feature_loadings <- t(H) phylo <- rowTree(tse) circ <- ggtree(phylo, layout = "circular") df <- rowData(tse) color <- randomcoloR::distinctColorPalette( length( unique( df$Phylum ) ) ) df <- data.frame(Class = df$Phylum) rownames(df) <- phylo$tip.label
df2 <- data.frame(feature_loadings) rownames(df2) <- phylo$tip.label
p <- gheatmap( p = circ, data = df, offset = -.1, width = .1, colnames_angle = 95, colnames_offset_y = .5, font.size = 4, color = "black") + ggplot2::scale_fill_manual( values = color, name = "Class" )
for(i in 1:2){ if(i == 1){ p <- p + ggnewscale::new_scale_fill() } df3 <- dplyr::select( df2, (all_of(i)) ) p <- gheatmap( p, df3, offset = i*.065, width = .1, colnames_angle = 90, font.size = 4, high = "darkslateblue", low = "gray98", color = "black", legend_title = expression(beta[k])) } p
- LDA
library(mia) library(topicmodels) library(ggtree)
data(GlobalPatterns) tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")
tse <- agglomerateByPrevalence(tse, rank = "Class", update.tree = TRUE, prevalence = 0.99)
lda_model <- LDA(t(assay(tse, "counts")), k = 2)
df <- as.data.frame(t(assay(tse, "counts")))
posteriors <- topicmodels::posterior(lda_model, df)
feature_loadings <- t(as.data.frame(posteriors$terms))
phylo <- rowTree(tse) circ <- ggtree(phylo, layout = "circular") df <- rowData(tse) color <- randomcoloR::distinctColorPalette( length( unique( df$Phylum ) ) ) df <- data.frame(Class = df$Phylum) rownames(df) <- phylo$tip.label
df2 <- data.frame(feature_loadings) rownames(df2) <- phylo$tip.label
p <- gheatmap( p = circ, data = df, offset = -.1, width = .1, colnames_angle = 95, colnames_offset_y = .5, font.size = 4, color = "black") + ggplot2::scale_fill_manual( values = color, name = "Class" )
for(i in 1:2){ if(i == 1){ p <- p + ggnewscale::new_scale_fill() } df3 <- dplyr::select( df2, (all_of(i)) ) p <- gheatmap( p, df3, offset = i*.065, width = .1, colnames_angle = 90, font.size = 4, high = "darkslateblue", low = "gray98", color = "black", legend_title = expression(beta[k])) } p
This is the exact same process for every reduction method, the only thing that changes is how the loadings matrix is retrieved and stored which is why wrappers discussed in [microbiome/mia#596](https://github.com/microbiome/mia/issues/596) could be helpful. As also discussed it might be possible to recognize the type of reduction method so that the user do not have to specify method when calling function.
Note that we agglomerate with a high prevalence in order to have only few features so that it can be understandable (and that the legend does not get out of the screen).
2) Without tree, I have made an example for PCA, there are still things that can be changed (e.g. adding coefficients inside circles?, should we consider only absolute values?)
library(mia) library(ggtree) library(scater) library(ggcorrplot) data("GlobalPatterns", package = "mia") tse <- GlobalPatterns tse <- logNormCounts(tse)
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
ggcorrplot(abs(loadings_matrix), type = "lower", method="circle", title="Feature loadings", ggtheme=theme_bw) + scale_fill_gradient2(breaks=c(0, 1), limit=c(0, 1))
Thanks @ElySeraidarian
Could we formulate a miaViz-style function for one of these, so that the plotting can be done on a single line of code, given that the user provides a TreeSE object that has a) feature loading matrix (in some position that we can change later, i.e. in rowData or in metadata or in attr) and b) (optionally) a rowTree.
In the first round at least we can assume that the user has themselves done the necessary agglomerations etc. There are so many considerations on those things that we cannot automated all possible scenarios on the data sizes anyway.
Looks very nice and interesting. However, I am wondering if that is the most usual way to plot loadings ( I don't know). At least for eigenvalues the most usual way might be scree plot (bar plot). We could have function to plot both loadings and eigenvalues perhaps since both are needed to describe the ordination results fully.
library(mia)
library(ggplot2)
library(patchwork)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", assay.type = "clr")
# Get the feature loadings
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
# Add taxonomy labels to column
df[["feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
# Get top 50 loadings
df <- df[ order(abs(df[["PC1"]]), decreasing = TRUE)[1:50], ]
# Order loadings and add factor to keep the order
df <- df[ order(df[["PC1"]]), ]
df[["feature"]] <- factor(df[["feature"]], levels = df[["feature"]])
# Loadings plot
p1 <- ggplot(df, aes(x = .data[["PC1"]], .data[["feature"]])) +
geom_bar(stat="identity")
eig <- attr(reducedDim(tse, "PCA"), "varExplained")
eig <- eig / sum(eig)
df <- data.frame(eig_vals = eig)
# Get percentage of explained
df[["PC"]] <- paste0("PC", seq_len(nrow(df)))
df[["PC"]] <- factor(df[["PC"]], levels = df[["PC"]])
# Scree plot
p2 <- ggplot(df, aes(x = .data[["PC"]], .data[["eig_vals"]])) +
geom_bar(stat="identity")
p1 + p2
Only some ordination methods provide eigenvalues (e.g. PCA, MDS) and I think only in PCA (out of the commonly used ordinations) the eigenvalues have an actual clear + commonly used interpretation. Thus a general function for this might not be feasible.
It would be interesting to experiment with different ways to visualize loading matrices. The barplot/screeplot is indeed very common. If we have generic function to visualize loadings, then this could give different options, like bar/screeplot, heatmap, etc.
However, heatmap (options 2) seems to me like a good alternative. In there, one should show also the sign of the loading (e.g. with different colors for negative and positive values); also do not use two different indicators (like size and color) to denote the same thing. If you could use native ggplot2 instead of ggcorrplot that could be useful (reducing package dependencies is in general good). I would swap the figure for better readability of species names.
The tree + heatmap is very nice since usually the loadings do not show the tree but this can be useful. The idea has been borrowed from Susan Holmes' phyloseq work.
Thanks for your returns, indeed it will be possible to create a call for a one line function that will do the plotting. We can add different options to visualize loadings so that the user will find the best for himself.
I have not been able to plot the exact same plot using ggplot2 but here is an example I have build based on your returns
library(mia)
library(ggtree)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
df <- attr(reducedDim(tse, "PCA"), "rotation") |> as.data.frame()
df[["Feature"]] <- getTaxonomyLabels(tse)[ match(rownames(df), rownames(tse)) ]
colours = c("red","white", "blue")
dfgroup <- data.frame(PC = c(df$PC1,df$PC2), Feature = c(df$Feature, df$Feature), group = c(rep("PC1",length(df$PC1)), rep("PC2", length(df$PC2))))
# Plot feature loadings
ggplot(dfgroup, aes(x = .data[["PC"]], y = .data[["Feature"]], group = as.factor(group), col = as.factor(group))) +
geom_point(size=6) +
labs(title="Feature Loadings Plot") +
scale_colour_manual(name = "Principal Components",
values = c("red","blue")) +
xlim(-1, 1)
It is a bit different as the loadings value differs from y-axis and there is no colour change for positive/negative values.
So here is also the improved version using ggcorrplot in case we want to use it
library(ggcorrplot)
library(mia)
library(ggtree)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- logNormCounts(tse)
# Agglomerate to keep only high prevalence features
tse <- agglomerateByPrevalence(tse, rank="Phylum", prevalence=0.99, update.tree = TRUE)
# Achieve PCA reduction
tse <- runPCA(tse, name = "PCA", ncomponents = 2)
# Get the feature loadings
loadings_matrix <- attr(reducedDim(tse, "PCA"), "rotation")
# Plot feature loadings
ggcorrplot(loadings_matrix,
type = "lower",
method="circle",
title="Feature loadings",
colors = colours,
ggtheme=theme_bw) + coord_flip()
I would not combine the two axes like in the first plot. PC1 and PC2 we wouldn't compare directly in most cases. The original ggcorr plot looked better in fact.
A related idea, how about plotting just one axis at a time (as heatmap), sorting the values in increasing order, and then using colors to illustrate the increasing values (instead of bars)?
Probably the standard barplot / screeplot would be useful to include as option as well.
Note: often the user is interested on just a few top features (e.g. 5-10). This might help to find additional complementary options, too.
To get fwd, can we create one function that has the more standard barplot, and one heatmap version as options?
In OMA, there is an example on how to plot loadings of features in dbRDA. We could consider to generalize this to plotting function if that is a common way to analyze which features drive the difference