microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Compute and plot feature loadings in ordination methods #596

Open thpralas opened 2 days ago

thpralas commented 2 days ago

With @ElySeraidarian, we have been working on computing and plotting feature loadings for several ordination methods. Currently, we can compute and plot the feature loadings for 3 ordination methods: PCA, LDA and NMF. However, we are uncertain about the optimal way to implement this feature in mia (and possibly miaViz for the visualization part).

First, we are uncertain about the optimal way to store feature loadings after calculation. The calculation of feature loadings is different for all methods but they all result in a matrix (features in lines and latent vectors in columns). Ideally, we would like to store the feature loadings at the same place for all ordination methods (e.g. attr or metadata) but it might be difficult. The ordination functions we use to compute feature loadings are not storing them at the same place at all. For instance, when runningrunPCA, the feature loadings matrix is directly stored in attribute "rotation".

Also, we are wondering if it would be best to create a generic wrapper for plotting. @antagomir suggested that we build specific functions for each method and then a generic wrapper that would recognize the type of arguments and call the relevant method.

I think it is important to mention that to visualize the rowTree as well as the feature loadings, we need to restrict the number of features so that the ordination plot remains readable. For that, we use agglomerateByPrevalence (which also agglomerate the rowTree thanks to https://github.com/microbiome/mia/pull/575)

thpralas commented 1 day ago

@TuomasBorman Do you have any suggestions or preferences regarding this?

TuomasBorman commented 1 day ago

1. Can you list all the methods that you are implementing? I believe NMF and LDA are not currently available from TreeSE/SingleCellExperiment ecosystem?

PCA --> scater::runPCA() RDA --> mia::runRDA()/getRDA()

I think first task is to implement LDA and NMF (if they are not yet available). There should be both get and add methods available.

get should output ordination including feature loadings in attribute, that might be the best solution. add should add the ordination result to reducedDim slot.

2.

Wrapper for feature loadings was already discussed briefly here https://github.com/microbiome/miaViz/issues/92. Certainly there is a need for wrapper. That could take either TreeSE or feature loadings matrix as input.

Leo's idea of specific functions and generic wrapper seems reasonable. However, you should keep in mind the maintenance issues; you should avoid "copy-pasting" the code. Here it seems that loadings are in same format (matrix), and only thing that differs is from where are they fetched from TreeSE object. If that is the case, the data fetching part is the only thing that requires ordination-specific method. The rest of the functionality can be shared by all methods. (Not sure though since I have not checked this thing in detail).

I am not sure about the use-case of plotting loadings along with rowTree().

In any case, you should first create a draft file that just has the functionality without formal function-specific stuff. For plotting wrapper, you could use OMA example as a template and make it general. This seems a little bit bigger task, so the ground work is very important; especially if you don't have time to finish this.

I might be slow to respond on July because I am working only part-time.

antagomir commented 1 day ago

It could also help to proceed if visualization methods can be already provided for at least one of the readily available ordination functions. All ordination methods are supposed to provide similar output (features x components) and the visualization method can assume that the output is available in a suitable format.

Tree: at the first stage it could be even sufficient to just assume that the user has subsetted or agglomerated the tree themselves before the analysis, in whatever way they wish. The examples could be done with sufficiently small data sets with not too many features. I am not sure if the plotting method itself is the best place to deal with this. At least not on the first round. In the subsequent steps we can see how to scale up.

I support comments from @TuomasBorman . I also suggest that here in mia issues we discuss details of the ordination method, and in the indicated miaViz issue we discuss the visualization details.

antagomir commented 1 day ago

I agree about creating the get functions for NMF and LDA (similar to getDMM and others that we have already). Or at least one of these to begin with.

All get functions should then return the ordination loadings in the same way. Yes, this can be an issue when we use external functions such as scater::runPCA. We cannot solve that unless we a) implement our on mia::getPCA wrapper on top of it ; b) suggest PR to scater to change this (no idea how it would be received there); c) create a general conversion function getLoadings() that can pick the loadings from different packages & outputs. I might be in favor of option (a).

thpralas commented 20 hours ago

Can you list all the methods that you are implementing? I believe NMF and LDA are not currently available from TreeSE/SingleCellExperiment ecosystem?

PCA --> scater::runPCA() RDA --> mia::runRDA()/getRDA()

I think first task is to implement LDA and NMF (if they are not yet available). There should be both get and add methods available.

get should output ordination including feature loadings in attribute, that might be the best solution. add should add the ordination result to reducedDim slot.

Wrapper for feature loadings was already discussed briefly here microbiome/miaViz#92. Certainly there is a need for wrapper. That could take either TreeSE or feature loadings matrix as input.

Leo's idea of specific functions and generic wrapper seems reasonable. However, you should keep in mind the maintenance issues; you should avoid "copy-pasting" the code. Here it seems that loadings are in same format (matrix), and only thing that differs is from where are they fetched from TreeSE object. If that is the case, the data fetching part is the only thing that requires ordination-specific method. The rest of the functionality can be shared by all methods. (Not sure though since I have not checked this thing in detail).

I am not sure about the use-case of plotting loadings along with rowTree().

In any case, you should first create a draft file that just has the functionality without formal function-specific stuff. For plotting wrapper, you could use OMA example as a template and make it general. This seems a little bit bigger task, so the ground work is very important; especially if you don't have time to finish this.

I might be slow to respond on July because I am working only part-time.

Here are the examples I have for LDA and NMF ordinations, I use topicmodels::LDA and NMF::nmf:

# NMF example
library(NMF)
library(mia)

data(GlobalPatterns)
tse <- GlobalPatterns

tse <- transformAssay(tse, method = "relabundance")

altExp(tse, "GenusPrevalent") <- agglomerateByPrevalence(tse, rank="Genus", assay.type="relabundance",
                                                         detection = 3/100, prevalence=5/100)

altExp(tse, "GenusPrevalent") <- altExp(tse, "GenusPrevalent")[!(rownames(altExp(tse, "GenusPrevalent")) 
                                                                 %in% c("Other")),]

x <- t(assay(altExp(tse, "GenusPrevalent"), "counts"))
nmf2 <- nmf(x, 2)

H <- nmf2@fit@H

feature_loadings <- t(H)
head(feature_loadings)
# LDA example
library(mia)
library(topicmodels)

data(GlobalPatterns)
tse <- GlobalPatterns

tse <- transformAssay(tse, method = "relabundance")

altExp(tse, "GenusPrevalent") <- agglomerateByPrevalence(tse, rank="Genus", assay.type="relabundance",
                                                         detection=3/100, prevalence=5/100)

altExp(tse, "GenusPrevalent") <- altExp(tse, "GenusPrevalent")[!(rownames(altExp(tse,
                                                                                 "GenusPrevalent")) %in% c("Other")),]

lda_model <- LDA(t(assay(altExp(tse, "GenusPrevalent"), "counts")), k = 2)

df <- as.data.frame(t(assay(altExp(tse, "GenusPrevalent"), "counts")))

posteriors <- topicmodels::posterior(lda_model, df)

feature_loadings <- t(as.data.frame(posteriors$terms))
head(feature_loadings)

I can open a PR to implement get and add methods for LDA and NMF, and I agree that a wrapper for PCA would be great to store feature loadings exactly the same way for all 3 ordination methods.

antagomir commented 18 hours ago

I also suggest that you implement wrapper only for either LDA or NMF first. Then we can make sure that the recommended practices are being followed, afterwards do the other method following the same procedure.