microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Get top explanatory features from PCoA loadings #541

Open RiboRings opened 1 month ago

RiboRings commented 1 month ago

Hi! Currently, when performing ordination (PCoA), there is no straightforward way to retrieve the original features that contribute the most to the reduced dimensions. Every time, the user would need to manually correlate features to PCoA loadings for example like this:

res <- apply(
      assay(se, assay.type),
      MARGIN = 1, simplify = FALSE,
      function(x) cor(x, reducedDim(se, dimred), method = method)
  )

  mat <- do.call(rbind, res)
  colnames(mat) <- paste0(dimred, seq(ncol(mat)))
  rownames(mat) <- rownames(se)

  df <- as.data.frame(mat)

I would propose to create a function that takes dimred of interest and correlates loading to an assay.type of interest, and then returns the top n features for every dimension in the form of a dataframe. If relevant, both get and add methods can be developed.

I have a working draft so a PR about this can be opened soon. Do you think that getPrincipalFeatures (in reference to principal components) would be a suitable name? Where is the best place to define it, maybe in summaries.R next to the other getFeatures utilities?

antagomir commented 1 month ago

Great, thanks for suggesting.

Some considerations:

1) This would obviously be a generally applicable shortcut / approximation that works for any reducedDim. For many of them (in particular the nonlinear methods, like MDS/PCoA, NMDS, t-SNE, UMAP etc), there is no direct equivalent of component loadings, and sometimes top correlating features are used to provide rough characterization of the important variables corresponding to the axis. I agree that it could be useful to make this easier as it is a frequent need in fast data exploration. Using this comes with quite some responsibility, however. The user should understand the limitations of this approach, and the approximate nature of the relation. Otherwise misleading interpretations might follow. Perhaps the function name should also reflect this? Like getReducedDimCorrelates.. or something.

2) Methods like PCA directly generate real component loadings. Those should be preferable over some approximated correlations. I am not really sure how to avoid misuse in such cases. At least examples should be clear about these two different cases. Or perhaps metadata slot should contain information if the loadings are readily available, and the suggested function would first check if real loadings are already available, and throw warning if user tries to add approximated versions.. this could be useful if there is an automated visualization function for this. It could by default search for the real loadings, and if those are not found, calculate the correlations.

3) Or could we readily use the getCrossAssociations function?

4) If this function is created, perhaps it should calculate the associations only for the first 2 principal axes by default? Or at least the visualization examples should show that.

5) Could be useful then to have a related miaViz method to provide visualization.

I think it could be useful to create a method and take it into use, then we have more experience using it before concluding how to develop further.

antagomir commented 1 month ago

One reason for not making this too simple is to avoid application by users who don't understand the methods and implications.. therefore one option is also to make a first version (including visualization style) just for the ordinary PCA, for which proper component loadings exist. Then we can use that to streamline the procedure and develop the ideal visualization, while considering how to do this for the non-linear methods.

RiboRings commented 1 month ago

Thank you for the thorough remarks! Considering what you said, perhaps would it be more convenient to extend getCrossAssociation to reducedDims? I suppose keeping it as generic as possible, users may not feel as compelled to misuse it.

Do you think adding this functionality to getCrossAssociation is a good option? Of course, also in this case we can raise warnings and checks to disencourage misuse.

antagomir commented 1 month ago

1) Shortcut solution (correlate features with ordination axes). This option seems to be available in QIIME2 also, the page that I found was broken but it might be possible to find this elsewhere in their manuals. In principle your suggested solution could be neat. In practice it would add complexity on the getCrossAssociation function, which is already rather complex with support for altExps, assays, colData etc. Since altExps are also TreeSEs, I am in fact not sure if altExps really need their own support, that could be abandoned, and reducedDim support added to not add more complexity. Perhaps @TuomasBorman has some comments on this. After calculating the cross-associations, some way to visualize the loadings alongside ordination plot need to be developed. It would be good idea to check inspiration from QIIME2, Mothur, phyloseq, and possibly other frameworks to start with.

2) Rigorous solution is to visualize the loadings that can be calculated in a more principled way. In PCA these are naturally available as the linear loadings. In PCoA, methods have been developed as well (in particular, DPCoA). We should here cite Fukuyama et al. (2012). EMDUnifrac might also be an option but not sure if R implementation is available. There are several examples on making biplots for PCoA (see e.g. ape, workshop, qiime2). Finally, related techniques are available for NMDS (see this1 and this2 and this3). In the longer run, we would like to support the more rigorous solutions for the standard ordinations like PCA, PCoA/MDS and NMDS. In the short run, solution to (1) might be feasible as a rough exploratory method.

If these two cannot be solved on the same go at least new issues could be opened on each?

TuomasBorman commented 1 month ago
  1. I agree that getCrossAssociation might become too complex if we just keep adding stuff. Related to this PR https://github.com/microbiome/mia/pull/484, we had an idea to support altExp in all functions but in "hidden way" --> user could still use altExp but that option not visible to reduce complexity. Anyway, I think altExp could be dropped from getCrossAssociation function.

We could first try to add support for reducedDim directly to getCrossAssociation. If that makes the function too complex, we could create new function that still utilizes getCrossAssociation internally. (Better to have one workhorse that we keep improving)

RiboRings commented 1 month ago

I quickly checked getCrossAssociation and I think it would take me a very long time to familiarise with it and add the reducedDim functionality. Unless you would like to give it a try, I might opt for a new function specific for assay to dimred comparison.

antagomir commented 1 month ago

Seems best to go with an option that utilizes getCrossAssociation internally. The crossAssociation solution could have use for correlating reducedDims not just to assay features but also to colData. That is also relevant (e.g. which diet components from colData correlate best with the first PC axis etc. and how to visualize that).

But the main motivation for this currently seems to be the ability to characterize driving features on each PC axis for PCA & PCoA. Potentially also for NMDS. I would not recommend UMAP and t-SNE by default, and interpretation this way would be even more problematic for those methods.

This still makes me wonder whether we could first focus on providing the rigorous and theoretically justified visualization solutions for PCA & PCoA (or one of these) first? Links to the solutions are provided above.

RiboRings commented 1 month ago

It sounds like a good option. And while we don't have a solution that uses getCrossAssociation, for now we could calculate correlations on the fly within the visualisation function?

antagomir commented 1 month ago

Hmm, there shouldn't be a need to calculate correlations for the rigorous option (PCA and PCoA)?

We can work through an example but essentially, we can dig out variable loadings for PCA and PCoA from the method itself. For PCA directly, for PCoA a bit more indirectly.