microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

LDA ordination with feature loadings computation #598

Open thpralas opened 1 week ago

thpralas commented 1 week ago

This PR adds two new methods, getLDA and addLDA, that perform Latent Dirichlet Allocation (LDA).

getLDA returns the ordination matrix with the feature loadings matrix stored as an attribute. addLDA adds this ordination matrix to the reducedDims of the TreeSummarizedExperiment.

This PR is related to issue https://github.com/microbiome/mia/issues/596.
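For context, a sketch of how the two new methods might be used. The argument names (k, assay.type, name) are assumptions based on existing mia conventions, not the finalized signature from this PR:

```r
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# addLDA: compute LDA and store the sample-by-topic scores in reducedDims
# (hypothetical arguments; see the PR diff for the actual signature)
tse <- addLDA(tse, k = 5, assay.type = "counts", name = "LDA")

# getLDA: return the scores directly, with loadings attached as an attribute
scores <- getLDA(tse, k = 5, assay.type = "counts")
loadings <- attr(scores, "loadings")
```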

antagomir commented 1 week ago

Great! I will first be looking forward to comments from @TuomasBorman, and then let's see if I have anything to add.

antagomir commented 1 week ago

Conflicts to resolve?

TuomasBorman commented 1 week ago

Couple things to discuss:

  1. LDA has features of both ordination and clustering methods; however, it is more of a clustering method. That should be taken into account in the documentation.
  2. As this is a clustering method, should we add the clusters to rowData/colData? In theory, LDA can also be applied to rows, which makes reducedDim unsuitable for storing the result. (Instead of ordination methods, addCluster() should be used as an example.)
  3. And finally, should this be implemented in the bluster package (the Bioconductor clustering package)? We could make a PR as we did last year: https://github.com/LTLA/bluster/blob/master/R/DmmParam.R

antagomir commented 1 week ago

Intuitively, reducedDim is a logical place given that this would usually be done for samples, but it could in principle be applied to both rows and columns. The function generates both the feature matrix and the feature loading matrix; both should be stored.

LDA is a matrix decomposition method, and I would not recommend linking it with clustering so directly.

But it could in principle go into the bluster package. I am not sure whether they would approve, but it could be asked by opening an issue. That might take more time, however, so I would meanwhile proceed with implementing these in mia, and later we can move them elsewhere if necessary. LDA and NMF have clear uses; it would be useful to have them easily accessible.
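A minimal illustration of the storage idea discussed here (scores in reducedDim, with the loadings carried along as an attribute, similarly to how scater attaches the PCA rotation); toy random data, not mia API:

```r
library(SingleCellExperiment)

# Toy data: 10 features x 4 samples
sce <- SingleCellExperiment(
    assays = list(counts = matrix(rpois(40, 5), nrow = 10))
)

scores <- matrix(rnorm(8), ncol = 2)                      # samples x topics
attr(scores, "loadings") <- matrix(rnorm(20), ncol = 2)   # features x topics

reducedDim(sce, "LDA") <- scores
# The loadings travel with the scores matrix:
str(attr(reducedDim(sce, "LDA"), "loadings"))
```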

TuomasBorman commented 1 week ago

With these explanations, I am fine with the mia implementation and reducedDim().

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 66.66667% with 16 lines in your changes missing coverage. Please review.

Please upload report for BASE (devel@dea9d92). Learn more about missing BASE report.

| Files | Patch % | Lines |
|---|---|---|
| R/utils.R | 31.57% | 13 Missing :warning: |
| R/addLDA.R | 89.65% | 3 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             devel     #598    +/-   ##
=========================================
  Coverage         ?   68.08%
=========================================
  Files            ?       44
  Lines            ?     5273
  Branches         ?        0
=========================================
  Hits             ?     3590
  Misses           ?     1683
  Partials         ?        0
```


TuomasBorman commented 1 week ago

Usually, the number of topics "k" is determined by iterating over several values. The best fit is evaluated with coherence and perplexity metrics.

Should k be a vector of values, so that the function iterates over them and calculates an LDA for each number of topics? The function could evaluate the goodness of fit and then return the best model. The function could also return a table reporting the metrics for each k value (which could then easily be plotted).

@antagomir

antagomir commented 6 days ago

OK with me to add that.

However, if it feels like a lot to add now, I would focus on finalizing the present version (with a single fixed k) first, and then add the multi-k optimization as a separate PR.

TuomasBorman commented 6 days ago

That would require something like this:

addLDA <- function(eval.metric = "perplexity"){
    if( eval.metric == "coherence" && !requireNamespace("topicdoc", quietly = TRUE) ){
        stop("'topicdoc' package must be installed in order to calculate coherence.")
    }
    models <- lapply(k, function(x){
        LDA(k = x)
    })

    metrics <- .calc_metrics(models)

    # Get the model with the lowest eval.metric (default value: perplexity).

    # Get scores
    # Add loadings, model, and metrics data.frame to attributes

}

.calc_metrics <- function(x){
    # Loop through the models
    # Calculate perplexity with topicmodels
    # Calculate coherence with the topicdoc package, if it is available
    # Create a data.frame from both metrics
}

TuomasBorman commented 1 day ago

NMF would go with this same template, I think. However, there k (the rank) can be a numeric vector, and the function finds the optimal k by itself.
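For reference, the NMF package can already take a vector of ranks and return fit measures for each, so a wrapper would mainly need to pick the best rank. A sketch with toy data (package usage only, not mia API):

```r
library(NMF)

# Toy non-negative matrix: 20 features x 10 samples
mat <- matrix(rexp(200), nrow = 20)

# Passing several ranks triggers rank estimation; quality measures
# (e.g. cophenetic correlation, RSS) are returned per rank
res <- NMF::nmf(mat, rank = 2:4, nrun = 5, seed = 42)
head(res$measures)
plot(res)  # helps pick the optimal rank visually
```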

TuomasBorman commented 1 day ago

For LDA, we could have another PR for the multiple-k support if you are not able to implement it now.

The idea was this:

data("GlobalPatterns")
tse <- GlobalPatterns
tse <- agglomerateByPrevalence(tse, rank="Phylum")

eval.metric <- "perplexity"
k <- c(2, 3, 4, 5, 6, 7, 8, 9, 10)
k <- 5 # single value for a quick test; comment this out to iterate over all k
assay.type <- "counts"

mat <- assay(tse, assay.type)

df <- as.data.frame(t(mat))

if( eval.metric == "coherence" ){
    .require_package("topicdoc")
}

models <- lapply(k, function(i){
    topicmodels::LDA(df, i)
})

metrics <- .calculate_lda_metrics(models, k)

# Lower perplexity is better; for coherence, higher mean coherence is better
best <- if( eval.metric == "perplexity" ){
    which.min(metrics[["perplexity"]])
} else {
    which.max(metrics[["coherence_mean"]])
}
lda_model <- models[[ best ]]

posteriors <- topicmodels::posterior(lda_model, df)
scores <- t(as.data.frame(posteriors$topics))
# Feature loadings: topic-term probabilities, transposed to features x topics
loadings <- t(posteriors$terms)
attr(scores, "loadings") <- loadings
attr(scores, "model") <- lda_model
attr(scores, "eval_metric") <- metrics

.calculate_lda_metrics <- function(models, k){
    # Note: 'df' is taken from the calling environment in this sketch
    metrics <- lapply(models, function(model){
        res <- topicmodels::perplexity(model)
        names(res) <- "perplexity"
        if( requireNamespace("topicdoc", quietly = TRUE) ){
            coherence <- topicdoc::topic_coherence(model, df)
            names(coherence) <- paste0("coherence_", seq_along(coherence))
            coherence <- c(coherence, coherence_mean = mean(coherence))
            res <- c(res, coherence)
        }
        return(res)
    })
    metrics <- as.data.frame( dplyr::bind_rows(metrics) )
    metrics[["k"]] <- k
    return(metrics)
}

#################################
df <- attr(scores, "eval_metric")

library(ggplot2)

ggplot(df) +
    geom_point(aes(x = k, y = perplexity), color = "red")
ggplot(df) +
    geom_point(aes(x = k, y = coherence_mean), color = "blue")