Open thpralas opened 1 week ago
Great! I will first be looking forward to comments from @TuomasBorman, and then we'll see if I have anything to add.
Conflicts to resolve?
Couple things to discuss:
Intuitively, reducedDim is a logical place, given that this would usually be done for samples. In principle, though, it could be applied to both rows and columns. The function generates both a feature score matrix and a feature loading matrix, and both should be stored.
LDA is a matrix decomposition method, and I would not recommend linking it so directly with clustering.
In principle it could go to the bluster package. I am not sure whether they would approve, but it could be asked by opening an issue there. That might take more time, however, so I would meanwhile proceed with implementing these in mia; we can move them elsewhere later if necessary. LDA and NMF have clear uses, and it would be useful to have them easily accessible.
With these explanations, I am fine with mia implementation and reducedDim()
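To make the discussion concrete, here is a minimal sketch (not the final mia API; the "LDA" slot name and the dummy matrices are my own illustration) of how a score matrix with a loadings attribute can live in reducedDim():

```r
# Sketch only: dummy score/loading matrices stand in for real LDA output.
library(mia)  # loads TreeSummarizedExperiment; reducedDim() comes via SingleCellExperiment
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

k <- 2  # number of topics (arbitrary here)
scores <- matrix(runif(ncol(tse) * k), ncol = k,
                 dimnames = list(colnames(tse), NULL))   # samples x topics
attr(scores, "loadings") <- matrix(runif(nrow(tse) * k),
                                   ncol = k)             # features x topics

reducedDim(tse, "LDA") <- scores  # both matrices travel with the object
loadings <- attr(reducedDim(tse, "LDA"), "loadings")
```

Because the loadings ride along as an attribute, a single reducedDim slot covers both the sample and feature sides of the decomposition.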
Attention: Patch coverage is 66.66667% with 16 lines in your changes missing coverage. Please review.
Please upload report for BASE (devel@dea9d92). Learn more about missing BASE report.
| Files | Patch % | Lines |
|---|---|---|
| R/utils.R | 31.57% | 13 Missing :warning: |
| R/addLDA.R | 89.65% | 3 Missing :warning: |
Usually, the number of topics k is determined by iterating over several values. The best fit is evaluated with coherence and perplexity metrics.
Should k be a vector of values, so that the function iterates over them and calculates an LDA for each number of topics? The function could evaluate the goodness of fit and then return the best model. It could also return a table reporting the metrics for each k value (which could then easily be plotted).
@antagomir
OK by me to add that.
However, if it feels like a lot to add now, I would focus on finalizing the present one (with a single fixed k) first and then add the multi-k optimization as a separate PR.
That would require something like this:
addLDA <- function(k, eval.metric = "perplexity"){
    if( eval.metric == "coherence" && !requireNamespace("topicdoc", quietly = TRUE) ){
        stop("'topicdoc' package must be installed in order to calculate coherence.")
    }
    models <- lapply(k, function(x){
        LDA(k = x)
    })
    metrics <- .calc_metrics(models)
    # Get the model with the best eval.metric (default: perplexity)
    # Get scores
    # Add loadings, model, and metrics data.frame to attributes
}
.calc_metrics <- function(x){
    # Loop through models
    # Calculate perplexity with topicmodels
    # Calculate coherence (I believe you have to use the topicdoc package);
    # calculate it only if that package is available
    # Create a data.frame from both metrics
}
NMF would follow this same template, I think. However, there k (the rank) can be a numeric vector and the function finds the optimal k by itself.
For LDA we could have another PR for multiple k values, if you are not able to implement it now.
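For reference, a hedged sketch with the CRAN NMF package (as far as I remember, passing a numeric vector as rank triggers rank estimation across those values; the dummy matrix is my own illustration):

```r
library(NMF)

# Dummy non-negative count matrix, features x samples; +1 avoids all-zero rows
mat <- matrix(rpois(200, 5), nrow = 20) + 1
est <- nmf(mat, rank = 2:4, nrun = 5)  # fits all ranks, returns quality measures
plot(est)                              # e.g. cophenetic correlation vs rank
res <- nmf(mat, rank = 3)              # final fit at the chosen rank
scores <- t(coef(res))                 # samples x components
loadings <- basis(res)                 # features x components
```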
The idea was this:
data("GlobalPatterns")
tse <- GlobalPatterns
tse <- agglomerateByPrevalence(tse, rank="Phylum")
eval.metric <- "perplexity"
k <- c(2, 3, 4, 5, 6, 7, 8, 9, 10) # candidate numbers of topics
k <- 5 # or a single value (this overrides the vector above for the demo)
assay.type <- "counts"
mat <- assay(tse, assay.type)
df <- as.data.frame(t(mat))
if( eval.metric == "coherence" ){
.require_package("topicdoc")
}
models <- lapply(k, function(i){
topicmodels::LDA(df, i)
})
metrics <- .calculate_lda_metrics(models, k)
# Lower perplexity is better; higher coherence is better
best <- if( eval.metric == "perplexity" ){
    which.min(metrics[["perplexity"]])
} else {
    which.max(metrics[["coherence_mean"]])
}
lda_model <- models[[ best ]]
posteriors <- topicmodels::posterior(lda_model, df)
scores <- t(as.data.frame(posteriors$topics))
loadings <- t(posteriors$terms) # feature loadings: features x topics
attr(scores, "loadings") <- loadings
attr(scores, "model") <- lda_model
attr(scores, "eval_metric") <- metrics
.calculate_lda_metrics <- function(models, k){
metrics <- lapply(models, function(model){
res <- topicmodels::perplexity(model)
names(res) <- "perplexity"
if( requireNamespace("topicdoc", quietly = TRUE) ){
coherence <- topicdoc::topic_coherence(model, df)
names(coherence) <- paste0("coherence_", seq_len(length(coherence)))
coherence <- c(coherence, coherence_mean = mean(coherence))
res <- c(res, coherence)
}
return(res)
})
metrics <- as.data.frame( dplyr::bind_rows(metrics) )
metrics[["k"]] <- k
return(metrics)
}
#################################
df <- attr(scores, "eval_metric")
library(ggplot2)
ggplot(df) +
geom_point(aes(x = k, y = perplexity), color = "red")
ggplot(df) +
geom_point(aes(x = k, y = coherence_mean), color = "blue")
This PR adds two new methods, getLDA and addLDA, that perform Latent Dirichlet Allocation. getLDA returns the ordination matrix with the feature loadings matrix stored as an attribute. addLDA adds this ordination matrix to reducedDims of the TreeSummarizedExperiment. This PR is related to issue https://github.com/microbiome/mia/issues/596.
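Assuming the interface described in this PR (the function names are from the description; the k argument and the "LDA" reducedDim name are my guesses at the exact signature), usage would look roughly like:

```r
library(mia)
data("GlobalPatterns", package = "mia")
tse <- agglomerateByPrevalence(GlobalPatterns, rank = "Phylum")

tse <- addLDA(tse, k = 5)            # stores the ordination in reducedDims
scores <- reducedDim(tse, "LDA")     # samples x topics
loadings <- attr(scores, "loadings") # feature loadings matrix
```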