quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Non-negative matrix factorization #44

Open michalovadek opened 3 years ago

michalovadek commented 3 years ago

I have been using non-negative matrix factorization (NMF) for topic modelling (as an alternative to LDA) for a while now, but so far I have not been able to find a good R package for it. In my limited experience, the NMF package is hard to get working because of its heavy Bioconductor dependencies, and when I did manage to make it work, it seemed slow. The other two packages that can do NMF are NMFN and rNMF; I have found both to be rather slow.

My solution so far has been to use reticulate:

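library(reticulate)

# Use the conda environment that has scikit-learn installed
use_condaenv("r-reticulate")

# Import the decomposition submodule directly
decomposition <- import("sklearn.decomposition")

# reticulate passes R numerics as Python floats, so integers need the L suffix
model <- decomposition$NMF(n_components = 15L, init = "nndsvd",
                           random_state = 23L)

# W: row (document) weights per component; H: component weights per column (term)
W <- model$fit_transform(your_matrix)
H <- model$components_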

This works well, but native R support would obviously be better. I don't know how difficult it would be to port the Python solution to R or to optimize the existing packages, but I thought I would raise this here in case you think it would be a worthy addition to the quanteda.textmodels family. I read your discussion about supporting LDA, but I think the way NMF works is somewhat more conducive to being directly supported here (plus, unlike for LDA, there aren't good alternatives out there).

Greene and Cross 2017 take this a step further (and generally make the case for NMF topic modelling), but for starters a fast NMF decomposer that actually works (with text data) would be nice.
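
To make "NMF decomposer" concrete, here is a minimal base-R sketch using the classic Lee-Seung multiplicative updates for the Frobenius loss. It is only an illustration under stated assumptions: nmf_mu() and its arguments are hypothetical names (not from any package), it expects a dense matrix, it starts from a random initialisation rather than scikit-learn's "nndsvd", and it makes no claim to being fast.

```r
# Minimal NMF via Lee-Seung multiplicative updates (Frobenius loss).
# nmf_mu() and its arguments are hypothetical names, not part of any package.
nmf_mu <- function(V, k, n_iter = 200, eps = 1e-9, seed = 23) {
  stopifnot(is.matrix(V), all(V >= 0), k >= 1)
  set.seed(seed)
  n <- nrow(V)
  m <- ncol(V)
  # Random non-negative start (assumption: "nndsvd" is not reproduced here)
  W <- matrix(runif(n * k), n, k)
  H <- matrix(runif(k * m), k, m)
  for (i in seq_len(n_iter)) {
    # Update H, then W; eps guards against division by zero
    H <- H * crossprod(W, V) / (crossprod(W) %*% H + eps)
    W <- W * tcrossprod(V, H) / (W %*% tcrossprod(H) + eps)
  }
  list(W = W, H = H)
}

# Toy document-feature matrix: 4 documents x 5 terms (made-up counts)
V <- matrix(c(3, 2, 0, 0, 1,
              2, 3, 0, 0, 0,
              0, 0, 4, 2, 1,
              0, 0, 2, 3, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("tax", "budget", "court", "judge", "vote")))

fit <- nmf_mu(V, k = 2)

# Relative reconstruction error of V by W %*% H
rel_err <- norm(V - fit$W %*% fit$H, "F") / norm(V, "F")
```

Here fit$W plays the role of fit_transform()'s output and fit$H that of components_ in the reticulate version above. A real implementation would also need sparse-matrix support and a convergence check instead of a fixed iteration count.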

kbenoit commented 3 years ago

Agreed, this would be a very good addition. I saw a talk last month about guided NMF that was all about text as well: https://arxiv.org/pdf/2010.11365.pdf and https://github.com/jvendrow/GuidedNMF

I am pretty sure we could get the R packages working more cleanly and for sparse inputs.