Open koheiw opened 4 years ago
I managed to get GibbsLDA++ working, so we now have both seeded and regular LDA.
# seeded LDA (replicates https://github.com/koheiw/quanteda.seededlda)
> result10 <- textmodel_lda(dfmt_spnik, verbose = FALSE, seeds = tfmt_spnik)
> terms(result10)
      economy    politics        society         diplomacy    military   nature      other
 [1,] "company"  "parliament"    "police"        "diplomatic" "army"     "human"     "going"
 [2,] "money"    "congress"      "school"        "embassy"    "navy"     "sand"      "really"
 [3,] "market"   "politicians"   "hospital"      "ambassador" "soldiers" "water"     "come"
 [4,] "bank"     "parliamentary" "prison"        "treaty"     "marine"   "syria"     "see"
 [5,] "industry" "lawmakers"     "women"         "diplomat"   "korea"    "syrian"    "american"
 [6,] "banks"    "voters"        "man"           "diplomats"  "korean"   "terrorist" "know"
 [7,] "markets"  "lawmaker"      "investigation" "sanctions"  "missile"  "daesh"     "facebook"
 [8,] "banking"  "politician"    "found"         "iran"       "air"      "turkish"   "much"
 [9,] "china"    "uk"            "court"         "deal"       "nuclear"  "turkey"    "good"
[10,] "chinese"  "eu"            "children"      "meeting"    "force"    "weapons"   "team"
# regular (unseeded) LDA
> result11 <- textmodel_lda(dfmt_spnik, k = 7, verbose = FALSE)
> terms(result11)
      topic1     topic2      topic3      topic4       topic5      topic6         topic7
 [1,] "korea"    "china"     "syria"     "eu"         "going"     "uk"           "police"
 [2,] "korean"   "chinese"   "syrian"    "sanctions"  "really"    "house"        "video"
 [3,] "nuclear"  "economic"  "israel"    "iran"       "much"      "british"      "women"
 [4,] "missile"  "india"     "terrorist" "deal"       "know"      "department"   "court"
 [5,] "air"      "oil"       "daesh"     "union"      "see"       "white"        "man"
 [6,] "nato"     "billion"   "turkish"   "agreement"  "come"      "campaign"     "found"
 [7,] "force"    "trade"     "turkey"    "germany"    "good"      "ukrainian"    "children"
 [8,] "japan"    "project"   "weapons"   "elections"  "something" "secretary"    "service"
 [9,] "kim"      "indian"    "saudi"     "parliament" "facebook"  "ukraine"      "swedish"
[10,] "aircraft" "companies" "iraq"      "german"     "problem"   "intelligence" "rights"
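For readers unfamiliar with this kind of output: each column above lists the highest-ranked terms of one topic. A minimal base-R sketch of that ranking step is below. It assumes a topic-by-term probability matrix; the toy matrix `phi` and the helper `top_terms()` are hypothetical and are not how terms() is implemented in the package.

```r
# Toy topic-word probability matrix (topics in rows, terms in columns).
# The values are made up for illustration only.
phi <- rbind(
  economy  = c(bank = 0.5, market = 0.3, army = 0.1, navy = 0.1),
  military = c(bank = 0.1, market = 0.1, army = 0.5, navy = 0.3)
)

# Hypothetical helper: for each topic, return the n most probable terms.
top_terms <- function(phi, n = 10) {
  n <- min(n, ncol(phi))
  # apply() over rows yields one column of term names per topic
  apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

res <- top_terms(phi, n = 2)
print(res)
```

The result has one column per topic and one row per rank, which is the same layout as the printed matrices above.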
My question is whether I should split this into two functions, textmodel_lda(x, k) and textmodel_seededlda(x, dictionary), just like in my older package.
Just my very subjective two cents: I think a dedicated textmodel_seededlda() function would be a good advertisement for the concept, as it is not widely known yet. That doesn't mean textmodel_lda() shouldn't be able to do it as well, though. It would be like stringi::stri_detect(), which runs stringi::stri_detect_fixed() if one wants it to.
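A rough sketch of that layout, with one general entry point plus a thin, more visible wrapper, could look like the following. Everything here is hypothetical: `lda_sketch()` and `seededlda_sketch()` are stand-ins, the seeds are assumed to be a named list of seed words, and no model is actually fitted.

```r
# Hypothetical general entry point: regular LDA when only k is given,
# seeded LDA when seeds are supplied. The fitting step is omitted.
lda_sketch <- function(x, k = NULL, seeds = NULL) {
  if (!is.null(seeds)) {
    # Seeded case: assume one topic per named seed set
    k <- length(seeds)
    labels <- names(seeds)
  } else {
    if (is.null(k)) stop("either k or seeds must be supplied")
    labels <- paste0("topic", seq_len(k))
  }
  structure(
    list(k = k, labels = labels, seeded = !is.null(seeds)),
    class = "lda_sketch"
  )
}

# Hypothetical dedicated wrapper: only forwards to the general function,
# so both spellings give the same result.
seededlda_sketch <- function(x, dictionary) {
  lda_sketch(x, seeds = dictionary)
}

dict <- list(economy = c("market", "bank"), politics = "parliament")
m_seeded <- seededlda_sketch(NULL, dict)
m_plain <- lda_sketch(NULL, k = 7)
```

With this shape the wrapper adds no logic of its own, so the two functions cannot drift apart. The cost is one more exported name to document.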
@JBGruber thanks for the input. I added textmodel_seededlda() to make it more visible to users.
Sorry to be a downer here, and I was offline for 2 weeks, but seeded LDA is already available through topicmodels::LDA(). See https://github.com/quanteda/quanteda.textmodels/pull/31#pullrequestreview-469639444.
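For context, seeding in this kind of setup generally means giving chosen terms extra weight in chosen topics before sampling starts. A minimal base-R sketch of building such a topic-by-term weight matrix from a dictionary is below. The helper `seed_matrix()`, the toy vocabulary and the weight value are all assumptions made for illustration.

```r
# Hypothetical helper: build a k x V matrix of seed weights, with one row
# per dictionary key and one column per vocabulary term.
seed_matrix <- function(dictionary, vocab, weight = 1) {
  m <- matrix(0, nrow = length(dictionary), ncol = length(vocab),
              dimnames = list(names(dictionary), vocab))
  for (topic in names(dictionary)) {
    # Seed words that do not occur in the vocabulary are silently skipped
    hits <- intersect(dictionary[[topic]], vocab)
    m[topic, hits] <- weight
  }
  m
}

vocab <- c("bank", "market", "parliament", "army", "navy")
dict <- list(
  economy  = c("bank", "market"),
  politics = "parliament",
  military = c("army", "navy", "missile")  # "missile" is not in vocab
)
seeds <- seed_matrix(dict, vocab)
print(seeds)
```

As far as I understand, topicmodels::LDA() with method = "Gibbs" accepts a matrix of this shape through its seedwords argument, but please check the package documentation before relying on that.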
topicmodels::LDA is implemented using this library, which I can call directly via Rcpp: https://sourceforge.net/projects/gibbslda/files/
We can call the library in this way: https://github.com/cran/topicmodels/blob/ade6dc5698f385ad222fd28aa8e90c1a4bd33cf5/R/lda.R#L134-L155
There are a lot of things going on but it shouldn't be too complex for minimal functions that users usually need:
If we implement our own quanteda-native LDA, I will move quanteda.seededlda to this package.
https://github.com/koheiw/quanteda.seededlda