quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Add native textmodel_lda #30

Open koheiw opened 4 years ago

koheiw commented 4 years ago

topicmodels::LDA is implemented using this library, which I can call directly via Rcpp:

https://sourceforge.net/projects/gibbslda/files/

We can call the library in this way:

https://github.com/cran/topicmodels/blob/ade6dc5698f385ad222fd28aa8e90c1a4bd33cf5/R/lda.R#L134-L155

There is a lot going on in that code, but implementing the minimal functions that users usually need shouldn't be too complex.

If we implement our own quanteda-native LDA, I will move quanteda.seededlda into this package.

https://github.com/koheiw/quanteda.seededlda
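For illustration, the general pattern of exposing a C++ routine to R via Rcpp looks roughly like this. This is a minimal sketch with a toy stand-in function, not the actual GibbsLDA++ interface:

```r
# Minimal sketch of wrapping a C++ routine for R via Rcpp.
# The C++ body is a toy placeholder, NOT the GibbsLDA++ estimator:
# a real wrapper would run the Gibbs sampler and return the
# topic-word and document-topic count matrices.
library(Rcpp)

cppFunction('
NumericVector toy_gibbs_counts(int k) {
    // placeholder: fill a vector of length k
    NumericVector out(k);
    for (int i = 0; i < k; i++) out[i] = i + 1;
    return out;
}
')

toy_gibbs_counts(3)
#> [1] 1 2 3
```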

koheiw commented 4 years ago

GibbsLDA++-0.2.tar.gz

koheiw commented 4 years ago

I managed to make GibbsLDA++ work, so we now have both seeded and regular LDA.

# seeded LDA (replicates https://github.com/koheiw/quanteda.seededlda)

> result10 <- textmodel_lda(dfmt_spnik, verbose = FALSE, seeds = tfmt_spnik)
> terms(result10)
      economy    politics        society         diplomacy    military   nature      other     
 [1,] "company"  "parliament"    "police"        "diplomatic" "army"     "human"     "going"   
 [2,] "money"    "congress"      "school"        "embassy"    "navy"     "sand"      "really"  
 [3,] "market"   "politicians"   "hospital"      "ambassador" "soldiers" "water"     "come"    
 [4,] "bank"     "parliamentary" "prison"        "treaty"     "marine"   "syria"     "see"     
 [5,] "industry" "lawmakers"     "women"         "diplomat"   "korea"    "syrian"    "american"
 [6,] "banks"    "voters"        "man"           "diplomats"  "korean"   "terrorist" "know"    
 [7,] "markets"  "lawmaker"      "investigation" "sanctions"  "missile"  "daesh"     "facebook"
 [8,] "banking"  "politician"    "found"         "iran"       "air"      "turkish"   "much"    
 [9,] "china"    "uk"            "court"         "deal"       "nuclear"  "turkey"    "good"    
[10,] "chinese"  "eu"            "children"      "meeting"    "force"    "weapons"   "team"  

# regular (unseeded) LDA
> result11 <- textmodel_lda(dfmt_spnik, k = 7, verbose = FALSE)
> terms(result11)
      topic1     topic2      topic3      topic4       topic5      topic6         topic7    
 [1,] "korea"    "china"     "syria"     "eu"         "going"     "uk"           "police"  
 [2,] "korean"   "chinese"   "syrian"    "sanctions"  "really"    "house"        "video"   
 [3,] "nuclear"  "economic"  "israel"    "iran"       "much"      "british"      "women"   
 [4,] "missile"  "india"     "terrorist" "deal"       "know"      "department"   "court"   
 [5,] "air"      "oil"       "daesh"     "union"      "see"       "white"        "man"     
 [6,] "nato"     "billion"   "turkish"   "agreement"  "come"      "campaign"     "found"   
 [7,] "force"    "trade"     "turkey"    "germany"    "good"      "ukrainian"    "children"
 [8,] "japan"    "project"   "weapons"   "elections"  "something" "secretary"    "service" 
 [9,] "kim"      "indian"    "saudi"     "parliament" "facebook"  "ukraine"      "swedish" 
[10,] "aircraft" "companies" "iraq"      "german"     "problem"   "intelligence" "rights" 

My question is: should I split the function into textmodel_lda(x, k) and textmodel_seededlda(x, dictionary), just like in my older package?

JBGruber commented 4 years ago

Just my very subjective two cents: I think a dedicated textmodel_seededlda() function would be good advertisement for the concept as it is not widely known yet.

That doesn't mean, though, that textmodel_lda() shouldn't be able to do it as well, like stringi::stri_detect(), which dispatches to stringi::stri_detect_fixed() if one asks it to.
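The stringi analogy in code: stri_detect() is a thin dispatcher that picks the engine from the named argument you pass (assuming stringi is installed):

```r
library(stringi)

x <- c("a.b", "acb")

# regex engine: "." matches any character
stri_detect(x, regex = "a.b")
#> [1] TRUE TRUE

# fixed engine: "." is a literal dot, equivalent to stri_detect_fixed()
stri_detect(x, fixed = "a.b")
#> [1]  TRUE FALSE
```

A textmodel_lda() that dispatches on whether a dictionary/seeds argument is supplied would follow the same pattern, while a dedicated textmodel_seededlda() keeps the seeded variant discoverable.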

koheiw commented 4 years ago

@JBGruber thanks for the input. I added textmodel_seededlda() to make it more visible to users.
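As a usage sketch of the resulting two-function interface (not runnable as-is: `dfmt` and `dict` are placeholder objects standing in for a quanteda dfm and dictionary, mirroring the quanteda.seededlda interface):

```r
# unseeded: the number of topics is chosen explicitly
lda  <- textmodel_lda(dfmt, k = 7)

# seeded: topics are defined by the dictionary keys
slda <- textmodel_seededlda(dfmt, dictionary = dict)

terms(slda)  # top terms per topic, as in the output above
```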

kbenoit commented 4 years ago

Sorry to be a downer here - and I was offline for 2 weeks - but seeded LDA is already available through topicmodels::LDA(). See https://github.com/quanteda/quanteda.textmodels/pull/31#pullrequestreview-469639444.