koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

Topic-word probabilities not summing to one #60

Closed odelmarcelle closed 1 year ago

odelmarcelle commented 1 year ago

Hello,

I observed strange behavior when applying the seededLDA model: the topic-word distribution does not always sum to one.

I recently updated the package, and I don't remember having this issue before (though my previous version may have been fairly old).

require(seededlda)
#> Loading required package: seededlda
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: proxyC
#> 
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#> 
#>     dist
#> 
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#> 
#>     terms
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        moster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10)

rowSums(slda$phi)
#>    people     space    moster       war     crime     other 
#> 1.0000000 1.0000000 0.9999004 1.0000000 1.0000000 1.0000000

Created on 2023-06-02 with reprex v2.0.2
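
To quantify a deviation like the one above, the row sums can be compared against one with an explicit tolerance. This is a minimal base R sketch on a made-up matrix standing in for `slda$phi`; the values, the names `dev`, `tol` and `bad`, and the tolerance are all assumptions for illustration, not taken from the model.

```r
# Hypothetical topic-word matrix standing in for slda$phi (values are made up).
phi <- matrix(c(0.5, 0.3, 0.2,
                0.4, 0.4, 0.1999),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("topic1", "topic2"), c("w1", "w2", "w3")))

# Absolute deviation of each row sum from one.
dev <- abs(rowSums(phi) - 1)

# Flag rows whose deviation is far larger than double-precision noise.
tol <- 1e-8  # assumed tolerance, chosen for illustration
bad <- names(dev)[dev > tol]
print(bad)
```

On the real model, the same check would be applied to `slda$phi` instead of the toy matrix.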

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  French_Belgium.utf8
#>  ctype    French_Belgium.utf8
#>  tz       Europe/Paris
#>  date     2023-06-02
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package      * version date (UTC) lib source
#>    cli            3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>    digest         0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
#>    evaluate       0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>    fastmap        1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>    fastmatch      1.1-3   2021-07-23 [1] CRAN (R 4.3.0)
#>    fs             1.6.2   2023-04-25 [1] CRAN (R 4.3.0)
#>    glue           1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>    htmltools      0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>    knitr          1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>    lattice        0.21-8  2023-04-05 [2] CRAN (R 4.3.0)
#>    lifecycle      1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>    magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>    Matrix         1.5-4   2023-04-04 [2] CRAN (R 4.3.0)
#>    proxyC       * 0.3.3   2022-10-06 [1] CRAN (R 4.3.0)
#>    purrr          1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>    quanteda     * 3.3.1   2023-05-18 [1] CRAN (R 4.3.0)
#>    R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>    R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>    R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>    R.utils        2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>    Rcpp           1.0.10  2023-01-22 [1] CRAN (R 4.3.0)
#>  D RcppParallel   5.1.7   2023-02-27 [1] CRAN (R 4.3.0)
#>    reprex         2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>    rlang          1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>    rmarkdown      2.22    2023-06-01 [1] CRAN (R 4.3.0)
#>    rstudioapi     0.14    2022-08-22 [1] CRAN (R 4.3.0)
#>    seededlda    * 1.0.0   2023-05-31 [1] CRAN (R 4.3.0)
#>    sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>    stopwords      2.3     2021-10-28 [1] CRAN (R 4.3.0)
#>    stringi        1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
#>    styler         1.10.0  2023-05-24 [1] CRAN (R 4.3.0)
#>    vctrs          0.6.2   2023-04-19 [1] CRAN (R 4.3.0)
#>    withr          2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>    xfun           0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>    yaml           2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#>
#>  [1] C:/Users/odlmarce/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.0/library
#>
#>  D ── DLL MD5 mismatch, broken installation.
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

koheiw commented 1 year ago

Thank you for this. It looks like a floating point precision problem.

> dict1 <- dictionary(list(people = c("family", "couple", "kids"),
+                         space = c("alien", "planet", "space"),
+                         moster = c("monster*", "ghost*", "zombie*"),
+                         war = c("war", "soldier*", "tanks"),
+                         crime = c("crime*", "murder", "killer")
+                    ))
> slda1 <- textmodel_seededlda(dfmt * 100, dict1, max_iter = 100)
> slda2 <- textmodel_seededlda(dfmt, dict1, max_iter = 100)

> rowSums(slda1$phi)
people  space moster    war  crime 
     1      1      1      1      1 
> rowSums(slda2$phi)
   people     space    moster       war     crime 
1.0000000 1.0000000 0.9998885 1.0000000 1.0000000 
koheiw commented 1 year ago

No, it is actually a rounding problem in the Array object. I will fix it.
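
For scale, genuine double-precision error in a normalized row is typically many orders of magnitude smaller than the roughly 1e-4 gap reported above. The following base R sketch uses arbitrary simulated counts (the matrix size, the Poisson rate and the 0.01 offset are assumptions for illustration and are unrelated to the package internals) to show how small pure floating-point error usually is after normalization.

```r
set.seed(1)

# Simulated positive "counts": 5 topics by 1000 words, values are arbitrary.
counts <- matrix(rpois(5 * 1000, lambda = 3) + 0.01, nrow = 5)

# Normalize each row so that it should sum to one.
phi <- counts / rowSums(counts)

# Largest deviation from one across rows, next to machine epsilon.
max_dev <- max(abs(rowSums(phi) - 1))
print(max_dev)
print(.Machine$double.eps)
```

A deviation near machine epsilon is what floating-point error alone would produce, so a gap of about 1e-4 points to something else, consistent with the rounding explanation.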

koheiw commented 1 year ago

@odelmarcelle can you check if #61 fixed the problem?

odelmarcelle commented 1 year ago

Yes, that solves it for me :smiley: Thanks for the quick response.

odelmarcelle commented 1 year ago

@koheiw Do you plan to release a new version on CRAN soon? This actually causes some tests to fail in a package I created (https://cran.r-project.org/web/checks/check_results_sentopics.html).
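
Until a fixed release is available, a downstream package could renormalize the rows before asserting that they sum to one. This is only a sketch of a possible workaround, using a made-up matrix in place of a fitted model's `phi`; the name `phi_fixed` is hypothetical, and this is not what the package or #61 does.

```r
# Hypothetical matrix standing in for a fitted model's phi; values are made up.
phi <- matrix(c(0.25, 0.25, 0.5,
                0.3,  0.3,  0.3999),
              nrow = 2, byrow = TRUE)

# The second row is short by about 1e-4, so a default-tolerance check fails.
raw_ok <- isTRUE(all.equal(rowSums(phi), c(1, 1)))

# Renormalize: each element is divided by the sum of its own row.
phi_fixed <- phi / rowSums(phi)

# all.equal() compares within a numeric tolerance (about 1.5e-8 by default).
fixed_ok <- isTRUE(all.equal(rowSums(phi_fixed), c(1, 1)))
print(c(raw = raw_ok, fixed = fixed_ok))
```

Renormalizing hides the discrepancy rather than explaining it, so it only makes sense as a stopgap in tests.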

koheiw commented 1 year ago

I actually submitted it to CRAN yesterday.