nikita-moor / ldatuning

LDA models parameters tuning

Check needed for NCOL(dtm) <= # of topics #23

Open ko-ichi-h opened 3 years ago

ko-ichi-h commented 3 years ago

Hello,

Thank you for developing such useful software!

When I run FindTopicsNumber(), I get results normally for some datasets, but for others I get the following error.

fit models... done.
calculate metrics:
     Griffiths2004... done.
     CaoJuan2009... done.
     Arun2010...Error in FUN(X[[i]], ...) : 
     dims [product 71] do not match the length of object [80]
In addition: Warning message:
In cm1/cm2 :
  longer object length is not a multiple of shorter object length

And here is the R script file that gave me the above error: ldatuning_error.zip

If I exclude "Arun2010" from "metrics" option, I get results normally without any errors.

My sessionInfo():

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932   
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C                  
[5] LC_TIME=Japanese_Japan.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ldatuning_1.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         xml2_1.3.2         magrittr_2.0.1    
 [4] munsell_0.5.0      colorspace_2.0-2   tm_0.7-8          
 [7] R6_2.5.0           rlang_0.4.11       fansi_0.5.0       
[10] tools_4.1.0        parallel_4.1.0     grid_4.1.0        
[13] gtable_0.3.0       utf8_1.2.2         modeltools_0.2-23 
[16] ellipsis_0.3.2     tibble_3.1.3       lifecycle_1.0.0   
[19] crayon_1.4.1       gmp_0.6-2          NLP_0.2-1         
[22] ggplot2_3.3.5      vctrs_0.3.8        glue_1.4.2        
[25] slam_0.1-48        Rmpfr_0.8-4        compiler_4.1.0    
[28] pillar_1.6.2       topicmodels_0.2-12 scales_1.1.1      
[31] stats4_4.1.0       pkgconfig_2.0.3   

I also get the same error with R 3.x.

Best.

titaniumtroop commented 3 years ago

Since you're getting results with some datasets/metrics and not others, I suspect you may have NAs, NaNs, NULLs, or other non-numeric values in your data that are causing this type of error. If you confirm the data aren't the issue, it would be helpful if you could post the traceback to pinpoint the error.

Just a note: if memory serves correctly, the original author wrote this package as a grad school project. I took over as the maintainer while working towards my own graduate degree. I'm out of school now so it's been a while since I've actively worked on the project (hence the delayed response), and there isn't any active development going on. If you're interested in contributing to the project, I'm happy to add you to the repo.

Thanks!

ko-ichi-h commented 3 years ago

Hello and thank you for your reply.

I believe the data is not the issue because (1) only "Arun2010" gives me the error while other metrics return results, and (2) for some "topics" settings, "Arun2010" also gives me the result normally. The following command gives me the error but if I delete ", 80" from the "topics" option, it gives me the result normally.

result_tps <- FindTopicsNumber(
    dtm,
    topics   = c(seq(2, 35, by=3), 40, 45, 50, 60, 70, 80),
    metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010" , "Deveaud2014"),
    method   = "Gibbs",
    control  = list(seed = 1234567,  burnin = 1000),
    verbose = T
)

Anyway, here is the traceback() result:

9: FUN(X[[i]], ...)
8: lapply(X = X, FUN = FUN, ...)
7: sapply(models, FUN = function(model) {
       m1 <- exp(model@beta)
       m1.svd <- svd(m1)
       cm1 <- as.matrix(m1.svd$d)
       m2 <- model@gamma
       cm2 <- len %*% m2
       norm <- norm(as.matrix(len), type = "m")
       cm2 <- as.vector(cm2/norm)
       divergence <- sum(cm1 * log(cm1/cm2)) + sum(cm2 * log(cm2/cm1))
       return(divergence)
   })
6: Arun2010(models, dtm)
5: FindTopicsNumber(dtm, topics = c(seq(2, 35, by = 3), 40, 45, 
       50, 60, 70, 80), metrics = c("Griffiths2004", "CaoJuan2009", 
       "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 1234567, 
       burnin = 1000), verbose = T) at ldatuning_error.r#1230
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("C:\\Users\\KO-ichi\\Desktop\\ldatuning_error.r")

Any help would be highly appreciated.

Thank you.

titaniumtroop commented 3 years ago

What are the dimensions of your input dtm? It looks like the number of columns might be 71, which would correspond to the number of terms. Perhaps you can't generate a larger number of topics than you have terms using the Arun method.

If that's the case, there should be a check to confirm that the number of topics specified in FindTopicsNumber doesn't exceed the number of terms in the dtm.

ko-ichi-h commented 3 years ago

> What are the dimensions of your input dtm? It looks like the number of columns might be 71, which would correspond to the number of terms. Perhaps you can't generate a larger number of topics than you have terms using the Arun method.

Yes, you are absolutely right. The dtm has 71 columns, so svd() returns only 71 singular values. With 80 topics requested, the other vector in the calculation has 80 elements, and that mismatch causes the error.
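To illustrate the mismatch outside of ldatuning, here is a minimal sketch in base R. The random matrices are stand-ins (assumptions) for exp(model@beta) and for the per-topic vector cm2 from the traceback; only their shapes matter here.

```r
# Stand-in for exp(model@beta): 80 topics (rows) x 71 terms (columns).
# The values are random; only the dimensions mirror the reported case.
set.seed(1)
n_topics <- 80
n_terms  <- 71
m1 <- matrix(runif(n_topics * n_terms), nrow = n_topics, ncol = n_terms)

# svd() returns min(nrow, ncol) singular values, so 71 here, not 80.
cm1 <- as.matrix(svd(m1)$d)
n_singular <- length(cm1)

# Stand-in for cm2: one entry per topic, so 80 entries.
cm2 <- runif(n_topics)

# Dividing the 71 x 1 matrix by the length-80 vector fails.
# The error message is captured instead of stopping the script.
res <- suppressWarnings(
  tryCatch(cm1 / cm2, error = function(e) conditionMessage(e))
)
print(n_singular)
print(res)
```

With 71 or fewer topics, both vectors would have the same length and the division would go through.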

And yes again, that check on the number of topics should be performed, and a more human-readable error message would be nice.

titaniumtroop commented 3 years ago

Ok, glad we were able to identify the issue. I tagged this as something that needs work.

I question whether it ever makes sense to have more topics than terms. My suggestion would be for the check to throw an error if topics > terms, regardless of which algorithm is selected, unless someone can give a good example of why you'd want to have more topics than terms.

The error should occur before actual processing begins -- it wouldn't be fun for your processing to run for a few days only to get an error at the end.
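A rough sketch of what such an up-front check could look like is below. The helper name check_topics_vs_terms is hypothetical and not part of ldatuning; it assumes NCOL(dtm) gives the number of terms, and a plain matrix stands in for a real DocumentTermMatrix.

```r
# Hypothetical helper (not ldatuning API): fail before any model is fitted
# if a requested number of topics exceeds the number of terms in the dtm.
check_topics_vs_terms <- function(dtm, topics) {
  n_terms  <- NCOL(dtm)                 # assumed: columns of the dtm are terms
  too_many <- topics[topics > n_terms]
  if (length(too_many) > 0) {
    stop(sprintf(
      "Requested topic numbers (%s) exceed the number of terms in the dtm (%d).",
      paste(too_many, collapse = ", "), n_terms
    ), call. = FALSE)
  }
  invisible(TRUE)
}

# Plain matrix standing in for a DocumentTermMatrix with 71 terms.
dtm_demo <- matrix(1L, nrow = 5, ncol = 71)

ok  <- check_topics_vs_terms(dtm_demo, c(2, 5, 70))   # passes silently
msg <- tryCatch(check_topics_vs_terms(dtm_demo, c(2, 80)),
                error = function(e) conditionMessage(e))
print(msg)
```

Because the check only looks at the dimensions of the dtm and the topics vector, it costs essentially nothing and could run before the first model is fitted.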

ko-ichi-h commented 3 years ago

Hmm, it may be possible that term A forms topic Alpha, term B forms topic Beta, and term A & B together form topic Gamma. 2 terms and 3 topics may be possible I think.
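As a toy illustration of that idea (the numbers are made up, not taken from any fitted model), a topic-term matrix with 3 topics over 2 terms could look like this:

```r
# Rows are topics, columns are terms; each row is a probability
# distribution over the two terms. Values are invented for illustration.
beta_demo <- rbind(
  Alpha = c(A = 1.0, B = 0.0),   # topic made of term A only
  Beta  = c(A = 0.0, B = 1.0),   # topic made of term B only
  Gamma = c(A = 0.5, B = 0.5)    # topic mixing terms A and B
)

row_totals <- rowSums(beta_demo)            # each topic sums to 1
n_distinct <- nrow(unique(beta_demo))       # number of distinct topics
n_singular <- length(svd(beta_demo)$d)      # singular values from svd()

print(beta_demo)
print(c(distinct_topics = n_distinct, singular_values = n_singular))
```

The three rows are distinct distributions, but svd() still yields only 2 singular values for this 3 x 2 matrix, which is the same shape problem that Arun2010 runs into.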

So it would be fine to raise an error only when users specify "Arun2010".
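A minimal sketch of such a metric-specific check might look like the following. The function name and arguments are assumptions for illustration, not existing ldatuning code.

```r
# Hypothetical helper (not ldatuning API): raise an error only when
# "Arun2010" is among the requested metrics and some topic number
# exceeds the number of terms.
check_arun_topics <- function(n_terms, topics, metrics) {
  if ("Arun2010" %in% metrics && any(topics > n_terms)) {
    stop(sprintf(
      "Arun2010 needs topics <= number of terms (%d); got a maximum of %d topics.",
      as.integer(n_terms), as.integer(max(topics))
    ), call. = FALSE)
  }
  invisible(TRUE)
}

topics_demo <- c(seq(2, 35, by = 3), 40, 45, 50, 60, 70, 80)

# Without Arun2010, 80 topics on 71 terms is allowed.
ok <- check_arun_topics(71, topics_demo, c("Griffiths2004", "CaoJuan2009"))

# With Arun2010, the same request is rejected with a readable message.
msg <- tryCatch(
  check_arun_topics(71, topics_demo, c("Griffiths2004", "Arun2010")),
  error = function(e) conditionMessage(e)
)
print(msg)
```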