koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

Comparing with topicmodels::LDA #28

Closed masa126 closed 2 years ago

masa126 commented 2 years ago

Dear Kohei, I assume that the theta matrix of your seededlda::textmodel_lda corresponds to the gamma matrix of topicmodels::LDA. I set the same parameters for both, including Gibbs sampling and the random seed, but I found significant differences between the per-document-per-topic probability values. As test data, I used the 110 sections of "Kokoro" by Soseki Natsume. kokoro_df2-m50-71z.csv

Please find a sample comparison result in the attached PNG. kokoro_df2-LDA-slda_2k-10tpc-gamma-T28

Could you kindly show me in detail how the theta matrix is calculated?

Masataka

koheiw commented 2 years ago

It is an interesting experiment. Can you upload the code and the data for me to investigate?

masa126 commented 2 years ago

Sorry for my late response. Please find the input data (DTM) and my R scripts below. Kokoro-LDA-TopicRatio.txt kokoro_71w_DTM.csv Regards

koheiw commented 2 years ago

Thanks for the files. A simple answer to your question is that, although topicmodels and seededlda are based on the same C++ code, they are different packages. To make the plot, you selected the topics in which "K" is the top term, but the other terms in those topics differ between the two packages for this reason.

> seededlda::terms(kokoro_LDAs, 10)
      topic1 topic2 topic3   topic4   topic5 topic6 topic7     topic8     topic9 topic10 
 [1,] "見る" "眼"   "聞く"   "先生"   "妻"   "手紙" "奥さん"   "K"       "父"   "叔父"  
 [2,] "行く" "見る" "言葉"   "人"     "自分" "書く" "お嬢さん" "自分"     "母"   "人"    
 [3,] "帰る" "帰る" "思う"   "見える" "思う" "人"   "女"       "二人"     "兄"   "思う"  
 [4,] "前"   "顔"   "知れる" "答える" "死ぬ" "来る" "自分"     "出る"     "病気" "家"    
 [5,] "出る" "頭"   "事"     "解る"   "人間" "読む" "思う"     "お嬢さん" "死ぬ" "東京"  
 [6,] "卒業" "声"   "問題"   "人間"   "心"   "返事" "出る"     "答える"   "東京" "考える"
 [7,] "聞く" "室"   "前"     "二人"   "外"   "自分" "男"       "知る"     "好い" "自分"  
 [8,] "笑う" "来る" "話"     "知る"   "行く" "気"   "考える"   "考える"   "聞く" "解る"  
 [9,] "顔"   "坐る" "口"     "言葉"   "一人" "今"   "今"       "見える"   "口"   "知る"  
[10,] "思う" "手"   "話す"   "態度"   "意味" "出す" "好い"     "立つ"     "知る" "心"    
> 

> topicmodels::terms(kokoro_LDAt, 10)
      Topic 1  Topic 2    Topic 3 Topic 4  Topic 5 Topic 6  Topic 7  Topic 8  Topic 9 Topic 10
 [1,] "奥さん" "K"       "自分"  "聞く"   "出る"  "人"     "今"     "見る"   "父"    "先生"  
 [2,] "女"     "お嬢さん" "思う"  "言葉"   "帰る"  "知る"   "思う"   "前"     "母"    "答える"
 [3,] "見る"   "室"       "妻"    "考える" "来る"  "見える" "叔父"   "眼"     "書く"  "人"    
 [4,] "顔"     "答える"   "死ぬ"  "口"     "立つ"  "自分"   "家"     "顔"     "手紙"  "卒業"  
 [5,] "急"     "声"       "心"    "話"     "宅"    "解る"   "事"     "手"     "兄"    "解る"  
 [6,] "態度"   "付く"     "人間"  "意味"   "笑う"  "心持"   "東京"   "問題"   "出す"  "人間"  
 [7,] "二人"   "坐る"     "一人"  "様子"   "行く"  "頭"     "知れる" "悪い"   "読む"  "外"    
 [8,] "少し"   "取る"     "気"    "返事"   "歩く"  "好い"   "考える" "少し"   "病気"  "手"    
 [9,] "眼"     "心持"     "外"    "気"     "見る"  "二人"   "頭"     "知れる" "卒業"  "少し"  
[10,] "話す"   "聞く"     "帰る"  "二人"   "二人"  "話す"   "心"     "聞く"   "東京"  "一人"  

It is hard to say which is better, but you can check in which sections "K" should appear. I believe this character appears only late in the story. If so, the seededlda result is the more accurate one.
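Because the topic numbers differ between the two outputs above ("K" leads topic8 in one and Topic 2 in the other), theta and gamma can only be compared column by column after the topics have been matched. The following is a minimal sketch of one way to do that, using cosine similarity between topic-term distributions and a greedy one-to-one assignment. The matrices `phi_a` and `phi_b` and the helpers `cosine_sim()` and `match_topics()` are hypothetical names with made-up values; nothing here is taken from the Kokoro models or from either package.

```r
# Toy topic-term matrices (rows = topics, columns = terms).
# Values are made up for illustration; they are NOT from the Kokoro models.
phi_a <- rbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.1, 0.8),
               c(0.2, 0.6, 0.2))
phi_b <- rbind(c(0.1, 0.7, 0.2),   # resembles topic 3 of model A
               c(0.6, 0.3, 0.1),   # resembles topic 1 of model A
               c(0.2, 0.1, 0.7))   # resembles topic 2 of model A

# Cosine similarity between every topic of A (rows) and every topic of B (columns)
cosine_sim <- function(a, b) {
  a_unit <- a / sqrt(rowSums(a^2))
  b_unit <- b / sqrt(rowSums(b^2))
  a_unit %*% t(b_unit)
}

# Greedy one-to-one matching: repeatedly take the most similar unmatched pair
match_topics <- function(sim) {
  matched <- integer(nrow(sim))
  for (i in seq_len(nrow(sim))) {
    best <- which(sim == max(sim), arr.ind = TRUE)[1, ]
    matched[best[["row"]]] <- best[["col"]]
    sim[best[["row"]], ] <- -Inf   # this topic of A is now taken
    sim[, best[["col"]]] <- -Inf   # this topic of B is now taken
  }
  matched
}

sim <- cosine_sim(phi_a, phi_b)
topic_map <- match_topics(sim)  # topic_map[k] = topic of B matched to topic k of A
print(topic_map)
```

With real models, the two topic-term matrices would first have to be put on a shared vocabulary with the same column order. A document-topic matrix from model B could then be reordered as `theta_b[, topic_map]` before comparing it with the one from model A. Even after matching, the values should not be expected to agree exactly, since the two samplers follow different random paths.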

masa126 commented 2 years ago

Thank you for your response. I confirmed that the document-term matrix elements stored in the dtm format (topicmodels input) and the dfm format (seededlda input) in my R scripts (#76-100 and #153-178) are the same. I specified the same random seed and Gibbs sampling for both topicmodels and seededlda. If topicmodels and seededlda are based on the same C++ code, the results should be the same. Are there any other parameters for topicmodels or seededlda?

koheiw commented 2 years ago

We need to modify the C++ code when creating R packages. For example, seededlda and topicmodels use different mechanisms for random number generation. I also rewrote the code from GibbsLDA++ entirely to replace arrays with vectors.

You might understand the difference if you compare these two files:

https://github.com/koheiw/seededlda/blob/master/src/lda.h
https://github.com/cran/topicmodels/blob/master/src/model.h
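To illustrate why the same seed does not guarantee the same result, here is a small sketch using only R's built-in generators. It is an analogy, not a description of how either package's C++ code draws random numbers: the helper `draw()` is a hypothetical name, and the two generator kinds simply stand in for "two different random number mechanisms".

```r
# Draw three uniform numbers with a given seed and generator kind.
# draw() is a hypothetical helper written for this illustration.
draw <- function(seed, kind) {
  old_kind <- RNGkind()[1]             # remember the current generator
  on.exit(RNGkind(kind = old_kind))    # restore it when the function exits
  set.seed(seed, kind = kind)
  runif(3)
}

x1 <- draw(1234, "Mersenne-Twister")
x2 <- draw(1234, "Mersenne-Twister")
y  <- draw(1234, "Wichmann-Hill")

print(identical(x1, x2))  # same seed, same generator: TRUE
print(identical(x1, y))   # same seed, different generator: FALSE
```

A Gibbs sampler draws a new topic assignment for every token at every iteration, so two samplers fed different random streams diverge from the first draws onward, even when the seed value, the input matrix, and the hyperparameters are identical.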

masa126 commented 2 years ago

Thank you for your comments. I understand that the difference between topicmodels and seededlda is caused by their different mechanisms for random number generation.