Closed masa126 closed 2 years ago
It is an interesting experiment. Can you upload the code and the data for me to investigate?
Sorry for my late response. Please find the input data (DTM) and my R scripts below: Kokoro-LDA-TopicRatio.txt, kokoro_71w_DTM.csv. Regards
Thanks for the files. The simple answer to your question is that, although topicmodels and seededlda are based on the same C++ code, they are different packages. You selected the topics where "K" is the top term to make the plot, but the other terms in those topics differ between the two packages for this reason.
> seededlda::terms(kokoro_LDAs, 10)
topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10
[1,] "見る" "眼" "聞く" "先生" "妻" "手紙" "奥さん" "K" "父" "叔父"
[2,] "行く" "見る" "言葉" "人" "自分" "書く" "お嬢さん" "自分" "母" "人"
[3,] "帰る" "帰る" "思う" "見える" "思う" "人" "女" "二人" "兄" "思う"
[4,] "前" "顔" "知れる" "答える" "死ぬ" "来る" "自分" "出る" "病気" "家"
[5,] "出る" "頭" "事" "解る" "人間" "読む" "思う" "お嬢さん" "死ぬ" "東京"
[6,] "卒業" "声" "問題" "人間" "心" "返事" "出る" "答える" "東京" "考える"
[7,] "聞く" "室" "前" "二人" "外" "自分" "男" "知る" "好い" "自分"
[8,] "笑う" "来る" "話" "知る" "行く" "気" "考える" "考える" "聞く" "解る"
[9,] "顔" "坐る" "口" "言葉" "一人" "今" "今" "見える" "口" "知る"
[10,] "思う" "手" "話す" "態度" "意味" "出す" "好い" "立つ" "知る" "心"
>
> topicmodels::terms(kokoro_LDAt, 10)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
[1,] "奥さん" "K" "自分" "聞く" "出る" "人" "今" "見る" "父" "先生"
[2,] "女" "お嬢さん" "思う" "言葉" "帰る" "知る" "思う" "前" "母" "答える"
[3,] "見る" "室" "妻" "考える" "来る" "見える" "叔父" "眼" "書く" "人"
[4,] "顔" "答える" "死ぬ" "口" "立つ" "自分" "家" "顔" "手紙" "卒業"
[5,] "急" "声" "心" "話" "宅" "解る" "事" "手" "兄" "解る"
[6,] "態度" "付く" "人間" "意味" "笑う" "心持" "東京" "問題" "出す" "人間"
[7,] "二人" "坐る" "一人" "様子" "行く" "頭" "知れる" "悪い" "読む" "外"
[8,] "少し" "取る" "気" "返事" "歩く" "好い" "考える" "少し" "病気" "手"
[9,] "眼" "心持" "外" "気" "見る" "二人" "頭" "知れる" "卒業" "少し"
[10,] "話す" "聞く" "帰る" "二人" "二人" "話す" "心" "聞く" "東京" "一人"
It is hard to say which is better, but you can check in which sections "K" should appear. I believe this character appears only late in the story. If so, the seededlda result is more accurate.
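To put a number on how different the two "K" topics are, one option is to measure the overlap of their top terms. The sketch below copies the top 10 terms of seededlda topic8 and topicmodels Topic 2 from the output above; the variable names and the choice of Jaccard similarity are only for illustration, not something either package provides.

```r
# Top 10 terms of the two "K" topics, copied from the terms() output above.
seeded_topic8 <- c("K", "自分", "二人", "出る", "お嬢さん",
                   "答える", "知る", "考える", "見える", "立つ")
tm_topic2 <- c("K", "お嬢さん", "室", "答える", "声",
               "付く", "坐る", "取る", "心持", "聞く")

# Terms that appear in both top-10 lists.
shared <- intersect(seeded_topic8, tm_topic2)

# Jaccard similarity: shared terms divided by all distinct terms.
jaccard <- length(shared) / length(union(seeded_topic8, tm_topic2))

print(shared)
print(jaccard)
```

Only three of the ten top terms are shared, so the two topics are not interchangeable even though "K" leads both. Topic numbers are also arbitrary in each fit, so topics should be matched by their terms rather than by their index.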
Thank you for your response. I confirmed that the document-term matrix elements stored in the dtm format (topicmodels input) and in the dfm format (seededlda input) are the same; see lines #76-100 and #153-178 of my R scripts. I specified the same random seed and the same Gibbs sampling settings for topicmodels and seededlda. If topicmodels and seededlda are based on the same C++ code, the results should be the same. Are there any other parameters for topicmodels or seededlda that I should set?
The C++ code has to be modified when it is wrapped in an R package. For example, seededlda and topicmodels use different mechanisms for random number generation. I also rewrote the GibbsLDA++ code entirely to replace arrays with vectors.
You might understand the difference if you compare these two files:
https://github.com/koheiw/seededlda/blob/master/src/lda.h https://github.com/cran/topicmodels/blob/master/src/model.h
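The effect of the random number generator can be illustrated in base R alone. This is only an analogy, not the code of either package: it draws numbers with the same seed from two of R's built-in generators and shows that the seed reproduces results only when the generator is also the same.

```r
# Remember the current RNG settings so they can be restored afterwards.
old_kind <- RNGkind()

# Same seed, generator A.
set.seed(1234, kind = "Mersenne-Twister")
draws_mt <- runif(5)

# Same seed, generator B.
set.seed(1234, kind = "Wichmann-Hill")
draws_wh <- runif(5)

# Same seed and generator A again.
set.seed(1234, kind = "Mersenne-Twister")
draws_mt_again <- runif(5)

# Restore the original RNG settings.
do.call(RNGkind, as.list(old_kind))

print(identical(draws_mt, draws_mt_again))  # same seed, same generator
print(identical(draws_mt, draws_wh))        # same seed, different generator
```

A Gibbs sampler consumes random numbers at every topic assignment, so two implementations that draw from different generators diverge from the first iteration even if the seed value is identical.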
Thank you for your comments. I understand that the difference between topicmodels and seededlda is caused by the mechanism for random number generation.
Dear Kohei, I guess the theta matrix of seededlda::textmodel_lda is the same as the gamma matrix of topicmodels::LDA. I set the same parameters, such as Gibbs sampling and the random seed, for both. However, I found significant differences between the per-document-per-topic probability values. I used the 110 sections of "Kokoro" by Soseki Natsume as the test data. kokoro_df2-m50-71z.csv
Please find a sample comparison result in the attached PNG.
Could you kindly show me the detailed calculation of the theta matrix?
Masataka
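For reference, GibbsLDA++-style samplers usually estimate theta from the final topic assignments as theta[m, k] = (nd[m, k] + alpha) / (ndsum[m] + K * alpha), where nd[m, k] is the number of tokens in document m assigned to topic k and ndsum[m] is the document length. The sketch below applies that formula to toy counts. The counts and the alpha value are made up for illustration, and it is an assumption that seededlda and topicmodels both follow this formula exactly; the linked source files are the place to check.

```r
# Toy topic-assignment counts (hypothetical): rows are documents, columns are topics.
nd <- matrix(c(8, 1, 1,
               0, 5, 5),
             nrow = 2, byrow = TRUE)

alpha <- 0.1     # hypothetical smoothing parameter for the document-topic prior
K <- ncol(nd)    # number of topics

# Smoothed share of each topic in each document.
# rowSums(nd) has one value per document, so it is recycled down each column.
theta <- (nd + alpha) / (rowSums(nd) + K * alpha)

print(round(theta, 3))
print(rowSums(theta))  # each row sums to 1
```

Because nd depends on the sampled topic assignments, two samplers that draw different random numbers end up with different counts and therefore different theta values, even with identical input data and alpha.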