maximtrp / bitermplus

Biterm Topic Model (BTM): modeling topics in short texts
https://bitermplus.readthedocs.io/en/stable/
MIT License

Questions regarding Perplexity and Model Comparison with C++ #16

Closed orpheus92 closed 3 years ago

orpheus92 commented 3 years ago

I have two questions regarding this model. First, I noticed that the evaluation metric perplexity has been implemented. However, perplexity is traditionally computed on a held-out dataset. Does that mean that, when using this model, we should leave out a certain proportion of the data and compute the perplexity on those samples that were not used for training?

My second question: I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which parts were implemented differently?

maximtrp commented 3 years ago

I have seen cases of calculating perplexity on the training set in the literature, too. But yes, you may hold out such a subset for further calculations. When doing this, use the topics-vs-words probability matrix from the trained model.
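For reference, a minimal sketch of the held-out approach (assuming an already-cleaned, lowercased, whitespace-tokenized corpus in a placeholder file `texts.txt`; the split ratio and hyperparameters are arbitrary, and the API calls follow the package README):

```python
import numpy as np
import bitermplus as btm
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus file: one preprocessed document per line
with open("texts.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f]

# Hold out ~20% of the documents for evaluation
rng = np.random.default_rng(12321)
idx = rng.permutation(len(texts))
split = int(0.8 * len(texts))
train_texts = [texts[i] for i in idx[:split]]
test_texts = [texts[i] for i in idx[split:]]

# Fit the model on the training subset only
X_train, vocabulary, vocab_dict = btm.get_words_freqs(train_texts)
docs_vec = btm.get_vectorized_docs(train_texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
model = btm.BTM(X_train, vocabulary, T=8, M=20, alpha=50/8, beta=0.01, seed=12321)
model.fit(biterms, iterations=500)

# Drop out-of-vocabulary tokens from the held-out documents, since the
# trained model only knows the training vocabulary (defensive assumption)
test_texts = [" ".join(w for w in doc.split() if w in vocab_dict)
              for doc in test_texts]

# Infer topic distributions for the held-out documents and compute
# perplexity from the topics-vs-words matrix of the trained model
X_test = CountVectorizer(vocabulary=vocab_dict).transform(test_texts)
test_vec = btm.get_vectorized_docs(test_texts, vocabulary)
p_zd_test = model.transform(test_vec)
print(btm.perplexity(model.matrix_topics_words_, p_zd_test, X_test, 8))
```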

Bitermplus differs a bit from the original implementation: it uses CountVectorizer from sklearn (with default parameters) to preprocess the data, and NumPy random generators for reproducibility.

orpheus92 commented 3 years ago

Thanks. When you mention preprocessing the data with CountVectorizer, do you mean converting tokens to vectors that represent the corpus? Indeed, this does result in different representations of each document (compared with the C++ code). However, would this still lead to quite different topics (top tokens for each topic) after, e.g., 500 iterations? Testing on the test dataset from this repo, it also seems to produce different results from the C++ version.

maximtrp commented 3 years ago

Yes, CountVectorizer computes a matrix of term counts over all documents, with some preprocessing enabled by default.
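For illustration, these defaults include lowercasing and the token pattern `r"(?u)\b\w\w+\b"`, which keeps only tokens of two or more word characters; this is one reason the vocabulary and document representations can differ from the original C++ preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["A cat sat on a mat", "the cat, the MAT!"]
vec = CountVectorizer()  # default parameters, as used by bitermplus
X = vec.fit_transform(docs)

# Lowercasing and the default token pattern drop single-character
# tokens ("A"/"a") and strip punctuation
print(vec.get_feature_names_out())
# ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 1 1 1 0]
#  [1 1 0 0 2]]
```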

I have just run bitermplus and the original BTM and got pretty close results. Yes, the top words differ a bit in most topics, but this is absolutely normal, as topic modeling is partly a stochastic process.

I used the SearchSnippets dataset with 8 topics, 2000 iterations, beta = 0.01, and alpha = 50/8. However, I see that CountVectorizer from sklearn gives a somewhat smaller vocabulary: 4703 vs 4720 terms (original BTM). This may slightly influence the results.
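For a side-by-side check, a sketch of that setup with bitermplus (the dataset path is a placeholder; SearchSnippets ships with the original BTM code, and the top-words helper is the one from the package docs):

```python
import bitermplus as btm

# Placeholder path: SearchSnippets corpus, one document per line
with open("searchsnippets.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f]

# Vocabulary and document-term matrix via sklearn's CountVectorizer
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Same settings as above: 8 topics, alpha = 50/8, beta = 0.01, 2000 iterations
model = btm.BTM(X, vocabulary, T=8, M=20, alpha=50/8, beta=0.01, seed=12321)
model.fit(biterms, iterations=2000)

# Top words per topic for comparison against the C++ output
print(btm.get_top_topic_words(model, words_num=10))
```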