I have seen cases of calculating perplexity on the training set in the literature, too. But yes, you may leave out such a subset for further calculations. When doing this, use the topics-vs-words probability matrix from the trained model.
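As a minimal sketch of what such a held-out computation could look like (the helper and the names `phi`, `theta`, `counts` are illustrative, not part of the bitermplus API):

```python
import numpy as np

def heldout_perplexity(phi, theta, counts):
    """Perplexity of held-out documents.

    phi    -- topics x words probability matrix from the trained model
    theta  -- held-out docs x topics probability matrix
    counts -- held-out docs x words term-count matrix (dense numpy array)
    """
    # p(w | d) = sum_z p(z | d) * p(w | z)
    p_wd = theta @ phi
    log_likelihood = np.sum(counts * np.log(p_wd + 1e-12))
    return np.exp(-log_likelihood / counts.sum())

# Toy example: 2 topics, 3 words, 1 held-out document
phi = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.1, 0.8]])
theta = np.array([[0.6, 0.4]])
counts = np.array([[2, 1, 3]])
print(heldout_perplexity(phi, theta, counts))
```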
Bitermplus differs a bit from the original implementation: it uses CountVectorizer from sklearn (with default parameters) to preprocess data and random generators from numpy for reproducibility.
Thanks. When you mention preprocessing the data with CountVectorizer, do you mean converting tokens to vectors that represent the corpus? Indeed, this does result in different representations of each document (compared with the C++ code). However, would this still lead to quite different topics (top tokens for each topic) after, e.g., 500 iterations? Testing on the test dataset from this repo, it also seems to produce different results from the C++ version.
Yes, CountVectorizer computes a matrix of term counts over all documents with some preprocessing (active by default).
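For reference, a small self-contained illustration of those defaults (the toy documents below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Topic models find latent topics",
        "The biterm topic model works on short texts"]

# Defaults: lowercasing and token_pattern r"(?u)\b\w\w+\b",
# i.e. tokens of at least two word characters.
vec = CountVectorizer()
X = vec.fit_transform(docs)            # sparse docs x terms count matrix

print(X.shape)
print(vec.get_feature_names_out())     # the extracted vocabulary
```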
I have just run bitermplus and the original BTM and got pretty close results. Yes, the top words differ a bit in most topics, but this is absolutely normal as topic modeling is partially a stochastic process.
I used the SearchSnippets dataset with 8 topics, 2000 iterations, beta = 0.01, and alpha = 50/8. Yet, I see that CountVectorizer from sklearn gives a somewhat smaller vocabulary: 4703 vs 4720 terms (BTM). This may slightly influence the results.
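One way to check where that vocabulary gap comes from is to compare the default tokenization (which lowercases and drops single-character tokens) against a more permissive pattern. This is only a guess at the cause, and `texts` below is a stand-in for the actual corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus; replace with the SearchSnippets documents.
texts = ["a short snippet about c programming",
         "another snippet about web search engines"]

def vocab_size(texts, **kwargs):
    vec = CountVectorizer(**kwargs)
    vec.fit(texts)
    return len(vec.vocabulary_)

print(vocab_size(texts))                                # default settings
print(vocab_size(texts, token_pattern=r"(?u)\b\w+\b"))  # keep 1-char tokens
```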
I have two questions regarding this model. First, I noticed that the perplexity evaluation metric is implemented. However, perplexity is traditionally computed on a held-out dataset. Does that mean that when using this model, we should leave out a certain proportion of the data and compute perplexity on those samples that have not been used for training?

My second question: I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?