snollygoster123123 opened this issue 5 years ago
@snollygoster123123 I remember other users reporting similar issues, since we switched the LDA default precision from double (float64) to single (float32) in #1656.
Can you try this https://github.com/RaRe-Technologies/gensim/issues/217#issuecomment-435539481 and let me know if that helped in any way?
Also, make sure to include all the necessary info here from the issue template: software versions, steps to reproduce, etc.
Just for the record: I also got a very large perplexity value with gensim==3.7.1 (even larger than @snollygoster123123's) when training on a very large corpus (~13.5M documents, 850k-term dictionary, 0.018% density), but I used float64 for training (to avoid potential numerical over/under-flow issues).

Ping @snollygoster123123 @menshikh-iv: are you able to provide a reproducible example? We'll have a look.
@piskvorky no, I can't (for NDA reasons), sorry. I guess you can try to reproduce it with any large corpus (with stats similar to those in my previous message).
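Since the real corpus can't be shared, one starting point for reproduction could be a scaled-down synthetic corpus with a similar density. A sketch (all sizes here are illustrative assumptions, not the reporter's actual numbers, which were far larger):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, vocab = 10_000, 5_000   # scaled down from ~13.5M docs / 850k terms
density = 0.018 / 100           # 0.018% non-zeros, as reported above

# Build a random bag-of-words corpus in gensim's list-of-(id, count) format.
corpus = []
for _ in range(n_docs):
    nnz = rng.binomial(vocab, density)          # expected non-zeros per doc
    ids = rng.choice(vocab, size=max(nnz, 1), replace=False)
    counts = rng.integers(1, 5, size=ids.size)  # small random term counts
    corpus.append(list(zip(ids.tolist(), counts.tolist())))
```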
I'm having this problem with both sklearn and gensim.
I am training LDA on a set of ~17,500 documents. Up to 230 topics it works perfectly fine, but for everything above that the perplexity score explodes.
(Perplexity was calculated as `2 ** (-1.0 * lda_model.log_perplexity(corpus))`, which results in 234599399490.052. Usually my perplexity is around 70-150.)
For the perplexity, I am not using a hold-out set; it is calculated on the same corpus that was used for training. I can upload the LDA model + corpus in the next few minutes.
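For reference, the perplexity formula uses exponentiation (`**`), not multiplication: `log_perplexity()` returns a per-word log-likelihood bound, and perplexity is 2 raised to its negation. A numeric sketch with a hypothetical bound value:

```python
# gensim's lda_model.log_perplexity(corpus) returns a per-word
# log-likelihood bound; perplexity = 2 ** (-bound).
bound = -7.0                      # hypothetical value for illustration
perplexity = 2 ** (-1.0 * bound)  # 2 ** 7 = 128.0, in the normal 70-150 range

# The huge value reported above (~2.3e11) corresponds to a bound of
# roughly -37.8, i.e. the model assigns vanishingly small probability
# to the words in the evaluation corpus.
```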