piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Exploding perplexity for a large number of topics #2443

Open snollygoster123123 opened 5 years ago

snollygoster123123 commented 5 years ago

I am training LDA on a set of ~17,500 documents. Up to 230 topics it works perfectly fine, but for anything above that, the perplexity score explodes.

(Perplexity was calculated as `2 ** (-1.0 * lda_model.log_perplexity(corpus))`, which results in 234599399490.052. Usually my perplexity is around 70-150.)

For the perplexity, I am not using a hold-out set; it is calculated on the same corpus that was used for training. I can upload the LDA model + corpus in the next few minutes.
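
For context, here is a minimal, self-contained sketch of the computation described above (the toy corpus is illustrative, not the reporter's data):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus, purely for illustration.
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

# log_perplexity returns the per-word likelihood bound (log base 2),
# so perplexity = 2 ** (-bound).
bound = lda_model.log_perplexity(corpus)
perplexity = 2 ** (-1.0 * bound)
print(perplexity)
```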

piskvorky commented 5 years ago

@snollygoster123123 I remember other users reporting similar issues, since we switched the LDA default precision from double (float64) to single (float32) in #1656.

Can you try this https://github.com/RaRe-Technologies/gensim/issues/217#issuecomment-435539481 and let me know whether it helps?

Also, make sure to include all the necessary info here from the issue template: software versions, steps to reproduce, etc.
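
The linked workaround boils down to training in double precision. A minimal sketch, assuming a bag-of-words `corpus` and `dictionary` are already built; `dtype` is the parameter whose default changed in #1656:

```python
import numpy as np
from gensim.models import LdaModel

# Train with float64 instead of the float32 default introduced in #1656;
# per this thread, the lower precision is the suspected cause of the
# exploding perplexity at high topic counts.
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=250,
    dtype=np.float64,
)
```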

menshikh-iv commented 5 years ago

Just a note: I also got a very large perplexity value with gensim==3.7.1 (even bigger than @snollygoster123123's) when training on a super-large corpus (13.5M documents, 850k-term dictionary, 0.018% density), but:

piskvorky commented 4 years ago

Ping @snollygoster123123 @menshikh-iv are you able to provide a reproducible example? We'll have a look.

menshikh-iv commented 4 years ago

@piskvorky no, I can't (for NDA reasons), sorry. I guess you can try to reproduce it with any large corpus (with stats similar to those in my previous message).
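
A hypothetical way to approximate such a corpus synthetically (scaled down here; the sizes and term frequencies are illustrative, not from any real dataset):

```python
import numpy as np

# Synthesize a sparse bag-of-words corpus with roughly the stats above:
# many documents over a large vocabulary at ~0.018% density.
rng = np.random.default_rng(0)
num_docs, vocab_size, density = 10_000, 50_000, 0.00018
tokens_per_doc = max(1, int(vocab_size * density))

corpus = [
    [(int(term_id), int(rng.integers(1, 5)))
     for term_id in rng.choice(vocab_size, size=tokens_per_doc, replace=False)]
    for _ in range(num_docs)
]
# `corpus` is now a gensim-style bag-of-words list usable with LdaModel.
```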

Alexander-philip-sage commented 2 years ago

I'm having this problem with both sklearn and gensim.