piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Held-out set log_perplexity is lower than training set log_perplexity #951

Closed vatsan closed 8 years ago

vatsan commented 8 years ago

I am using the LdaMulticore model in version 0.12.4. My training set and held-out set are roughly the same size (about 2,500 documents each). My (filtered) vocabulary size is around 2,500 words, and each document contains on average about 40-50 words.

I am surprised that the log_perplexity() of the held-out set is consistently lower than that of the training set, regardless of my choice of alpha, eta & num_topics. I am using symmetric priors.

What am I missing? I found some existing issues discussing comparisons of gensim's perplexity scores against other LDA implementations, but there have been no updates on those tickets. Has anyone else run into a similar situation?

I am pasting some values of the training-set and held-out-set perplexity for different values of alpha, eta & num_topics below:

num_topics,alpha,eta,log(perplexity_train),log(perplexity_dev)
10,0.5,0.5,-8.05578810641262,-8.153566660372823
20,0.5,0.5,-9.186737152650826,-9.286097266389786
30,0.5,0.5,-10.218488551517328,-10.317789309731793
40,0.5,0.5,-11.190186520987142,-11.284900093515537
50,0.5,0.5,-12.123431638332919,-12.214392917786242
60,0.5,0.5,-13.019738834522023,-13.11235857261153
70,0.5,0.5,-13.8989196959735,-13.99257328306299
80,0.5,0.5,-14.760264790532975,-14.857049031623777
10,0.5,0.1,-7.546771578725856,-7.659529662865446
20,0.5,0.1,-8.120924497200583,-8.25359993286227
30,0.5,0.1,-8.620579027316932,-8.758606302578453
40,0.5,0.1,-9.04628140073807,-9.186449099918537
50,0.5,0.1,-9.436775335665105,-9.578558609978542
60,0.5,0.1,-9.775571799037921,-9.915596663366225
70,0.5,0.1,-10.091845070960177,-10.233599110494001
80,0.5,0.1,-10.38520513965768,-10.524749518386837
10,0.5,0.05,-7.587907037380252,-7.70686734084801
20,0.5,0.05,-8.23313777441332,-8.375780622024195
30,0.5,0.05,-8.744827323542927,-8.89985498027635
40,0.5,0.05,-9.212142167380199,-9.373489620739186
50,0.5,0.05,-9.614209006674834,-9.78531258462419
60,0.5,0.05,-9.971661455009809,-10.14825201495086
70,0.5,0.05,-10.29947707812211,-10.483144957809824
80,0.5,0.05,-10.611826095696227,-10.798550110556791
10,0.5,0.001,-8.460770487765283,-8.592017466587496
20,0.5,0.001,-9.929966041502354,-10.094027612304243
30,0.5,0.001,-11.260223800534899,-11.453057159151603
40,0.5,0.001,-12.529200434015396,-12.752106456930992
50,0.5,0.001,-13.707206712195717,-13.949875495987296
60,0.5,0.001,-14.855015747507343,-15.132441663614582
70,0.5,0.001,-15.976183629644133,-16.286018269723474
80,0.5,0.001,-17.00776504237788,-17.35237967754567
10,0.1,0.5,-7.8459885120031005,-8.029879269855002
20,0.1,0.5,-8.815550152064466,-9.05455119493683
30,0.1,0.5,-9.747054716123388,-10.026883570156237
40,0.1,0.5,-10.656511130527669,-10.982059287314602
50,0.1,0.5,-11.549213885484377,-11.899489959340585
60,0.1,0.5,-12.413738334712365,-12.788089920547769
70,0.1,0.5,-13.284115066405821,-13.68917143530757
80,0.1,0.5,-14.121382051836889,-14.527767049424742
10,0.1,0.1,-7.269756483778563,-7.466747447349475
20,0.1,0.1,-7.613361212836739,-7.896163362340148
30,0.1,0.1,-7.880197914344189,-8.218382152166415
40,0.1,0.1,-8.115458563138109,-8.506379943637805
50,0.1,0.1,-8.353115406513487,-8.788538675150887
60,0.1,0.1,-8.562119684171664,-9.030167390716223
70,0.1,0.1,-8.76051392268029,-9.264152759638048
80,0.1,0.1,-8.962068577159876,-9.505146648330205
10,0.1,0.05,-7.310540914598073,-7.509416154215084
20,0.1,0.05,-7.617314141454936,-7.893225751943659
30,0.1,0.05,-7.8392386634125595,-8.18381205481122
40,0.1,0.05,-8.014973968704291,-8.402075980625863
50,0.1,0.05,-8.202349210074646,-8.628600772187784
60,0.1,0.05,-8.33533204628265,-8.801435825387799
70,0.1,0.05,-8.493404342011992,-9.006494994757754
80,0.1,0.05,-8.629092388314504,-9.169865916958587
10,0.1,0.001,-8.009643551766533,-8.21590487998099
20,0.1,0.001,-8.659188444300915,-8.950150175036804
30,0.1,0.001,-9.119435635195106,-9.46814935851882
40,0.1,0.001,-9.510929195323932,-9.923511086227062
50,0.1,0.001,-9.758203646990683,-10.208789969775623
60,0.1,0.001,-9.983060574099984,-10.462317616327912
70,0.1,0.001,-10.203330030946255,-10.72305434760764
80,0.1,0.001,-10.387846594980651,-10.940199877809821
10,0.05,0.5,-7.803158250405764,-7.989738513805188
20,0.05,0.5,-8.729797712103538,-8.987241364507133
30,0.05,0.5,-9.639084808237332,-9.959022237556468
40,0.05,0.5,-10.520850815015205,-10.880922887572684
50,0.05,0.5,-11.38901969351194,-11.785409136993884
60,0.05,0.5,-12.232986013750292,-12.645762307635298
70,0.05,0.5,-13.083724850157939,-13.521053564054473
80,0.05,0.5,-13.922457128907348,-14.38801963022701
10,0.05,0.1,-7.23665091071829,-7.4438376508186535
20,0.05,0.1,-7.5304230344119745,-7.8399261966731375
30,0.05,0.1,-7.7761934141079605,-8.155549791398604
40,0.05,0.1,-7.988602311551477,-8.421216208492549
50,0.05,0.1,-8.192731122080302,-8.672900953333569
60,0.05,0.1,-8.377575642736197,-8.90671443808983
70,0.05,0.1,-8.546671831663549,-9.111341584518062
80,0.05,0.1,-8.73126977280086,-9.347771810280282
10,0.05,0.05,-7.258449749896921,-7.461386752753941
20,0.05,0.05,-7.532697422060984,-7.8420602307013345
30,0.05,0.05,-7.7069995004999186,-8.080216289232945
40,0.05,0.05,-7.89441887963802,-8.32730700311155
50,0.05,0.05,-8.040360892481006,-8.537102696439304
60,0.05,0.05,-8.16520248202542,-8.703136709895649
70,0.05,0.05,-8.297589286088215,-8.867396026916559
80,0.05,0.05,-8.401872370206869,-9.013616313739979
10,0.05,0.001,-7.974528360600484,-8.200860230828813
20,0.05,0.001,-8.548426493869393,-8.871270952372466
30,0.05,0.001,-8.998439135137314,-9.393148035092693
40,0.05,0.001,-9.26514224155281,-9.722281942532282
50,0.05,0.001,-9.55422450604299,-10.06778516147727
60,0.05,0.001,-9.752800159641572,-10.299470295813775
70,0.05,0.001,-9.944203488210373,-10.537673530618708
80,0.05,0.001,-10.088491502892598,-10.720840117461275
10,0.001,0.5,-7.798879406654353,-8.057516660440688
20,0.001,0.5,-8.679014675896047,-9.041169208534642
30,0.001,0.5,-9.527448355376011,-9.954024570502943
40,0.001,0.5,-10.357266585034788,-10.841431786841525
50,0.001,0.5,-11.182546589931826,-11.729210849244252
60,0.001,0.5,-11.999062306722962,-12.583356561761265
70,0.001,0.5,-12.803102537902664,-13.443721723835974
80,0.001,0.5,-13.605925144268344,-14.23962272542742
10,0.001,0.1,-7.233095293943708,-7.5216106410221215
20,0.001,0.1,-7.474367061927242,-7.939322341318214
30,0.001,0.1,-7.66968124439146,-8.250047510976268
40,0.001,0.1,-7.841258671830454,-8.514581527960434
50,0.001,0.1,-7.994974909003433,-8.762870226644166
60,0.001,0.1,-8.134994297503324,-8.974140845743852
70,0.001,0.1,-8.297793487864784,-9.194032960052313
80,0.001,0.1,-8.413132728443038,-9.364070753406759
10,0.001,0.05,-7.268858684964996,-7.58012931375355
20,0.001,0.05,-7.472919119770675,-7.955066646644858
30,0.001,0.05,-7.622096220972263,-8.212739246380467
40,0.001,0.05,-7.730523941750177,-8.416233677867803
50,0.001,0.05,-7.836673637104144,-8.616614024249463
60,0.001,0.05,-7.931247802886608,-8.781631355127388
70,0.001,0.05,-8.012320451162473,-8.941774492079993
80,0.001,0.05,-8.077813515165493,-9.058632914913689
10,0.001,0.001,-7.934336556422357,-8.237497152199966
20,0.001,0.001,-8.499487516166436,-8.99050219992496
30,0.001,0.001,-8.85296962045634,-9.467645797223211
40,0.001,0.001,-9.107886651307389,-9.818904416016714
50,0.001,0.001,-9.30956975487339,-10.112652945831346
60,0.001,0.001,-9.471736509932931,-10.34446686996294
70,0.001,0.001,-9.619673272248516,-10.562330745415577
80,0.001,0.001,-9.727799224875568,-10.745974383777272
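[Editor's note] For reference, a minimal sketch of the kind of grid sweep that could produce the table above; train_corpus, dev_corpus (bag-of-words corpora) and dictionary (a gensim Dictionary) are hypothetical placeholders, not taken from the reporter's code:

    # Hedged sketch: sweep num_topics, alpha and eta over the same grid as
    # the table above, evaluating log_perplexity() on both corpora.
    from gensim.models import LdaMulticore

    def sweep(train_corpus, dev_corpus, dictionary):
        for alpha in (0.5, 0.1, 0.05, 0.001):        # symmetric doc-topic prior
            for eta in (0.5, 0.1, 0.05, 0.001):      # symmetric topic-word prior
                for num_topics in range(10, 90, 10):
                    lda = LdaMulticore(corpus=train_corpus, id2word=dictionary,
                                       num_topics=num_topics, alpha=alpha, eta=eta)
                    # log_perplexity() returns the per-word likelihood bound,
                    # not log(perplexity) itself (see the discussion below).
                    train_bound = lda.log_perplexity(train_corpus)
                    dev_bound = lda.log_perplexity(dev_corpus)
                    print(num_topics, alpha, eta, train_bound, dev_bound)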

tmylk commented 8 years ago

Replied on the mailing list.

vatsan commented 8 years ago

Thanks for responding, Lev. A couple of points:

1) I had mistakenly assumed that log_perplexity() returns the logarithm of the perplexity. It turns out that the "log" in "log_perplexity" actually refers to writing to a log file. Looking at the code here (https://github.com/RaRe-Technologies/gensim/blob/0c5c5ed9024d8ea89e106ebaf926071b4a3a6654/gensim/models/ldamodel.py), the function actually returns the per-word bound, which is the negative of the logarithm of the perplexity (i.e. perplexity = 2^(-bound)).

2) Given item 1, your explanation makes sense: the lower the magnitude of the per-word bound (whose sign is negative), the lower the perplexity and hence the better the model fit (e.g. -9.72 for the training set vs. -10.74 for the dev set; see the worked example after this list).

3) However, from the sample values I pasted above, it now appears that the perplexity increases as the number of topics increases, for any given value of alpha & eta. This is counter-intuitive: we would generally expect more topics to reduce perplexity. Other readers have pointed this out in other tickets as well.
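[Editor's note] The arithmetic behind item 2, as a small Python sketch. The bound values are the rounded examples quoted above; gensim documents the relationship perplexity = 2^(-bound) for the value returned by log_perplexity():

    bound_train = -9.72    # example training-set bound quoted above (rounded)
    bound_dev = -10.74     # example dev-set bound quoted above (rounded)

    perplexity_train = 2 ** -bound_train   # ~843
    perplexity_dev = 2 ** -bound_dev       # ~1710

    # The less negative bound (training set) gives the lower perplexity,
    # i.e. the better fit under this measure.
    print(perplexity_train, perplexity_dev)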

Can you clarify? Is this perplexity estimate even reliable? It is interfering with model selection. Does gensim officially recommend not using perplexity and instead looking at the coherence measures you pointed out (thanks for that link, I'll take a look)?
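[Editor's note] On the coherence question, a minimal sketch of coherence-based model selection, assuming a gensim version that provides CoherenceModel (this may require a newer release than the 0.12.4 used above); tokenized_texts (a list of token lists), dictionary and corpus are hypothetical placeholders:

    from gensim.models import LdaMulticore
    from gensim.models.coherencemodel import CoherenceModel

    def coherence_for(num_topics, corpus, dictionary, tokenized_texts):
        lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        cm = CoherenceModel(model=lda, texts=tokenized_texts,
                            dictionary=dictionary, coherence='c_v')
        # Unlike perplexity, higher coherence is better, so one would pick
        # the num_topics that maximizes this score.
        return cm.get_coherence()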

tmylk commented 8 years ago

@vatsan This issue is closed. Please continue the discussion on the mailing list.