piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

LSA dimensionality #28

Open · piskvorky opened this issue 13 years ago

piskvorky commented 13 years ago

Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:

http://www.springerlink.com/content/500651582r310t05/

This means: reproduce Fig. 1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough across several corpora?
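The linked article's exact formulation isn't reproduced in this thread, but the general idea of choosing dimensionality by MDL can be sketched with a classic eigenvalue-based MDL criterion (in the spirit of Wax & Kailath); this is a hedged stand-in, and the paper's criterion may differ in detail:

```python
import numpy as np

def mdl_num_components(X):
    """Pick the number of components by minimizing a Wax-Kailath style
    MDL score over the eigenvalues of the sample covariance of X
    (rows = observations, columns = features)."""
    n, p = X.shape
    # Eigenvalues of the sample covariance, in descending order.
    ev = np.linalg.svd(X - X.mean(axis=0), compute_uv=False) ** 2 / n
    ev = np.maximum(ev, 1e-12)          # guard against log(0)
    scores = []
    for k in range(p):                  # k = hypothesized signal dimensionality
        tail = ev[k:]                   # eigenvalues attributed to noise
        m = p - k
        geo = np.exp(np.mean(np.log(tail)))   # geometric mean of the tail
        arith = np.mean(tail)                 # arithmetic mean of the tail
        loglik = -n * m * np.log(geo / arith) # 0 iff tail is perfectly flat
        penalty = 0.5 * k * (2 * p - k) * np.log(n)
        scores.append(loglik + penalty)
    return int(np.argmin(scores))
```

On synthetic data with a clear low-rank signal plus isotropic noise, the minimizer recovers the true rank; on real corpora (like Lee) the curve is what would need inspecting, per the issue.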

cperreault commented 13 years ago

Hi M. Řehůřek, thank you for opening an issue for automated dimensionality setting. Pierre-Yves Lafleur (pierreyves.lafleur.1@gmail.com) and I (christian.perreault.2@gmail.com) are currently trying to implement MDL with Gensim and test it. In a few days, we should be able to provide an answer, in our context (small corpora), to your question "Is MDL robust enough, across several corpora?" Has anyone ever created a method to automate the number-of-topics setting with LSA?

piskvorky commented 13 years ago

Great, getting rid of an extra parameter in LSA would be really cool!

Also note that @dedan added an easy way to select an LSA submodel (train on K topics, but use only L <= K topics for transformations). It is in commit https://github.com/piskvorky/gensim/commit/7711cbd4f8293b1c2b511ea1afd067938cf60f88 . You simply set the lsa_model.numTopics attribute to some lower number.
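The nesting mentioned here falls straight out of the SVD: the first L left singular vectors of a K-topic model are exactly the L-topic model, so truncation is free. A minimal NumPy sketch of the idea (plain SVD standing in for gensim's LsiModel; the numTopics truncation in the linked commit does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 30))   # stand-in terms x documents matrix

K, L = 10, 4                     # train with K topics, use only L <= K
U, s, Vt = np.linalg.svd(A, full_matrices=False)

doc = A[:, 0]                    # one document, as a term vector
vec_K = U[:, :K].T @ doc         # representation in the K-topic space
vec_L = U[:, :L].T @ doc         # representation in the L-topic space

# The L-topic representation is just the first L coordinates of the
# K-topic one: the topic spaces are nested, so no retraining is needed.
assert np.allclose(vec_K[:L], vec_L)
```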

tmylk commented 8 years ago

@cperreault Would you still like to add MDL to gensim?

cperreault commented 8 years ago

Yes, I would be interested! I would have to dive into Gensim again. I wrote an automatic number-of-topics "chooser", based on MDL, in my Master's thesis (http://www.theses.ulaval.ca/2013/29936/ - in French!). It is currently a very custom implementation: it analyzes a corpus with an increasing k and stores results in a MySQL DB. If there is interest, I could contribute time and effort to (try to) implement it in a Gensim-native way.

piskvorky commented 8 years ago

Oh yes, we're still interested. Thanks!

Depending on what the "analysis" entails, we could even make this the default behaviour. There's no need to re-train models with LSA just to tweak k (the topic spaces are conveniently nested, as mentioned above), so hopefully this analysis doesn't take much extra time?

ljdawn commented 7 years ago

@cperreault Hi, can we use the automatic number-of-topics "chooser" now?

cperreault commented 7 years ago

@ljdawn @piskvorky Hi! Sorry for my late response. I have not reworked it and don't expect to have time until a few months from now. However, if you think this could be useful, in the short term I can explain the simple principle that guided "my" automatic number-of-topics (k) chooser. (It is explained in my Master's thesis, in French; see the link above.) A given corpus is analyzed (LSA) with an increasing k, starting from k=1. For each k, the distribution of (dis)similarities among documents is calculated and binned at every 0.1 between 0 and 1. The lowest k that allows most documents to be considered dissimilar (similarity between 0 and 0.1) is chosen. This is not rocket science, I know. It gave me interesting results across the thousands of corpora I analyzed and, at least, provided a way to automate analysis and a baseline for comparing corpora. What do you think?
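The principle described above can be sketched roughly as follows. The function name, the "most documents" majority threshold, and the clipping of negative cosine similarities to 0 are illustrative assumptions here, not the thesis implementation:

```python
import numpy as np

def choose_num_topics(A, threshold=0.1, majority=0.5):
    """Sketch of the chooser: analyze with increasing k and return the
    lowest k for which most document pairs are 'dissimilar', i.e. their
    cosine similarity falls in [0, threshold].  A is terms x documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    n_docs = A.shape[1]
    iu = np.triu_indices(n_docs, k=1)          # each document pair once
    for k in range(1, len(s) + 1):
        docs = (s[:k, None] * Vt[:k, :]).T     # documents in k-dim LSA space
        norms = np.linalg.norm(docs, axis=1)
        unit = docs / np.where(norms == 0, 1.0, norms)[:, None]
        # Clip negative cosine similarities to 0 so everything lands in
        # [0, 1], matching the description above (an assumption).
        sims = np.clip(unit @ unit.T, 0.0, 1.0)
        if np.mean(sims[iu] <= threshold) >= majority:
            return k
    return len(s)
```

A real version would presumably build the per-k document vectors through gensim's LSI transformation rather than a raw SVD, but the selection rule would be the same.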

tmylk commented 7 years ago

FYI @dsquareindia this is related to your recent work on selecting the number of topics through coherence.

devashishd12 commented 7 years ago

Yes, coherence can surely be used as an automatic "chooser" for LSA as well. We could choose the number of topics with the best coherence, scanning up to 100 topics. I'll test it out on LSA and get back here. Currently I've only tested it on LDA, but I think it'll work even better with LSA.
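The selection loop described here is just a grid search over k. A generic sketch, where `train_model` and `coherence` are placeholder callables (in gensim they would be something like an LsiModel/LdaModel constructor and a coherence measure):

```python
def select_num_topics(train_model, coherence, candidate_ks):
    """Train one model per candidate k and keep the k whose model
    scores highest on the supplied coherence measure."""
    best_k, best_score = None, float("-inf")
    for k in candidate_ks:
        score = coherence(train_model(k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy stand-ins, purely to exercise the loop: pretend the coherence
# curve peaks at k = 7.
k, score = select_num_topics(lambda k: k, lambda m: -(m - 7) ** 2,
                             range(1, 101))
```

Note this retrains a model per k; for LSA specifically, the nested-topic-space trick mentioned earlier in the thread means one K=100 model could be truncated instead of retrained.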

tmylk commented 7 years ago

It would be nice to compare coherence to @cperreault's approach.

piskvorky commented 7 years ago

LSA topics have no interpretation, so I don't think "coherence" (as in, semantically interpretable topics) makes much sense.