Open piskvorky opened 13 years ago
Hi M. Řehůřek, thank you for opening an issue for automated dimensionality setting. Pierre-Yves Lafleur (pierreyves.lafleur.1@gmail.com) and I (christian.perreault.2@gmail.com) are currently trying to implement MDL with Gensim and test it. In a few days, we should be able to provide an answer, in our context (small corpora), to your question "Is MDL robust enough, across several corpora?" Has anyone ever created a method to automate the number-of-topics setting with LSA?
Great, getting rid of an extra parameter in LSA would be really cool!
Also note that @dedan added an easy way to select an LSA submodel (train on K topics, but use only L <= K topics for transformations). It is in commit https://github.com/piskvorky/gensim/commit/7711cbd4f8293b1c2b511ea1afd067938cf60f88. You simply set the lsa_model.numTopics attribute to some lower number.
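Since the topic spaces are nested, an L-topic model is literally a slice of the K-topic one. Here is a minimal numpy sketch of that idea (plain SVD standing in for gensim's LsiModel; the names and the toy matrix are just for illustration):

```python
import numpy as np

# Toy term-document matrix: 6 terms x 5 documents.
rng = np.random.default_rng(0)
A = rng.random((6, 5))

# Train once with K topics (truncated SVD).
K = 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K = U[:, :K]

def transform(doc_vec, num_topics):
    """Project a term vector into the first num_topics LSA dimensions."""
    return U_K[:, :num_topics].T @ doc_vec

# Use only L <= K topics for transformations: just slice the
# trained factors -- no retraining needed, the spaces are nested.
L = 2
doc = A[:, 0]
full = transform(doc, K)   # K-dimensional representation
sub = transform(doc, L)    # L-dimensional representation

# The L-topic projection is exactly the first L coordinates
# of the K-topic projection.
assert np.allclose(sub, full[:L])
```

Setting `lsa_model.numTopics` lower does the same kind of slicing on the already-trained factors.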
@cperreault Would you still like to add MDL to gensim?
Yes, I would be interested! I would have to dive back into Gensim. I wrote an automatic number-of-topics "chooser", based on MDL, in my Master's thesis (http://www.theses.ulaval.ca/2013/29936/ - in French!). It is currently a very custom implementation: it analyzes a corpus with an increasing k and stores the results in a MySQL DB. If there is interest, I could contribute time and effort to (try to) implement it in a Gensim-native way.
Oh yes, we're still interested. Thanks!
Depending on what the "analysis" entails, we could even make this the default behaviour. There's no need to re-train models with LSA just to tweak k
(the topic spaces are conveniently nested, as mentioned above), so hopefully this analysis doesn't take much extra time?
@cperreault Hi, can we use the automatic number of topics "chooser" now?
@ljdawn @piskvorky Hi! Sorry for my late response. I have not reworked it, and I don't expect to have time until a few months from now. However, if you think it could be useful, in the short term I can explain the simple principle that guided "my" automatic number-of-topics (k) chooser. (It is explained in my Master's thesis, in French; see the link above.) A given corpus is analyzed (LSA) with an increasing k, starting from k=1. For each k, the distribution of (dis)similarities among documents is calculated, binned in 0.1 increments between 0 and 1. The lowest k that allows most documents to be considered dissimilar (similarity between 0 and 0.1) is chosen. This is not rocket science, I know, but it gave me interesting results across the thousands of corpora I analyzed and, at least, a way to automate analysis and a basis of comparison between corpora. What do you think?
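For concreteness, that principle could be sketched in a few lines of numpy (a toy illustration, not the thesis implementation: the 0.1 cutoff comes from the description above, while the "most documents" majority threshold of 0.5 is an assumption on my part):

```python
import numpy as np

def choose_num_topics(A, max_k=10, sim_cutoff=0.1, majority=0.5):
    """Pick the lowest k for which most document pairs are dissimilar.

    A: term-document matrix (terms x docs).
    sim_cutoff: cosine similarities in [0, sim_cutoff) count as dissimilar.
    majority: fraction of pairs that must be dissimilar (assumed threshold).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k_max = min(max_k, len(s))
    for k in range(1, k_max + 1):
        # Document coordinates in the k-dimensional LSA space.
        docs = (np.diag(s[:k]) @ Vt[:k, :]).T          # docs x k
        norms = np.linalg.norm(docs, axis=1, keepdims=True)
        unit = docs / np.clip(norms, 1e-12, None)
        sims = unit @ unit.T                            # pairwise cosines
        iu = np.triu_indices(unit.shape[0], 1)          # upper triangle
        pair_sims = np.abs(sims[iu])
        if np.mean(pair_sims < sim_cutoff) >= majority:
            return k
    return k_max

rng = np.random.default_rng(1)
A = rng.random((20, 8))        # toy corpus: 20 terms x 8 documents
k = choose_num_topics(A)
```

Retraining per k is avoidable here too, thanks to the nested topic spaces: a single SVD at the largest k is sliced for every smaller one.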
FYI @dsquareindia this is related to your recent work on selecting the number of topics through coherence.
Yes, coherence can surely be used as an automatic "chooser" for LSA as well: we could pick the number of topics with the best coherence out of, say, up to 100. I'll test it out on LSA and get back here. So far I've only tested it on LDA, but I think it'll work even better with LSA.
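As a rough illustration of what such a chooser would do, here is a hand-rolled sketch using the UMass coherence measure (gensim's CoherenceModel is the real tool for this; the `topn`, `max_k`, and per-k averaging choices below are assumptions):

```python
import numpy as np

def umass_coherence(top_words, doc_word):
    """UMass coherence of one topic's ranked top words.

    top_words: indices of the topic's top words, ranked by loading.
    doc_word: boolean docs x vocab matrix (word occurs in doc).
    """
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            co_docs = np.sum(doc_word[:, wi] & doc_word[:, wj])
            wj_docs = max(np.sum(doc_word[:, wj]), 1)
            score += np.log((co_docs + 1) / wj_docs)
    return score

def best_k_by_coherence(A, doc_word, max_k=5, topn=3):
    """Pick the topic count whose topics are most coherent on average."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    best_k, best_score = 1, -np.inf
    for k in range(1, min(max_k, len(s)) + 1):
        scores = []
        for t in range(k):
            # Top words of topic t: largest absolute loadings in U.
            top = np.argsort(-np.abs(U[:, t]))[:topn]
            scores.append(umass_coherence(top, doc_word))
        avg = np.mean(scores)
        if avg > best_score:
            best_k, best_score = k, avg
    return best_k

rng = np.random.default_rng(2)
doc_word = rng.random((30, 12)) > 0.5   # toy boolean docs x vocab
A = doc_word.T.astype(float)             # terms x docs
k = best_k_by_coherence(A, doc_word)
```

One caveat, relevant to the comment below: unlike LDA, LSA topics have signed loadings, so which words to treat as a topic's "top words" for coherence is itself a modelling decision.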
It would be nice to compare coherence to the approach of @cperreault
LSA topics have no interpretation, so I don't think "coherence" (as in, semantically interpretable topics) makes much sense.
Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:
http://www.springerlink.com/content/500651582r310t05/
This means: reproduce Fig. 1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough across several corpora?
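For anyone wanting to experiment before reading the paper: below is a sketch of one classic MDL criterion for choosing the number of components, the Wax-Kailath estimator, which is not necessarily the exact criterion used in the linked article:

```python
import numpy as np

def mdl_num_components(X):
    """Wax-Kailath MDL estimate of the number of components.

    X: data matrix, n samples x p features. Returns the k that
    minimizes the MDL score: a fit term comparing the geometric and
    arithmetic means of the residual eigenvalues, plus a complexity
    penalty that grows with k.
    """
    n, p = X.shape
    # Eigenvalues of the sample covariance, descending.
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    lam = np.clip(lam, 1e-12, None)
    scores = []
    for k in range(p):
        tail = lam[k:]
        geo = np.exp(np.mean(np.log(tail)))
        arith = np.mean(tail)
        fit = -n * (p - k) * np.log(geo / arith)
        penalty = 0.5 * k * (2 * p - k) * np.log(n)
        scores.append(fit + penalty)
    return int(np.argmin(scores))

# Toy check: 3 strong directions buried in isotropic noise.
rng = np.random.default_rng(0)
n, p, true_k = 500, 10, 3
signal = rng.normal(size=(n, true_k)) @ rng.normal(size=(true_k, p)) * 3
X = signal + rng.normal(size=(n, p))
k_hat = mdl_num_components(X)
```

Plotting the per-k scores against k would give a curve in the spirit of the article's Fig. 1; applying it to the Lee corpus is the open question above.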