gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
626 stars 119 forks

Not getting enough topics #7

Closed cgreenberg closed 5 years ago

cgreenberg commented 7 years ago

I tried running corex_topic on a training matrix of size approximately 100,000 x 10,000. I ran CorEx with n_hidden=1000 and max_iter=1000, but only about 200 of the 1,000 topics were non-empty. This could be a symptom of my data, of course (perhaps there really ARE only 200 topics), but are there other parameters that could be tuned to generate many more? Thanks.

gregversteeg commented 7 years ago

Hi Charles, This is an interesting limit that we haven't explored at all. I think the issue is that the structure-learning part (alpha_{i,j}, which connects words to topics) is a bit aggressive about quickly finding partitions. There's a softmax function that updates alpha to try to put each word/variable in only one group. The variable self.t in the code controls how quickly we anneal the softmax toward a hard max. You can see in this line that we increase the magnitude of self.t by multiplying it by 1.3 at each step.

You could try changing this 1.3 to 1.1 and then increasing max_iter to some larger number like 500 or 1000. self.t = np.where(sa > 1.1, 1.3 * self.t, self.t) -Greg
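To see why a gentler multiplier calls for a larger max_iter, here is a toy sketch (a hypothetical helper, not part of corex_topic) of how many update steps a geometric annealing schedule for self.t needs to reach the same hard-max regime:

```python
import math

def iterations_to_reach(t_target, factor, t0=1.0):
    """Count updates of t (multiplied by `factor` each step, as in the
    annealing line above) until it reaches t_target."""
    t, steps = t0, 0
    while t < t_target:
        t *= factor
        steps += 1
    return steps

# Reaching the same "temperature" takes roughly 2.7x as many steps at 1.1:
fast = iterations_to_reach(100.0, 1.3)  # 18 steps
slow = iterations_to_reach(100.0, 1.1)  # 49 steps
```

So if the 1.3 multiplier converged within the default iteration budget, the 1.1 multiplier plausibly needs max_iter increased by a similar factor to anneal fully.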


cgreenberg commented 7 years ago

Thanks @gregversteeg , I tried changing that line to: self.t = np.where(sa > 1.1, 1.1 * self.t, self.t). However, after re-running (with max_iter=1000) I actually got fewer topics, about 160 instead of the roughly 200 before (hard to say whether that's due to finding a different local optimum). Do you have any other suggestions? Thanks.

gregversteeg commented 7 years ago

Hi Charles, Sorry, this is a bit of uncharted territory. I'm sure it would help to add more features/words, but that might not be feasible. You could also try replacing t = (1 + self.t * np.abs(tcs).reshape((self.n_hidden, 1))) with just t = 1. I don't have any other great ideas, and I'm not sure this will help either.

It could be that if you want such fine-grained clusters, a different approach would work better. I would try something like hierarchical clustering (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), using a "precomputed" metric such as the mutual information between features/columns/words.

-Greg


devanshuDesai commented 6 years ago

@cgreenberg Were you able to achieve more clusters with t = (1 + self.t * np.abs(tcs).reshape((self.n_hidden, 1)))?