gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Can't extract topics from simple example, unless I add a constant row #37

Closed mdneuzerling closed 3 years ago

mdneuzerling commented 4 years ago

Thank you for this package. I'm keen to explore it further.

I created a simple array to test whether I could extract two topics (unsupervised) where it's clear to the human eye what the two topics should be. My code is below, with 10 documents and 4 words. I find that 9 times out of 10, CorEx yields a single topic containing all four words and throws an error because the second topic has 0 words. The same happens if I duplicate the data or add additional words.

However, if I add a [1, 1, 1, 1] row to the array, interpreted as a document containing every word, things stabilise: CorEx correctly extracts the two topics ['tiger', 'bear'] and ['carrot', 'tomato'] every time.

Do you know what might be going on here? From what I can tell, the code initialises with word frequencies, so possibly it struggles with frequencies of 0 in small data sets?

import numpy as np
from corextopic import corextopic as ce

simple_data = np.array(
  [[1, 1, 0, 0],
   [1, 0, 0, 0],
   [0, 1, 0, 0],
   [1, 1, 0, 0],
   [1, 1, 0, 0],
   [0, 0, 1, 1],
   [0, 0, 1, 0],
   [0, 0, 0, 1],
   [0, 0, 1, 1],
   [0, 0, 1, 1]]
)

simple_corex_model = ce.Corex(n_hidden=2)
simple_corex_model.fit(
    X=simple_data,
    words=["bear", "tiger", "carrot", "tomato"],
    docs=["animal", "animal", "animal", "animal", "animal",
          "food", "food", "food", "food", "food"]
)

topics = simple_corex_model.get_topics()
for topic_n, topic in enumerate(topics):
    words, mis = zip(*topic)
    print(f"{topic_n + 1}: {','.join(words)}")

ryanjgallagher commented 4 years ago

I believe this is expected behavior, though our code should probably catch the error so it exits more elegantly.

This is a case where we have to shift our thinking about what a "topic" is a little bit. In CorEx, a "topic" is a latent random variable that makes the words conditionally independent. So CorEx learns topics such that, once you condition on them, all of the dependencies between the words are explained. In the case of your documents, "bear" and "tiger" almost always show up exclusively together. Importantly though, we also know that "carrot" and "tomato" do not appear whenever "bear" and "tiger" do. So to explain all the relationships between the words, we only need 1 latent random variable: whenever the "bear" and "tiger" random variables are "on" (X_bear = 1, X_tiger = 1), we also have that the "carrot" and "tomato" random variables are off (X_carrot = 0, X_tomato = 0), and vice versa.

That's a really high level answer. @gregversteeg can probably explain it a bit more intuitively, I remember running into this same question and he helped me through it.
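The anti-correlation described above can be checked directly on the toy array with plain numpy. The sketch below (variable names are my own, not part of the CorEx API) builds one candidate binary latent variable, "does the document contain any animal word?", and shows that conditioning on it already accounts for both word groups at once, which is why a single topic suffices:

```python
import numpy as np

simple_data = np.array(
    [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
)

# One candidate latent variable: does the document contain any "animal" word
# (column 0 = bear, column 1 = tiger)?
z = simple_data[:, :2].any(axis=1).astype(int)

# Whenever z == 1 the food columns are all zero, and vice versa, so this
# single variable determines both word groups simultaneously.
animal_docs = simple_data[z == 1]
food_docs = simple_data[z == 0]
print(animal_docs[:, 2:].sum())  # food words in animal docs: 0
print(food_docs[:, :2].sum())    # animal words in food docs: 0
```

With the groups perfectly anti-correlated like this, there is no residual dependence left for a second latent variable to explain, which matches the empty second topic in the error above.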

mdneuzerling commented 4 years ago

That makes a lot of intuitive sense, thank you! I do wonder if adding that dummy row of all 1's would be beneficial in data sets in which we can expect a lot of presence/absence relationships, like in my example above. I'll have to do some more testing.
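For reference, the dummy-row construction I mean is just appending an all-ones document before fitting; it breaks the perfect anti-correlation between the two word groups. A minimal sketch of the construction only (whether it helps in general is untested):

```python
import numpy as np

simple_data = np.array(
    [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
)

# Append one document containing every word; the all-ones row means the
# two word groups no longer perfectly exclude each other.
augmented = np.vstack([simple_data, np.ones((1, 4), dtype=int)])
print(augmented.shape)  # (11, 4)
```

The augmented array would then be passed to ce.Corex(n_hidden=2).fit(...) exactly as in my original example (with a corresponding extra entry in docs).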

Thank you for taking the time to explain this.