get_topics() returns words that appear to be out of order in terms of MI

jay-reynolds commented 4 years ago

Hi (and thank you for this really cool and interesting work)

I've got a situation where topic words appear to be out of order, except for the first topic.

For example, with 2 anchored topics, the first topic words returned are listed with MI sorted in decreasing order, as expected.

However, for the second topic, MI decreases, but then increases again.

...looking at get_topics() it's not clear how this could happen -- the code looks right, and I'm not aware of any strange issues with np.argsort().

Any ideas what I should check next? Is this expected behavior in certain instances?

ryanjgallagher commented 4 years ago

Hi Jay,

Would you be able to give us an example of that output that you're seeing? In the second topic, are the words with decreasing MI any of your anchored words?

jay-reynolds commented 4 years ago

Hi Ryan,

Yes, it looks like the dip in value occurs with anchor terms. Also: my mistake, it occurs in the first topic as well, now that I'm checking against anchor terms per your suggestion.

Here's an example:

0.15946728764627086 -->IS ANCHOR
0.15608121178133852 -->IS ANCHOR
0.14718074679769955 -->IS ANCHOR
0.11207315135684945 -->IS ANCHOR
0.1023825070982882 -->IS ANCHOR
0.18684972683935855
0.09062135517129219 -->IS ANCHOR
0.08225138258348932 -->IS ANCHOR
0.16363234856253925
0.079581158645221 -->IS ANCHOR
0.07883072040167022 -->IS ANCHOR
0.07735422616663093 -->IS ANCHOR
0.07269341731528123 -->IS ANCHOR
0.07252414171551494 -->IS ANCHOR
0.14396664015145097

Is there anything I should mind when interpreting? I notice that if I scale the anchor values by the anchor_strength, ( 2 here ) things fall into place.

Incidentally, and aside from the original question, my anchor term lists are longer than the examples I've come across -- I'm wondering if at some point in growing the anchor term list I'm abusing the intent of the method.

Thank you!

ryanjgallagher commented 4 years ago

Ah yes, I took a look and I see what's happening. When we sort the words for the topics we (correctly) use both the mutual informations and the alphas (the anchor strengths). But when they're printed out, only the mutual informations are shown.

@gregversteeg Should I change the code so that it shows the alpha * MI for each word? Or just keep it as MI?

For your other question, I don't think it's necessarily abusing the method; we often use less anchor words in our examples so that it's clear what's happening. If you want to use longer anchor word lists (like in the 13 you have above for example), I'd just keep two things in mind.

An anchor strength of alpha = 2 means that the topic model should give twice as much weight to the MI of an anchor word than any other word. So the more anchors that you use, the lower you should set alpha (still greater than 1 though), so that your topic model isn't only capturing information about your anchor words and not as much about other topic words.
Related, if you have a lot of anchors, and you want to look at the top words for each topic, you should probably look at more than the top 10 words that are shown by default (which I think you're already doing), since it's easier for anchor words to be top words (for the reason in point 1).

jay-reynolds commented 4 years ago

Makes sense, thank you for the explanation and insight.

Your items 1) & 2) agree with and explain my experience and sense of working with the increasing addition of anchor terms I mentioned earlier. ( I'll take these observations to issue #16 you filed regarding more flexible setting of anchoring terms, as I may have (or make) time to work on this ).

In the course of experimentation with anchor terms yesterday evening, I encountered another case that I think fits this thread. In some instances, negative values are returned. I've been reading up, but still quite uncertain how to interpret this.

Here's an excerpt:

TOPIC 5
0.6752522308191555 -->IS ANCHOR
0.4987313873619939 -->IS ANCHOR
0.4987313873619939 -->IS ANCHOR
-0.059820716153193426 -->IS ANCHOR
0.03969236440544513 -->IS ANCHOR
0.01840579018783036
0.017349679888677933
0.014188547657335066
0.011645922876488609
0.010688759072916202

TOPIC 6
0.5663189370326036 -->IS ANCHOR
0.5663189370326036 -->IS ANCHOR
0.43218635121148813 -->IS ANCHOR
0.083498998024938 -->IS ANCHOR
-0.033314786001531815
-0.006592784292562272 -->IS ANCHOR
-0.032928230504389325
-0.028302226929539296
-0.028302226929539296
-0.02767390215655369

ryanjgallagher commented 4 years ago

Right, so MI is never negative. Here the "negative" MI actually means that the absence of a word is what is informative in that topic. @gregversteeg may have more to say on what that means conceptually.

I've gone back and forth about whether "negative" MI is a good way of indicating this and I'm thinking maybe not, now that it seems like someone other than me is looking closely at them. The absence of the word still has positive MI, so it's the absolute value of what's shown. We used to indicate a word was informative via its absence by appending "~" in front of the word. Would that help more with the code you're running?

jay-reynolds commented 4 years ago

Ah, that's intriguing.

I can work with the negative value on my end, no problem. It's probably more convenient for me to use as-is, now that it's more clear what the intent is.

As for the ~ , maybe a flag for get_topics() to use it or not would be handy.

gregversteeg commented 4 years ago

Sorry to be late to the discussion. Thanks for all good points @ryanjgallagher. alpha * mi is a little more abstract, hard to describe as a ranking system. Even though it has been a more useful way to rank in some cases.

MI is a reasonable solution. Another possibility would be to just expose some sort of correlation measure (which could be positive or negative), and use that when needed. In both cases, I think it's better not to change the existing code much, but at most to add an attribute for correlation and "weighted MI" for ranking.

jay-reynolds commented 4 years ago

Continuing investigation in the spirit of this thread, I've stumbled onto a case with all anchor terms being informative by their absence.

It was surprising at first, so figured I'd share from a user-perspective.

In this instance, the domain-expert I'm working with put together a list of anchor terms that represented specific variants of a phenomenon. As a list, it's conceptually coherent.

However, my corpus doesn't ever discuss these variants inclusively in any document -- they aren't compared, contrasted, nor are various instances referred to incidentally...

Rather, all of the variant words exhibit very strong spatial location/area dependency, such that a given document describing a single location/area on the planet will only mention a single variant.

To check my assumption as to what was going on, using each of the variant terms as individual topic anchors produced expected results: the topics agreed with user-intent.

ryanjgallagher commented 3 years ago

I've made changes to how the topics are returned to address the confusions that came up here.

a213671: the mutual information is now always returned as positive, and the sign of the word indicating whether the presence or absence of a word is informative about a topic is now an output of get_topics() too. So CorEx will return the word, its MI, and its sign as a 3-tuple rather than just a word and its MI as a 2-tuple
a355ab8: there is now an option of weighted_rank, which is what CorEx had been doing under the hood already. But now CorEx will output alpha*MI when weighted_rank=True rather than always returning just MI

gregversteeg / corex_topic

get_topics() returns words that appear to be out of order in terms of MI #35