gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Topic Word Shifts #11

Closed ryanjgallagher closed 5 years ago

ryanjgallagher commented 6 years ago

The total correlation of each topic (and overall) is a sum of additive contributions from individual words. So, given a trained CorEx topic model and two different sets of documents, the difference in TC (either per topic or overall) can be decomposed into a ranked list of the words that contribute most to that difference (see attached document). This yields an interpretable measure of how topical information differs between sets of documents, and lets us draw careful qualitative conclusions when comparing documents, particularly out-of-sample documents.

`get_topic_word_shift(X1, X2, topic_n=None)`

Input
- `X1`, `X2`: doc-term matrices of shapes `n_docs1 x n_words` and `n_docs2 x n_words`, where the columns of each matrix correspond to the same words as the original doc-term matrix used to train the CorEx topic model
- `topic_n`: either `None` or an integer. If `None`, returns the ranked list of words contributing to the TC difference across all topics. If an integer, specifies the single topic within which to compute the TC difference contributions

Output
- `word_contributions`: list of tuples `(word, normalized contribution)`, ranking (from high to low) how much each word contributes to the difference in TC between `X1` and `X2`
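A minimal sketch of the ranking step of the proposed interface, assuming the per-word TC contribution matrices have already been computed (the function name `rank_word_shifts` and its `n_words x n_topics` inputs are hypothetical, not part of the current corex_topic API):

```python
import numpy as np

def rank_word_shifts(tc_words1, tc_words2, words, topic_n=None):
    """Rank words by their contribution to the TC difference between two corpora.

    tc_words1, tc_words2: hypothetical (n_words x n_topics) matrices of
    per-word TC contributions on each document set. Returns a list of
    (word, normalized contribution) tuples, sorted from largest to
    smallest absolute contribution.
    """
    diff = tc_words1 - tc_words2
    if topic_n is None:
        contrib = diff.sum(axis=1)   # aggregate across all topics
    else:
        contrib = diff[:, topic_n]   # restrict to a single topic
    total = np.abs(contrib).sum()
    norm = contrib / total if total > 0 else contrib
    order = np.argsort(-np.abs(norm))
    return [(words[i], norm[i]) for i in order]

# Toy example with made-up contribution matrices (3 words, 2 topics):
words = ['a', 'b', 'c']
tc1 = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
tc2 = np.zeros((3, 2))
shifts = rank_word_shifts(tc1, tc2, words)
```

The normalization here (dividing by the total absolute contribution) is one plausible choice; the attached document may specify a different one.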

Most of the machinery for this function is already in calculate_latent and normalize_latent. The einsums likely just need to be changed a bit, but I've been struggling to work out the right way of doing it. Any help would be appreciated.

corex_topic_word_shift.pdf

ryanjgallagher commented 6 years ago

I made some progress on understanding how to implement this, but I've hit a roadblock.

Notation: n = # words, d = # docs, m = # topics

In calculate_latent(), c0 calculates $\sum_{i=1}^n \alpha_{i,j} \log \frac{p(x_i = 0 \mid y_j = 0)}{p(x_i = 0)}$. If we change the einsum to np.einsum('ji,ij->ij', self.alpha, self.theta[0] - self.lp0) then we get an n x m matrix holding one of these log-probability terms per word. Similarly for c1.
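The einsum change above can be checked on small arrays (dimensions and variable names are illustrative stand-ins for the model's attributes):

```python
import numpy as np

# Hypothetical small dimensions: m topics, n words.
rng = np.random.default_rng(0)
m, n = 3, 5
alpha = rng.random((m, n))       # stand-in for self.alpha (m x n)
log_ratio = rng.random((n, m))   # stand-in for self.theta[0] - self.lp0 (n x m)

# Proposed change: keep the word axis instead of summing over it,
# yielding an n x m matrix of per-word, per-topic terms.
c0_per_word = np.einsum('ji,ij->ij', alpha, log_ratio)

# This is just an elementwise product with alpha transposed.
assert np.allclose(c0_per_word, alpha.T * log_ratio)

# Summing over words recovers the original per-topic c0 vector.
c0_original = np.einsum('ji,ij->j', alpha, log_ratio)
```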

Next, info0 also returns an n x m matrix, where each entry is $\alpha_{i,j} \log \frac{p(x_i = 1 \mid y_j = 0)\, p(x_i = 0)}{p(x_i = 1)\, p(x_i = 0 \mid y_j = 0)}$. This together with c0 forms part of the sparsity optimization for corex_topic. Similarly for info1.

However, the sparsity optimization also multiplies an indicator variable against each element of info0, which is why we have the X.dot(info0) on line 425. But by doing this, we get a d x m matrix which is then combined with the d x m matrix corresponding to d copies of c0 (the original c0, not our modification above). For the current code, this is good because it allows us to estimate the point-wise TC for each doc, but for the topic word shift we want an n x m matrix of the information content of each word.

The issue I'm facing is that I don't know how to get the n x m matrix. In the math it's simply a matter of commuting sums, but I don't see how to implement it from here. We have (a new) c0 (n x m), which we would like to add to info0 (n x m), but info0 is dotted against X (d x n), so the shapes no longer align for the addition. I think the issue is that the entries of theta collapse probabilities over documents, and this needs to be backed out in order to combine with c0.
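One way the sum-commuting could work out in code: since X.dot(info0) sums over words within each document, summing over documents first gives the per-word totals directly. A small numeric check of that identity (all arrays here are made-up stand-ins, not the model's actual values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 4, 5, 3
X = (rng.random((d, n)) < 0.5).astype(float)  # binary doc-term matrix (d x n)
info0 = rng.random((n, m))                    # per-word info terms (n x m)

# Current code path: per-document contributions (d x m).
per_doc = X.dot(info0)

# Commuting the sums: accumulate over documents first, giving an
# n x m matrix of per-word contributions across the document set.
# Equivalent to X.sum(axis=0)[:, None] * info0.
per_word = np.einsum('li,ij->ij', X, info0)

# Both routes give the same total per-topic quantity.
assert np.allclose(per_doc.sum(axis=0), per_word.sum(axis=0))
```

This only addresses the indicator-weighted info0 part; how to back the document collapse out of theta so the new c0 can be combined with it is still the open question.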

Any thoughts on this would be appreciated @gregversteeg (maybe we can discuss this via Skype).

ryanjgallagher commented 5 years ago

Closing for now. I talked about this with @gregversteeg some time ago, and some of the math needs to be specified in a little more detail. It should be possible but will need more thinking.