ryanjgallagher closed this issue 5 years ago
I made some progress on understanding how to implement this, but I've hit a roadblock.
Notation: n = # words, d = # docs, m = # topics
In `calculate_latent()`, `c0` calculates $\sum_{i=1}^n \alpha_{i,j} \log p(x_i = 0 \mid y_j = 0) / p(x_i = 0)$. If we change the einsum to `np.einsum('ji,ij->ij', self.alpha, self.theta[0] - self.lp0)`, then we get an n x m matrix of these probabilities, one entry per word. Similarly for `c1`.
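A minimal sketch of that shape change, using random stand-ins for `self.alpha` and `self.theta[0] - self.lp0` (the small values of n and m here are hypothetical): dropping the sum over the word index `i` in the einsum turns the per-topic reduction into an n x m matrix whose column sums recover the original per-topic values.

```python
import numpy as np

# Shapes from the issue's notation: n = # words, m = # topics (toy values)
n, m = 5, 3
rng = np.random.default_rng(0)

alpha = rng.random((m, n))           # stand-in for self.alpha (topics x words)
log_ratio = rng.normal(size=(n, m))  # stand-in for self.theta[0] - self.lp0 (words x topics)

# Original-style einsum: sums over the word index i, giving one value per topic j
per_topic = np.einsum('ji,ij->j', alpha, log_ratio)

# Modified einsum from the comment: keep the word index in the output,
# yielding an n x m matrix with one entry per (word, topic) pair
per_word_topic = np.einsum('ji,ij->ij', alpha, log_ratio)

assert per_topic.shape == (m,)
assert per_word_topic.shape == (n, m)
# Summing the per-word matrix over words recovers the per-topic values
assert np.allclose(per_word_topic.sum(axis=0), per_topic)
```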
Next, `info0` also returns an n x m matrix, where each entry is $\alpha_{i,j} \log \frac{p(x_i = 1 \mid y_j = 0)\, p(x_i = 0)}{p(x_i = 1)\, p(x_i = 0 \mid y_j = 0)}$. This together with `c0` forms part of the sparsity optimization for corex_topic. Similarly for `info1`.
However, the sparsity optimization also multiplies an indicator variable against each element of `info0`, which is why we have the `X.dot(info0)` on line 425. But by doing this, we get a d x m matrix, which is then combined with the d x m matrix corresponding to d copies of the original `c0` (not our modification above). For the current code this is fine, because it lets us estimate the pointwise TC for each document, but for the topic word shift we want an n x m matrix of the information content of each word.
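The shape mismatch can be seen concretely with random stand-ins (all shapes here are toy values, and `c0_new` stands in for the modified n x m `c0` described above): once `info0` has been dotted against `X`, the result lives in document space and can no longer be added to a word-space matrix.

```python
import numpy as np

# Hypothetical small shapes: d docs, n words, m topics (d != n on purpose)
d, n, m = 4, 5, 3
rng = np.random.default_rng(1)

X = rng.integers(0, 2, size=(d, n))   # binary doc-term indicator matrix
info0 = rng.normal(size=(n, m))       # per-(word, topic) information terms
c0_new = rng.normal(size=(n, m))      # stand-in for the modified n x m c0

doc_info = X.dot(info0)               # d x m: per-document, per-topic totals
assert doc_info.shape == (d, m)

# The n x m matrix cannot be added to the d x m result: shapes are incompatible
try:
    doc_info + c0_new
except ValueError:
    pass  # broadcasting fails, as described in the comment
```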
The issue I'm facing is that I don't know how to get the n x m matrix. In the math it's simply a matter of commuting sums, but I don't see how to implement it from here. We have (a new) `c0` (n x m), which we would like to add to `info0` (n x m), but `info0` is dotted against `X` (d x n), making the addition impossible. I think the issue is that the entries of `theta` collapse probabilities over documents, and this needs to be backed out in order to combine with `c0`.
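For what it's worth, the commuting-sums identity itself is easy to verify numerically with random stand-ins: since `X.dot(info0)` sums `X[d, i] * info0[i, j]` over words, summing the result over documents is the same as first collapsing `X` over documents and keeping an n x m matrix of per-word contributions. This only illustrates the identity; it is not a claim about how the document collapse in `theta` should actually be backed out in the corex_topic code.

```python
import numpy as np

# n = # words, d = # docs, m = # topics (toy values)
d, n, m = 6, 5, 3
rng = np.random.default_rng(2)

X = rng.integers(0, 2, size=(d, n))   # doc-term indicators
info0 = rng.normal(size=(n, m))       # per-(word, topic) terms

# Current code path: sum over words first (d x m), then over documents
total_via_docs = X.dot(info0).sum(axis=0)        # length-m vector

# Commuting the sums: collapse documents first, keeping an n x m matrix
# of per-word contributions, then sum over words
word_contrib = X.sum(axis=0)[:, None] * info0    # n x m
total_via_words = word_contrib.sum(axis=0)       # length-m vector

assert np.allclose(total_via_docs, total_via_words)
```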
Any thoughts on this would be appreciated @gregversteeg (maybe we can discuss this via Skype).
Closing for now. Talked about this with @gregversteeg some time ago and some of the math needs to be specified in a little more detail. Should be possible but will need more thinking.
The total correlation of each topic (and overall) is a function of additive contributions from each word. So, given a trained CorEx topic model and two different sets of documents, the difference in TC (either per topic or overall) can be written as a ranked list in terms of which words contribute most to that difference (see attached document). This yields an interpretable measure of how topical information differs between different sets of documents, and allows us to draw careful qualitative conclusions when comparing documents, particularly out-of-sample documents.
`get_topic_word_shift(X1, X2, topic_n=None)`

**Input**

- `X1`, `X2`: doc-term matrices of shapes `n_docs1` x `n_words` and `n_docs2` x `n_words`, where the columns of each matrix correspond to the same words as the original doc-term matrix used to train the CorEx topic model
- `topic_n`: either `None` or an integer. In the case of `None`, returns the ranked list of words contributing to the difference across all topics. In the case of an integer, specifies which topic to compute the TC difference contributions within.

**Output**

- `word_contributions`: list of tuples, (word, normalized contribution). A ranking (from high to low) of how much each word contributes to the difference in TC between `X1` and `X2`.

Most of the machinery for this function is already in `calculate_latent` and `normalize_latent`. The einsums likely just need to be changed a bit, but I've been struggling to parse the right way of doing it. Any help would be appreciated.

corex_topic_word_shift.pdf
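The output contract above can be sketched independently of the open einsum question. The helper below (`rank_word_contributions` is a hypothetical name, not part of corex_topic) assumes the n x m per-word contribution matrices for each document set have already been computed somehow, and only shows the differencing, normalizing, and ranking step that would produce `word_contributions`; the contribution matrices in the usage example are made up.

```python
import numpy as np

def rank_word_contributions(contrib1, contrib2, words, topic_n=None):
    """Rank words by their contribution to the TC difference between two
    document sets, given hypothetical n x m per-word contribution matrices
    (one row per word, one column per topic) for each set.

    Sketches only the output contract of the proposed get_topic_word_shift;
    computing the contribution matrices is the open question in this issue.
    """
    diff = contrib1 - contrib2                 # n x m per-word TC differences
    if topic_n is None:
        per_word = diff.sum(axis=1)            # contributions across all topics
    else:
        per_word = diff[:, topic_n]            # contributions within one topic
    norm = np.abs(per_word).sum()
    if norm > 0:
        per_word = per_word / norm             # normalized contributions
    order = np.argsort(per_word)[::-1]         # rank from high to low
    return [(words[i], per_word[i]) for i in order]

# Toy usage with made-up contribution matrices (4 words, 2 topics)
words = ['apple', 'bee', 'cat', 'dog']
c1 = np.array([[0.2, 0.1], [0.0, 0.3], [0.1, 0.0], [0.4, 0.2]])
c2 = np.array([[0.1, 0.1], [0.2, 0.1], [0.1, 0.1], [0.1, 0.0]])
ranked = rank_word_contributions(c1, c2, words, topic_n=None)
```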