lda-project / lda

Topic modeling with latent Dirichlet allocation using Gibbs sampling
https://lda.readthedocs.io/
Mozilla Public License 2.0

_transform_single function implementation #84

Closed luoshao23 closed 6 years ago

luoshao23 commented 6 years ago

I am confused about the update process in the function _transform_single. In my opinion, PZS_new should be updated using the following formula:

[image: the proposed update formula]

In this code, it seems that the doc-topic matrix is updated (PZS.sum(axis=0) - PZS + self.alpha) while the word-topic matrix self.components_[:, words].T remains unchanged. Can you explain the mechanism behind the code or give a reference? I have read the paper mentioned in the README, but I still have difficulty understanding the code, especially the following lines:

PZS_new = self.components_[:, words].T
PZS_new *= (PZS.sum(axis=0) - PZS + self.alpha)
PZS_new /= np.sum(PZS_new, axis=1, keepdims=True)

ariddell commented 6 years ago

(These are lines 215-217 of lda.py, right?)

This implements what you see in Equation 4 in Buntine 2009, translated into numpy. Looking at the first two lines of code you included (the last line is just a normalizer):

Looks like I'm missing the citation to the relevant part of this paper: [WMSM09] H.M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In L. Bottou and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 2009.

It's section 4.1 from that paper, Equation 11. (I'm going to create a pull request to update the docstring to include this citation.)
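For readers following along, the three quoted lines can be transcribed into a per-token update. This is a reading of the code, not a quotation of the paper's equation, and the notation is assumed here: q stands for PZS, \phi_{t,w} for self.components_[t, w], \alpha for self.alpha, n indexes the tokens of the new document, and s is the iteration counter.

```latex
q^{(s)}(z_n = t) \;=\;
\frac{\phi_{t,\,w_n}\,\Bigl(\alpha + \sum_{n' \neq n} q^{(s-1)}(z_{n'} = t)\Bigr)}
     {\sum_{t'} \phi_{t',\,w_n}\,\Bigl(\alpha + \sum_{n' \neq n} q^{(s-1)}(z_{n'} = t')\Bigr)}
```

The sum over n' ≠ n corresponds to PZS.sum(axis=0) - PZS, and the denominator is the normalizing third line. In this reading the word-topic factor \phi is simply read from self.components_ and never written to, so only the document-side pseudo-counts change between iterations.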

We're using this proposal distribution directly as an approximation of P(z|w) for the new document. This method was chosen because it's very fast and simple.
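As a rough illustration of how such an update could be iterated to a fixed point, here is a minimal standalone sketch. It is not the library's implementation: the function name infer_doc_topics, its parameters, the default values, and the stopping rule are all assumptions made for this example, and the document is assumed to be a non-empty list of word indices.

```python
import numpy as np


def infer_doc_topics(components, words, alpha=0.1, max_iter=20, tol=1e-16):
    """Hypothetical helper: approximate topic proportions for one new document.

    components: (n_topics, n_vocab) array of fixed topic-word probabilities.
    words: non-empty sequence of word indices, one entry per token.
    """
    words = np.asarray(words, dtype=int)
    n_topics = components.shape[0]
    # pzs[n, t] approximates P(z_n = t | w); start from all zeros.
    pzs = np.zeros((len(words), n_topics))
    for _ in range(max_iter + 1):  # the first pass acts as initialization
        # Word-topic factor for each token; copied so components is untouched.
        pzs_new = components[:, words].T.copy()
        # Document-side pseudo-counts, excluding each token's own contribution.
        pzs_new *= pzs.sum(axis=0) - pzs + alpha
        # Normalize each token's distribution over topics.
        pzs_new /= pzs_new.sum(axis=1, keepdims=True)
        delta = np.abs(pzs_new - pzs).sum()
        pzs = pzs_new
        if delta < tol:
            break
    # Average the per-token distributions into document-topic proportions.
    return pzs.sum(axis=0) / pzs.sum()


# Toy usage: two topics, two vocabulary words, a document made of word 0 only.
phi = np.array([[0.9, 0.1],
                [0.1, 0.9]])
print(infer_doc_topics(phi, [0, 0, 0]))
```

In the toy example the document contains only the word favored by the first topic, so the returned proportions lean toward that topic. No sampling is involved, which fits the "fast and simple" motivation given above.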

I hope this helps. Thank you for reviewing this code! It's great to have a second pair of eyes on it.

luoshao23 commented 6 years ago

Thank you for your explanation. That is what I was looking for. I am happy to review the code; doing so has given me a better understanding of the algorithm. I also hope this discussion helps other people understand the package better when they use it.