lejon / PartiallyCollapsedLDA

Implementations of various fast parallelized samplers for LDA, including Partially Collapsed LDA, Light LDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA
26 stars 20 forks source link

Use of P(w) in distinctiveness calculation #14

Closed roger-mahler closed 3 years ago

roger-mahler commented 4 years ago

I'm in the process of implementing a Python-wrapper for PC-LDA to be used in the Welfare state analytics research project. I have a question concerning the distinctiveness calculation here.

The formula according to [Chung et. a.l 2012] is Σ p(t|w) ∙ log(p(t|w) / p(t)), whilst the code implements the formula Σ p(t|w) ∙ log(p(t|w) / p(w)) i.e. word probability is used instead of topic probability. Is this a bug, or am I missing something?

@MansMeg

MansMeg commented 4 years ago

Hi! I need to check this in the paper myself to be sure. Currently, I'm quite busy with paper submission, but later this week I'll check this!

lejon commented 4 years ago

Thanks for the heads up Roger! Will check it out when time permits!

On Tue, Jun 2, 2020 at 8:43 AM Måns Magnusson notifications@github.com wrote:

Hi! I need to check this in the paper myself to be sure. Currently, I'm quite busy with paper submission, but later this week I'll check this!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lejon/PartiallyCollapsedLDA/issues/14#issuecomment-637313506, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA664M5TDJDE4F27XHN4EXLRUSNSLANCNFSM4NPUPLCQ .

MansMeg commented 4 years ago

Yes. It looks like a bug in the code. We should supply p_t() instead of p_w() to the function. I guess this is a real bug if the documentation is correct. But it depends what is supplied to the function if we supply p_t to p_w it would still maybe be correct.

Anyway the notation should be fixed/correct. Also we should check that the function uses the correct p_t().

lejon commented 3 years ago

Again, many thanks @roger-mahler for the heads up. I have now corrected this in commit https://github.com/lejon/PartiallyCollapsedLDA/commit/fd2b1aff9720a4b2218cc2c748132f0ae6f5811f

Cheers!