lda-project / lda

Topic modeling with latent Dirichlet allocation using Gibbs sampling
https://lda.readthedocs.io/
Mozilla Public License 2.0

Implement the `transform` method #17

Closed by ariddell 9 years ago

ariddell commented 10 years ago

It should probably take an n_iter as an argument.

ariddell commented 10 years ago

I think the best way to do this is just to use Buntine's LR sampler to sample z assignments for the new documents. The code for this exists (in horizont).

ariddell commented 9 years ago

Making progress on this.

mdoulaty commented 9 years ago

Hi, I was just wondering whether there has been any progress on the transform method implementation. Thanks for the good work!

ariddell commented 9 years ago

Yes, I'm working on it. I should be able to find some time to finish it this week or next.

mdoulaty commented 9 years ago

ok, cool, that'd be great! happy to test it (against another implementation of mine)

matthiasmauch commented 9 years ago

That'd be awesome!

I literally found your code today and was very happy about it -- then encountered the NotImplementedError ...

I'm planning on switching from R to Python, and I think it's fantastic that you use the scikit learn framework! Good work.

I was going to offer to help, but I'm a relative newbie to both serious Python and Gibbs sampling, and so I'm glad you've taken it on. Like mrtdoulaty I'm happy to test though.

ariddell commented 9 years ago

A first draft is done (very much unoptimized). I'd welcome any testing. It's on the feature branch `feature/transform-iterated-pseudo-counts`. @mrtdoulaty @matthiasmauch @fbkarsdorp

ariddell commented 9 years ago

As an aside, the unbiased estimation method (Buntine's LR sampler) is far too slow for practical use. The iterated pseudo-count method is biased but it's not too bad according to the Buntine paper.
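
For readers unfamiliar with the iterated pseudo-count idea, here is a minimal sketch of one common formulation. It is not necessarily the code on the feature branch. It assumes a fixed, already-fitted topic-word matrix `phi` (each row sums to 1, all entries positive) and a symmetric prior `alpha`. The function name `infer_doc_topics` and its parameters are hypothetical.

```python
import numpy as np

def infer_doc_topics(word_ids, phi, alpha=0.1, max_iter=20, tol=1e-16):
    """Estimate a topic distribution for one new document.

    word_ids : vocabulary indices, one per token in the document
    phi      : (n_topics, n_vocab) topic-word probabilities, held fixed
    alpha    : symmetric Dirichlet prior on topics (an assumption here)
    """
    word_ids = np.asarray(word_ids)
    n_topics = phi.shape[0]
    # q[n, k] is the current soft assignment of token n to topic k.
    q = np.zeros((len(word_ids), n_topics))
    for _ in range(max_iter + 1):  # the first pass acts as initialization
        # Pseudo-counts for token n: soft counts of all *other* tokens, plus the prior.
        others = q.sum(axis=0) - q + alpha
        q_new = phi[:, word_ids].T * others
        q_new /= q_new.sum(axis=1, keepdims=True)
        delta = np.abs(q_new - q).sum()
        q = q_new
        if delta < tol:
            break
    # Normalize the summed soft assignments into a distribution over topics.
    return q.sum(axis=0) / q.sum()
```

Each token's assignment is updated deterministically from the expected counts of the other tokens, so no sampling is involved. That is why it is fast, and also why it is only an approximation. An empty document would divide by zero in this sketch and needs separate handling.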

matthiasmauch commented 9 years ago

Hi, ariddell. Thank you. Works for me! Though I can't say whether the results are "correct", there's no error, and they're in the right format :)

matthiasmauch commented 9 years ago

And another question: I've previously used the R implementation in the "topicmodels" package. The estimation method there defaults to variational EM, which I used (btw: for this paper http://arxiv.org/abs/1502.05417).

Do you have an opinion on the difference or relative merits of VEM vs Gibbs sampling?

ariddell commented 9 years ago

Well, it's complicated. There's a lot of literature on the subject. The short answer is that there are more theoretical guarantees that MCMC (e.g., Gibbs) will (eventually) approximate the posterior distribution. VEM is clearly faster but there is no way to bound the approximation error. If you have a small(ish) dataset and you monitor convergence, you should be using MCMC and not VEM (IMHO).
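
As a rough illustration of what "monitoring convergence" can mean in practice, the sketch below checks whether a recorded log-likelihood trace has flattened out. It assumes you have one log-likelihood value per checkpoint of the sampler in a plain list. The helper name `has_plateaued` and its thresholds are hypothetical, and a plateau is only a heuristic, not a guarantee that the chain has converged.

```python
def has_plateaued(loglik_trace, window=5, rel_tol=1e-3):
    """Return True if the mean of the last `window` values is within
    `rel_tol` (relative) of the mean of the `window` values before them."""
    if len(loglik_trace) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(loglik_trace[-window:]) / window
    previous = sum(loglik_trace[-2 * window:-window]) / window
    return abs(recent - previous) <= rel_tol * abs(previous)
```

Running several chains from different random seeds and comparing their traces is a stronger check than inspecting a single run.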

ariddell commented 9 years ago

Do the results look plausible?

I'll write up some documentation for this and then make a release.

mdoulaty commented 9 years ago

Thanks for the update. I tested it as well and it looks to be working fine. (I'm not actually using it for text, so my words are not meaningful. I ran a similarity test between the topic distributions from my own code and the ones from this implementation, and they were reasonably close.)
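
A similarity test of this kind could look like the sketch below. The exact comparison used above is not stated, so the Hellinger distance here is an assumption, as is the name `hellinger_rows`. It also assumes both implementations infer against the same fitted topics, so that topic indices line up.

```python
import numpy as np

def hellinger_rows(P, Q):
    """Row-wise Hellinger distance between two (n_docs, n_topics) arrays
    whose rows are probability distributions. 0 means identical, 1 means
    the two distributions share no mass."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return np.sqrt(0.5 * ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum(axis=1))
```

Looking at the mean and the maximum of the returned distances gives a quick sense of how closely two sets of document-topic distributions agree.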

matthiasmauch commented 9 years ago

Yeah, I think quite plausible, but sadly my "words" don't have literal meanings (in this case they are melodic features), so I can't glance at them to see whether they fit together, I'm afraid.

It'll probably take a couple of days to check whether the "meaning" makes sense (this is music from different places in the world, so we should see them group somehow, though we don't know how).

Matthias
