Closed by ariddell 9 years ago
I think the best way to do this is just to use Buntine's LR sampler to sample z
assignments for the new documents. The code for this exists (in horizont).
Making progress on this.
Hi I was just wondering if there is any progress with the transform method implementation? Thanks for the good work!
Yes, I'm working on it. I should be able to find some time to finish it this week or next.
ok, cool, that'd be great! happy to test it (against another implementation of mine)
That'd be awesome!
I literally found your code today and was very happy about it -- then encountered the NotImplementedError ...
I'm planning on switching from R to Python, and I think it's fantastic that you use the scikit learn framework! Good work.
I was going to offer to help, but I'm a relative newbie to both serious Python and Gibbs sampling, and so I'm glad you've taken it on. Like mrtdoulaty I'm happy to test though.
A first draft is done (very much unoptimized). I'd welcome any testing. It's on the feature branch feature/transform-iterated-pseudo-counts
@mrtdoulaty @matthiasmauch @fbkarsdorp
As an aside, the unbiased estimation method (Buntine's LR sampler) is far too slow for practical use. The iterated pseudo-count method is biased but it's not too bad according to the Buntine paper.
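For readers unfamiliar with the iterated pseudo-count idea, here is a minimal sketch of how it can work for a single unseen document. It is not the code on the feature branch. The function name, the `phi` layout, and the `alpha`, `n_iter` and `tol` parameters are all assumptions made for illustration. The topic-word probabilities are held fixed, and each token's soft topic assignment is repeatedly updated from the prior plus the soft counts of all the other tokens in the document.

```python
def iterated_pseudo_counts(doc, phi, alpha=0.1, n_iter=20, tol=1e-16):
    """Estimate a topic distribution for one unseen document (illustrative sketch).

    doc: list of word ids, one entry per token
    phi: phi[k][w] is the probability of word w under topic k (already fitted,
         assumed strictly positive so every token has a nonzero weight)
    """
    n_topics = len(phi)
    if not doc:
        # No evidence at all: fall back to a uniform distribution.
        return [1.0 / n_topics] * n_topics

    # q[n][k] is the soft assignment of token n to topic k; it starts at zero.
    q = [[0.0] * n_topics for _ in doc]
    for _ in range(n_iter):
        totals = [sum(row[k] for row in q) for k in range(n_topics)]
        new_q = []
        for n, w in enumerate(doc):
            # Pseudo-counts contributed by all *other* tokens, plus the prior.
            weights = [phi[k][w] * (totals[k] - q[n][k] + alpha)
                       for k in range(n_topics)]
            norm = sum(weights)
            new_q.append([x / norm for x in weights])
        # Total absolute change in the soft assignments since the last sweep.
        delta = sum(abs(a - b)
                    for new_row, old_row in zip(new_q, q)
                    for a, b in zip(new_row, old_row))
        q = new_q
        if delta < tol:
            break

    totals = [sum(row[k] for row in q) for k in range(n_topics)]
    total = sum(totals)
    return [t / total for t in totals]


# Toy model: two topics over a two-word vocabulary.
phi = [[0.9, 0.1],
       [0.1, 0.9]]
theta = iterated_pseudo_counts([0, 0, 0, 1], phi)
```

In this toy call the document is mostly word 0, so most of the weight in `theta` should land on the first topic. Because the update is deterministic, the number of sweeps (here the hypothetical `n_iter`) trades accuracy for speed rather than controlling sampling noise.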
Hi, ariddell. Thank you. Works for me! Though I can't say whether the results are "correct", there's no error, and they're in the right format :)
And another question: I've previously used the R implementation in the "topicmodels" package. The estimation method there defaults to variational EM, which I used (btw: for this paper http://arxiv.org/abs/1502.05417).
Do you have an opinion on the difference or relative merits of VEM vs Gibbs sampling?
Well, it's complicated. There's a lot of literature on the subject. The short answer is that there are more theoretical guarantees that MCMC (e.g., Gibbs) will (eventually) approximate the posterior distribution. VEM is clearly faster but there is no way to bound the approximation error. If you have a small(ish) dataset and you monitor convergence, you should be using MCMC and not VEM (IMHO).
Do the results look plausible?
I'll write up some documentation for this and then make a release.
Thanks for the update. I tested it as well and it looks to be working fine. (I'm not actually using it for text, so my words are not meaningful. I compared the topic distributions from my own code with the ones from this implementation, and they were reasonably close.)
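The comment above does not say which similarity measure was used. As one possible way to run that kind of check, the sketch below computes the Hellinger distance between two topic distributions for the same document. The choice of Hellinger distance and the function name are assumptions, and the comparison only makes sense if both implementations index their topics in the same order (for example, because both use the same fitted topic-word matrix).

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Hypothetical topic distributions for one document from two implementations.
theta_a = [0.70, 0.20, 0.10]
theta_b = [0.65, 0.25, 0.10]
distance = hellinger(theta_a, theta_b)
```

Small distances across many held-out documents would suggest the two implementations agree. What counts as "reasonably close" is a judgment call that depends on the data.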
Yeah, I think quite plausible, but sadly my "words" don't have literal meanings (in this case they are melodic features), so I can't glance at them to see whether they fit together, I'm afraid.
It'll probably take a couple of days to check whether the "meaning" makes sense (this is music from different places in the world, so we should see them group somehow, though we don't know how).
Matthias
It should probably take an n_iter as an argument.