vi3k6i5 / GuidedLDA

Semi-supervised guided topic model with custom GuidedLDA
Mozilla Public License 2.0

Potential issue #13

Open drd13 opened 6 years ago

drd13 commented 6 years ago

In your README for the guidedlda module, you showed the behaviour of the algorithm on the NYT dataset. I tried running the example code you provided, with the same seeds and parameters, but increasing the LDA's number of iterations from 100 to 1000. Doing this, I obtained very similar topics for the guided and unguided runs.

The topics were, for the unguided LDA:

Topic 0: company percent market business price sell executive president
Topic 1: game play team win season player second victory
Topic 2: life play man write woman thing young child
Topic 3: building city place area small house water home
Topic 4: official state government issue case member public political

and for the guided LDA:

Topic 0: game play team win player season second start victory point
Topic 1: company percent market price business sell executive sale buy cost
Topic 2: life play man thing woman write book old young world
Topic 3: official state government issue case political public states member leader
Topic 4: city building police area home house car father live yesterday

These topics are pretty much identical (only the ordering of a few words within the topics is different). This suggests that the algorithm you have implemented, when run to convergence, is identical to regular LDA.

If my understanding is correct, the algorithm described in Jagadeesh Jagarlamudi, Hal Daumé III and Raghavendra Udupa (2012) is more involved: it requires a change to the generative model, and thus to the collapsed Gibbs sampling formula. Your algorithm seems to use the seeds only for the initialization.

I was wondering if you could shed some light on these issues?

vi3k6i5 commented 6 years ago

Sure thing. The example is based on a really small dataset, so it will be hard to see a significant difference.

You are right about the initialisation. As described in the blog post, I have only manipulated the initialisation and let the LDA algorithm do its magic after that.

I didn't want to create topics which don't actually have enough strength to become topics of their own.
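
For concreteness, the kind of initialisation bias I mean looks roughly like this (a minimal sketch, not the actual GuidedLDA source; seed_topics, seed_confidence and the other names are just illustrative):

```python
import numpy as np

def seeded_init(docs, n_topics, seed_topics, seed_confidence, rng=np.random):
    """Assign an initial topic to every token.

    docs:            list of (doc_id, word_id) tokens
    seed_topics:     dict mapping word_id -> the topic it seeds
    seed_confidence: probability of honouring the seed instead of
                     picking a topic at random
    """
    z = np.empty(len(docs), dtype=np.intp)
    for i, (d, w) in enumerate(docs):
        if w in seed_topics and rng.random_sample() < seed_confidence:
            z[i] = seed_topics[w]          # nudge seed words towards their seed topic
        else:
            z[i] = rng.randint(n_topics)   # everything else starts at random
    return z
```

After this initialisation, the regular collapsed Gibbs sweeps run unchanged.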

If you can elaborate more on the part quoted below, I will be able to explain better.

If my understanding is correct, the algorithm described in Jagadeesh Jagarlamudi, Hal Daumé III and Raghavendra Udupa (2012) is more involved: it requires a change to the generative model, and thus to the collapsed Gibbs sampling formula. Your algorithm seems to use the seeds only for the initialization.

drd13 commented 6 years ago

I am not an expert in latent Dirichlet allocation, so I would be curious to know if you agree with my interpretation.

It seems to me, from your README file, that you are treating the algorithm you implemented and the algorithm in "Incorporating Lexical Priors into Topic Models" as identical.

The modified algorithm in the paper updates the generative process of LDA to incorporate the seeded words. Any such modification to the generative process would lead to a change in the collapsed Gibbs sampling formula (since the conditional distributions being sampled would be different).

In your implementation, since you do not modify the collapsed Gibbs sampling, you do not actually modify the generative process of the algorithm. To explain the difference with an analogy: if you think of LDA as an algorithm trying to minimise a function, your implementation modifies the starting point (here, the initialisation) from which the algorithm searches for the minimum, while the implementation in the paper modifies the function being minimised.
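
For reference, the standard collapsed Gibbs conditional for LDA with symmetric priors $\alpha$, $\beta$ and vocabulary size $V$, which as far as I can tell your sampler leaves unchanged, is

$$
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)\,\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}
$$

where $n_{d,k}$, $n_{k,w}$ and $n_{k,\cdot}$ are the usual document-topic, topic-word and topic totals, each excluding the current token $i$. Seeding only the initialisation changes where the sampler starts; the paper's model changes this conditional itself.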

If LDA does sometimes converge to local minima, then your algorithm could be interesting as a way to steer it towards desired local minima. Otherwise it could be useful for quicker convergence. But, if my understanding of LDA is correct, the guiding in your implementation is considerably weaker than that of the algorithm in the paper.

vi3k6i5 commented 6 years ago

I was reading up on collapsed Gibbs sampling and I would also not call myself an expert. But anyway, collapsed Gibbs sampling is present in a lot of LDA implementations. As far as I know, the GuidedLDA code and the base LDA code it is built on also use collapsed Gibbs sampling.

The computation is done with 3 matrices: ndz, nz, nzw. The same approach is described in the standard collapsed Gibbs sampling derivation.
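
Roughly, a single sweep over the tokens looks like this (a simplified sketch using those three count matrices, not the actual GuidedLDA source):

```python
import numpy as np

def gibbs_sweep(z, docs, ndz, nzw, nz, alpha, beta, rng=np.random):
    """One collapsed Gibbs sweep (simplified sketch).

    ndz[d, k]: tokens in document d assigned to topic k
    nzw[k, w]: times word w is assigned to topic k
    nz[k]:     total tokens assigned to topic k
    """
    n_topics, vocab_size = nzw.shape
    for i, (d, w) in enumerate(docs):          # docs: list of (doc_id, word_id) tokens
        k_old = z[i]
        # remove the current assignment from the counts
        ndz[d, k_old] -= 1
        nzw[k_old, w] -= 1
        nz[k_old] -= 1
        # standard LDA conditional; the seeds only affected how z was initialised
        p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + vocab_size * beta)
        k_new = rng.choice(n_topics, p=p / p.sum())
        z[i] = k_new
        # add the new assignment back
        ndz[d, k_new] += 1
        nzw[k_new, w] += 1
        nz[k_new] += 1
    return z
```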

I might be wrong though. Please let me know if that is the case.

Thanks :)

drd13 commented 6 years ago

The issue is that the mathematical expression for the collapsed Gibbs sampling depends on the generative process of the data. This can easily be seen from the repository associated with the paper, which has a different expression for the Gibbs sampling: https://github.com/bsou/cl2_project/tree/master/SeededLDA
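
To make the contrast concrete: a seed-aware sampler changes the conditional itself, not just the starting point. Purely for illustration (this is not the formula from the paper or from that repository; is_seed and seed_boost are made-up names), one simple way guidance can enter the sampling distribution is an asymmetric prior on seed words:

```python
import numpy as np

def seed_biased_conditional(d, w, ndz, nzw, nz, alpha, beta, seed_boost, is_seed):
    """Illustrative only: a conditional in which the seeds appear directly,
    via an asymmetric word prior, so the guidance persists at every iteration.

    is_seed[k, w] is True when word w is a seed word for topic k.
    """
    n_topics, vocab_size = nzw.shape
    beta_kw = np.where(is_seed[:, w], beta + seed_boost, beta)       # boosted prior for seed words
    beta_sum = beta * vocab_size + seed_boost * is_seed.sum(axis=1)  # per-topic prior mass
    p = (ndz[d] + alpha) * (nzw[:, w] + beta_kw) / (nz + beta_sum)
    return p / p.sum()
```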

vi3k6i5 commented 6 years ago

Yes, true. The approaches used in the two cases are different. We took our own approach.

Hence I said it's based on the paper, but not a complete implementation.

Please share any information that I can read to learn more about this. I haven't worked on this project in a long time. And honestly, the deep learning projects are all working so much better that I don't think these projects will have much of a future. They have a present, but probably not for long.

Happy to improve on this project in my spare time though. Please do share whatever information you have on the other approach.

dldx commented 6 years ago

And honestly all the deep learning projects are working so much better that I don't think these projects will have much of a future.

Is there a neural net alternative to LDA? I haven't found anything comparable to unsupervised topic modelling from the deep learning community but perhaps I'm missing something.

vi3k6i5 commented 6 years ago

Only labelling the data manually and building a supervised classifier, as far as I know. There is a word2vecLDA but I don't think that project is maintained anymore.

dldx commented 6 years ago

Ah, okay. That's what I thought :) Thanks for the reply!