vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0
499 stars 110 forks

Incremental learning #16

Open cassiehkx opened 6 years ago

cassiehkx commented 6 years ago

It worked very well on a small dataset. Could it be extended to support incremental learning for huge datasets?

vi3k6i5 commented 6 years ago

@cassiehkx can you share a link to another form of LDA, or any other library, that supports incremental learning?

Basically, I want to understand how that works so I can think of an approach for implementing it here.

Thanks :)

cassiehkx commented 6 years ago

I haven't found any useful suggestions for enabling incremental learning here. But the underlying problem is that the input is a dense NumPy array, which prevents the program from scaling to a large amount of data. Incremental learning is just one solution I could think of. Maybe we could change the input to a sparse matrix? But in that case the matrix multiplication in the log-likelihood calculation would be a problem. What would you recommend?
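To sketch what I mean (not this library's API, just the general direction): SciPy's CSR format does support the sparse-by-dense products a log-likelihood computation typically needs, so the multiplication itself may not be a blocker. All names and shapes below are made up for illustration:

```python
import numpy as np
from scipy import sparse

# Illustrative sizes: many documents, large vocabulary, a few topics.
n_docs, n_vocab, n_topics = 100_000, 50_000, 20

rng = np.random.default_rng(0)
# A sparse document-term count matrix (CSR) instead of a dense ndarray.
X = sparse.random(n_docs, n_vocab, density=1e-4, format="csr",
                  random_state=0,
                  data_rvs=lambda n: rng.integers(1, 5, n))

# Dense topic-word log-probabilities, as an LDA model would maintain them.
log_phi = np.log(rng.dirichlet(np.ones(n_vocab), size=n_topics))

# Sparse @ dense works and only touches the non-zero counts:
# each document's contribution to the per-topic log-likelihood term.
doc_topic_ll = X @ log_phi.T  # dense (n_docs, n_topics) result
print(doc_topic_ll.shape)
```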

vi3k6i5 commented 6 years ago

Fair point; switching to a sparse matrix should be easier than incremental learning.

cassiehkx commented 6 years ago

Another question: what is the difference between your code and the original scikit-learn LDA, where the eta parameter can control the initialization weights? The paper you reference describes a more sophisticated method, while your code seems only to set a higher weight for the seed words at initialization and does not do much during the log-likelihood calculation. So how is that different from simply initializing an eta matrix with the seed words?
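For reference, the seeded initialization this library exposes looks like the following (adapted from the project README; the seed word lists here are shortened for illustration):

```python
import guidedlda

# Sample NYT data shipped with the package, as in the README.
X = guidedlda.datasets.load_data(guidedlda.datasets.NYT)
vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)
word2id = dict((v, i) for i, v in enumerate(vocab))

# Map each seed word to the topic id it should be nudged toward.
seed_topic_list = [['game', 'team', 'win', 'player', 'season'],
                   ['percent', 'company', 'market', 'price', 'sell']]
seed_topics = {}
for t_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        seed_topics[word2id[word]] = t_id

model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)
# seed_confidence biases the *initial* topic assignment of the seed words;
# after initialization, sampling proceeds as in plain LDA.
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```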

tmerrittsmith commented 6 years ago

In the Gensim implementation of LDA, you can set `chunksize` to learn incrementally, I think?
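A minimal sketch of that, assuming Gensim's standard `LdaModel` API (the toy corpus here is only for illustration; in practice the documents would be streamed from disk):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized documents.
docs = [["guided", "topic", "model"],
        ["sparse", "matrix", "input"],
        ["topic", "model", "input"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# chunksize controls how many documents are processed per update batch,
# so the full corpus never has to sit in memory at once.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               chunksize=2000, update_every=1, passes=1)

# Later, fold in new documents without retraining from scratch.
new_corpus = [dictionary.doc2bow(["sparse", "topic", "model"])]
lda.update(new_corpus)
```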