slycoder / R-lda

Latent Dirichlet allocation package for R

Using model to predict posterior for new documents #14

Closed andrewhill157 closed 5 years ago

andrewhill157 commented 5 years ago

This is likely a naive question, as I am very new to topic models in general, but is there a way to use the return values of lda.collapsed.gibbs.sampler to predict, for a new set of documents, the posterior probability (or something roughly equivalent to document_expects) of assignment to the set of topics learned on the original documents? Thanks for any help or recommendations!

Thanks, Andrew

slycoder commented 5 years ago

Hi, check out #10 for an example and some discussion about potential pitfalls.

andrewhill157 commented 5 years ago

Whoops, sorry to have missed that (I had previously posted on the wrong repo and didn't check again when I switched) -- thanks for the help!


alaxn commented 5 years ago

@slycoder Hi, thank you for your answers in #10. I have another question: how should I predict new documents when the new documents are not consistent with the training documents? For example, I built a model on 300 documents, and now I only have 100 new documents to predict. I tried to extend the set of new documents by duplicating it twice. Is that OK? Thank you very much.

slycoder commented 5 years ago

Hi @alaxn, there's no reason why the length of the documents list needs to be the same length as the original. Are you running into any issues when passing in the shorter (length 100) set of test documents?

alaxn commented 5 years ago

Thank you @slycoder.

I tried this on the test documents:

```r
test_out = lda.collapsed.gibbs.sampler(test_documents, topic_num, vocab, 1000, alpha, eta,
                                       initial = list(assignments = out$assignments),
                                       freeze.topics = TRUE)
```

and it returns the error:

```
Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K), :
  initial must be a length nd NewList.
```

As you suggested, I tested a shorter (length 100) set of test documents but got the same error. If I remove `initial`, everything is OK. Code is here:

```r
test_out = lda.collapsed.gibbs.sampler(test_documents, topic_num, vocab, 1000, alpha, eta)
```

BTW, I noticed that if I change the `initial` parameter to this, the code runs well:

```r
test_out = lda.collapsed.gibbs.sampler(test_documents, topic_num, vocab, 1000, alpha, eta,
                                       initial = list(topics = out$topics, topic_sums = out$topic_sums),
                                       freeze.topics = TRUE)
```

If you need to know more about my code, please tell me. Thank you very much!

slycoder commented 5 years ago

Correct. You probably don't want to set the initial assignments as you did in the first bit of code. In most applications you want to infer topic distributions over new documents conditioned on the learned topics, in which case you only pass topics and topic_sums to initial.

Most of the time you only want to use initial assignments if you plan on interrupting the sampler (e.g. to test convergence).

In any case it sounds like you figured things out!
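For anyone landing here later, the working pattern from this thread can be sketched as below. This is a hedged sketch, not a definitive recipe: `my_documents`, `new_documents`, `my_vocab`, `K`, and the hyperparameter values are placeholders you would supply yourself, and `my_documents`/`new_documents` are assumed to already be in the package's sparse list-of-matrices format over the shared vocabulary `my_vocab`.

```r
library(lda)

# Placeholder hyperparameters -- tune for your own corpus.
K     <- 10
alpha <- 0.1
eta   <- 0.1

# 1. Train on the original corpus to learn the K topics.
fit <- lda.collapsed.gibbs.sampler(my_documents, K, my_vocab,
                                   num.iterations = 1000,
                                   alpha = alpha, eta = eta)

# 2. Infer topic distributions for new documents, conditioned on the
#    learned topics: pass only topics and topic_sums to `initial`,
#    and set freeze.topics = TRUE so the topics are not updated.
new_fit <- lda.collapsed.gibbs.sampler(new_documents, K, my_vocab,
                                       num.iterations = 100,
                                       alpha = alpha, eta = eta,
                                       initial = list(topics = fit$topics,
                                                      topic_sums = fit$topic_sums),
                                       freeze.topics = TRUE)

# Per-document topic proportions for the new documents
# (document_sums is K x D: topic counts per document).
props <- t(new_fit$document_sums) / colSums(new_fit$document_sums)
```

Note that the list of new documents can be any length; as discussed above, it does not need to match the size of the training set.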

alaxn commented 5 years ago

Thank you! @slycoder