slycoder / R-lda

Latent Dirichlet allocation package for R

Questions about package #6

Closed dselivanov closed 9 years ago

dselivanov commented 9 years ago

Hi, Jonathan! Thank you very much for this package; it is by far the best I have found for R. I realise the code for this package was written a while ago, but I hope you remember some details =) I have a few questions:

  1. What exactly does the warning in the lda.collapsed.gibbs.sampler() function mean? Is this comment still relevant? Does it mean that each word is sampled only once while fitting the Gibbs sampler, so the algorithm does not actually use word counts (all counts are assumed to be 1)? (I sketch how I read the suggested workaround after my code below.)

    WARNING: This function does not compute precisely the correct thing when the count associated with a word in a document is not 1 (this is for speed reasons currently). A workaround when a word appears multiple times is to replicate the word across several columns of a document. This will likely be fixed in a future version.

  2. Suppose I fit an LDA model on a large corpus, so I have the document_sums and topics matrices. Now I want to predict topics for a new (unobserved) document. Is it possible? I found this topic and ended up with the following simplified solution (in R, not considering speed issues, just a proof of concept):
# number of topics
K <- 75
i <- 1

# tokens - word counts for new document
# for example 
# tokens <- c('word_1' = 1, 'word_2' = 1, 'word_3' = 2)

# length should equal the total number of tokens in the new document
new_document_topic_distr <- vector(mode = 'integer', length = sum(tokens))

for (t in names(tokens)) {
  # repeat sampling as many times as the new document contains this token
  for (j in 1:tokens[[t]]) {
    # word_topic_matrix: word-by-topic matrix from the fitted model (rownames are words, columns are the K topics)
    probs <- word_topic_matrix[t, ]
    new_document_topic_distr[i] <- sample(x = seq_len(K), size = 1, prob = probs)
    i <- i + 1
  }
}
topic_distr <- tabulate(new_document_topic_distr, nbins = K)
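
By the way, here is how I read the replication workaround from the warning in question 1 (a minimal sketch, assuming the lda package's document format of 2 x N integer matrices with 0-based word indices in row 1 and counts in row 2; doc is a made-up example):

# a toy document: word indices 0, 3, 7 with counts 1, 2, 3
doc <- rbind(as.integer(c(0, 3, 7)),
             as.integer(c(1, 2, 3)))

# replicate each word across as many columns as its count, all with count 1
expanded <- rbind(rep(doc[1, ], times = doc[2, ]),
                  rep(1L, sum(doc[2, ])))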

Are you interested in pull requests? Do you have time to review them and maintain the package?

slycoder commented 9 years ago
  1. Word count is not ignored; it is in fact used. However, it's not exactly correct, in the sense that I sample just once and multiply by the word count to get the contribution to the topics. The more correct thing would be to sample from a multinomial, but this seems to work OK in practice (illustrated after this list).
  2. To make predictions about topics for a new document, you can use lda.collapsed.gibbs.sampler again (see the sketch after this list), except
    • Initialize the topics by setting initial=list(topics=your_model$topics, topic_sums=your_model$topic_sums).
    • Set freeze.topics to TRUE so that the topics will be treated as fixed and not updated.
  3. Definitely interested in pull requests! Happy to update the package as needed.
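
To illustrate point 1 with toy numbers (not package code; probs and cnt are made up):

K <- 4
probs <- c(0.1, 0.2, 0.3, 0.4)   # sampling distribution over topics for one word
cnt <- 3                          # count of that word in the document

# what the sampler does now: draw one topic, credit it the full count
z <- sample(seq_len(K), size = 1, prob = probs)
fast_contribution <- tabulate(z, nbins = K) * cnt

# the "more correct" alternative: spread the count with a multinomial draw
multinom_contribution <- as.integer(rmultinom(1, size = cnt, prob = probs))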
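
For point 2, a rough sketch of the recipe (fit, new_docs and vocab are placeholder names; fit is assumed to be the result of a previous lda.collapsed.gibbs.sampler() run on the training corpus, and the alpha/eta values should match the ones used for training):

library(lda)

pred <- lda.collapsed.gibbs.sampler(
  documents = new_docs,   # list of 2 x N integer matrices, same vocabulary indexing as training
  K = 75,                 # must equal the K of the trained model
  vocab = vocab,
  num.iterations = 100,
  alpha = 0.1,
  eta = 0.1,
  initial = list(topics = fit$topics, topic_sums = fit$topic_sums),
  freeze.topics = TRUE    # keep the trained topics fixed
)

# per-document topic proportions for the new documents (document_sums is K x D)
new_topic_proportions <- t(pred$document_sums) / colSums(pred$document_sums)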
dselivanov commented 9 years ago

@slycoder thank you very much for such a detailed answer! It seems I tried to reinvent the wheel for the second question =) I believe I can set a small number of iterations for prediction (5-10-20), as suggested here? Is that enough in practice?

slycoder commented 9 years ago

That might be enough, but it can't hurt to run for more iterations (to at least get an idea of how close things are).

dselivanov commented 9 years ago

Many thanks, closing this.

dselivanov commented 8 years ago

@slycoder FYI, based on your code I rewrote vanilla LDA with Rcpp (R's C interface is too verbose...) - https://github.com/dselivanov/text2vec/blob/0.4/src/gibbs.cpp#L18 Surprisingly, it turned out to be about 1.5-2x faster (I removed a lot of if conditions and sLDA-related stuff)...

slycoder commented 8 years ago

Thanks for pointing that out. It's curious because I would've thought that the CPU would've been able to branch-predict away most of the if statements. I'll have to dig further.

slycoder commented 8 years ago

Opened #8 to investigate.