slycoder / R-lda

Latent Dirichlet allocation package for R

Calculation of Perplexity #13

Closed: xz6014 closed this issue 6 years ago

xz6014 commented 6 years ago

Hi Jonathan,

Sorry to bother you once again.

I was wondering whether there is a way to calculate the perplexity on a validation set using an sLDA model trained with the package.

Many thanks,

Xiaohan

slycoder commented 6 years ago

I believe you can just run one iteration of slda.em on your test set with freeze.topics = TRUE, and use the likelihood to compute perplexity.
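For readers following along, here is a minimal base-R sketch of the last step, turning a held-out log-likelihood into perplexity as exp(-log-likelihood / number of tokens). The helper names `count_tokens` and `perplexity` and the toy documents are made up for illustration and are not part of the package. The sketch assumes documents are stored in the package's usual format, a 2-row integer matrix per document with word indices in row 1 and counts in row 2.

```r
# Hypothetical helpers, not part of the lda package.

# Total number of tokens in a list of documents.
# Assumption: each document is a 2-row matrix, row 1 = word ids,
# row 2 = how many times each word occurs in the document.
count_tokens <- function(docs) {
  sum(vapply(docs, function(d) as.numeric(sum(d[2, ])), numeric(1)))
}

# Perplexity from a total log-likelihood over n_tokens tokens.
perplexity <- function(log_lik, n_tokens) {
  exp(-log_lik / n_tokens)
}

# Toy documents, invented for this example.
toy_docs <- list(
  rbind(c(0L, 1L), c(2L, 1L)),  # word 0 twice, word 1 once
  rbind(3L, 4L)                 # word 3 four times
)

n_tokens <- count_tokens(toy_docs)

# A log-likelihood of n_tokens * log(1/2) means each token was assigned
# probability 1/2 on average, so the perplexity works out to 2.
toy_log_lik <- n_tokens * log(0.5)
toy_perplexity <- perplexity(toy_log_lik, n_tokens)
```

When every count in row 2 is 1, the token count is simply the number of columns; otherwise the counts need to be summed as above.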

slycoder commented 6 years ago

See for example how it was done in #10 for a slightly different example.

xz6014 commented 6 years ago

Hi Jonathan,

Thank you for your kind reply. I will have a go at it.

Best,

Xiaohan


On 29 Aug 2018, at 17:09, Jonathan Chang <notifications@github.com> wrote:

See for example how it was done in #10 (https://github.com/slycoder/R-lda/issues/10) for a slightly different example.


xz6014 commented 6 years ago

Hi Jonathan,

I have attempted to solve the problem by following your instructions. However, the perplexity values that I obtained are very large, and I was wondering whether something has gone wrong. Below is my code:

a2 <- matrix(1, nrow = 10, ncol = 2)
a2[, 1] <- 2

# 10-fold CV
for (i in 1:10) {
  # Training model
  slda <- results_df_final2$V2[[i]]
  score <- results_df_final2$V4[[i]]

  slda_valid_idx <- unique(matrix(t(ordering[which(splitfolds == i), ]), ncol = 1))
  valid_docs <- documents[slda_valid_idx]

  l <- matrix(1, nrow = 1, ncol = length(slda_valid_idx))
  for (j in 1:length(slda_valid_idx)) {
    l[j] <- mean(score[which(ordering == slda_valid_idx[j])])
  }

  # Obtaining predicted responses for validation data set
  l <- as.vector(l)

  # Counting number of words in validation data
  N <- 0
  for (k in 1:length(slda_valid_idx)) {
    N <- N + dim(valid_docs[[k]])[2]
  }

  topic_num <- 2
  alpha <- 1.0
  eta <- 0.1
  variance <- 0.5
  lambda <- 1.0
  e_iter <- 1
  m_iter <- 1
  fit_slda2 <- slda.em(documents = valid_docs,
                       K = topic_num,
                       vocab = vocab,
                       num.e.iterations = e_iter,
                       num.m.iterations = m_iter,
                       alpha = alpha,
                       eta = eta,
                       annotations = l,
                       params = params[[which(candidate_k == topic_num)]],
                       variance = variance,
                       lambda = lambda,
                       logistic = FALSE,
                       method = "sLDA",
                       initial = list(topics = slda$topics, topic_sums = slda$topic_sums),
                       compute.log.likelihood = TRUE,
                       freeze.topics = TRUE)

  # Perplexity
  a2[i, 2] <- exp(-fit_slda2$log.likelihoods[1] / N)
}

Many thanks,

Xiaohan

slycoder commented 6 years ago

You should do many more iterations of the e step.
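As an illustration of why this matters, the sketch below reads the log-likelihood from the final iteration of a trace instead of the first. It assumes the trace is a matrix with 2 rows and one column per iteration, which is how the package documents `log.likelihoods` for `lda.collapsed.gibbs.sampler`; whether `slda.em` returns the same shape is not confirmed in this thread. The helper `final_log_lik`, the choice of row 2, and the toy numbers are all hypothetical.

```r
# Hypothetical helper: pick the log-likelihood at the last iteration.
# Assumption: `trace` is a 2 x num_iterations matrix, with row 1 the full
# log-likelihood (including the prior) and row 2 the log-likelihood of the
# observations given the topic assignments.
final_log_lik <- function(trace, row = 2) {
  stopifnot(is.matrix(trace), row %in% seq_len(nrow(trace)))
  trace[row, ncol(trace)]
}

# Invented trace over four iterations; values improve as sampling proceeds.
toy_trace <- rbind(c(-900, -700, -650, -640),
                   c(-850, -660, -610, -600))
n_tokens <- 100

# Perplexity from the first iteration versus the last one.
ppl_first <- exp(-toy_trace[2, 1] / n_tokens)
ppl_last  <- exp(-final_log_lik(toy_trace) / n_tokens)
```

With a single e-step iteration the topic assignments for the held-out documents are still close to their initial state, so the first value in the trace can give a much larger perplexity than the value reached after more iterations.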

xz6014 commented 6 years ago

I see. I will have a go.

Many thanks,

Xiaohan