stan-dev / example-models

Example models for Stan
http://mc-stan.org/

deprecation warnings in LDA example code #125

Open cartazio opened 6 years ago

cartazio commented 6 years ago

hey @bob-carpenter, @lizzyagibson and I have been looking at the lda example code (it's nice how closely it maps to the generative description in the LDA journal paper), and there are a few deprecation warnings related to <- along with the increment_log_prob(log_sum_exp(...)) target-update expression. you may wanna update them :)
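(for reference, the warnings are just about the newer syntax: = instead of <- for assignment, and target += instead of increment_log_prob. a rough sketch of the substitution, reusing the names from the model:)

  // deprecated syntax
  gamma[k] <- log(theta[doc[n], k]) + log(phi[k, w[n]]);
  increment_log_prob(log_sum_exp(gamma));

  // current syntax
  gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
  target += log_sum_exp(gamma);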

thanks for the lovely examples!

cartazio commented 6 years ago

relatedly, what's the correct / recommended way to rewrite the sum over the gammas?

as written, it's increment_log_prob(log_sum_exp(gamma))

should it be

a) target += gamma
b) target += something something gamma
c) something else?

bob-carpenter commented 6 years ago

The code is way out of date. It's in

https://github.com/stan-dev/example-models/blob/ec6d329bb5a88fa53e44c28fa01287701660933c/misc/cluster/lda/lda.stan

The current marginalization over the topic (k) for a given word (n) is this:

  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K) 
      gamma[k] <- log(theta[doc[n],k]) + log(phi[k,w[n]]);
    increment_log_prob(log_sum_exp(gamma));  // likelihood
  }

That can be reduced to

  for (n in 1:N)
    target += log_sum_exp(log(theta[doc[n]]) + to_vector(log(phi[ , w[n]])));

It'd be even better to define log_phi in vector form and reuse for each n. It would also be worth doing this for log_theta if the number of words per document is greater than the total number of topics.
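A rough sketch of that caching (assuming declarations along the lines of the linked model, roughly simplex[K] theta[M] and simplex[V] phi[K]; the names log_theta and log_phi are just illustrative):

  transformed parameters {
    vector[V] log_phi[K];    // per-topic log word probabilities, computed once per evaluation
    vector[K] log_theta[M];  // per-document log topic probabilities
    for (k in 1:K)
      log_phi[k] = log(phi[k]);
    for (m in 1:M)
      log_theta[m] = log(theta[m]);
  }
  model {
    // ... priors on theta and phi ...
    for (n in 1:N)
      target += log_sum_exp(log_theta[doc[n]] + to_vector(log_phi[ , w[n]]));
  }

Declaring log_phi and log_theta as local variables at the top of the model block instead would avoid saving them in the posterior output.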

bob-carpenter commented 6 years ago

@cartazio: Feel free to submit a pull request.

And a warning---you can't really do Bayesian inference for LDA because of the multimodality. You'll see that you won't satisfy convergence diagnostics when running multiple chains, and not just because of label switching.

cartazio commented 6 years ago

@bob-carpenter thanks! That's super helpful.

by multi-mode you mean: there are different local optima when viewed as an optimization problem / things are nonconvex? (i.e., vary the priors and there will be different local optima in the posterior?) I had to google around to figure out what you meant; https://scholar.harvard.edu/files/dtingley/files/multimod.pdf seemed the most clearly expository despite the double-spaced formatting :)

are there any good readings/references on how the "variational" formulations such as Mallet/VowpalWabbit etc. deal with that issue? or is it just one of those things that tends to stay hidden in folklore / common knowledge?

bob-carpenter commented 6 years ago

Yes. I meant local optima by "mode".

Nobody can deal with the issue. It's computationally intractable (at least unless P = NP). Run it multiple times with different random seeds and you get different answers. Usually it's only used for exploratory data analysis or to generate features for something else, so the multiple answers aren't a big deal---you just choose one either randomly or with human guidance.

Some of the later literature tries to add more informative priors to guide solutions. Some of the early work by Griffiths and Steyvers tried to measure just how different the different modes were that the algorithms found with random inits.

cartazio commented 6 years ago

thanks! i'll have to do a bit of digging into this :) intractability is no surprise, i was slightly imagining it might be interesting to look at the topology of how the different inits / regions of answers connect

also what does the term label switching mean here?


bob-carpenter commented 6 years ago

On Nov 12, 2017, Carter Tazio Schonwald wrote:

thanks! i'll have to do a bit of digging into this :) intractability is no surprise, i was slightly imagining it might be interesting to look at the topology of how the different inits / regions of answers connect

I don't know of any work characterizing this, even for simpler mixtures than LDA.

cartazio commented 6 years ago

Considering that would veer into topological data analysis / computational topology and likely be #P-hard, I'm not surprised. :)

What's the relabeling stuff you mentioned?


bob-carpenter commented 6 years ago

For the Griffiths and Steyvers experiment on relating topics across initializations of LDA:

• Steyvers, Mark and Tom Griffiths. 2007. Probabilistic topic models. In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch (eds.), Handbook of Latent Semantic Analysis. Lawrence Erlbaum.

They use a greedy empirical KL-divergence for alignment, which is crude, but useful.