stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Systemic bug in Stan sampler #1917

Closed suhailshergill closed 8 years ago

suhailshergill commented 8 years ago

Summary:

Systemic bug in Stan inference engine

Description:

As demonstrated in C-K Hur, A. V. Nori, S. K. Rajamani, and S. Samuel, "A Provably Correct Sampler for Probabilistic Programs" (alternate link), Stan does not seem to converge to the correct distribution for the following model:

double x, y;
x ~ Gaussian(0, 1);
if (x > 0) then
   y ~ Gaussian(10, 2);
else
   y ~ Gamma(3, 3);
return y;

Reproducible Steps:

See ipython notebook 1, which demonstrates the steps to reproduce the behaviour with the following Stan model:

"""
parameters {
    real x;
    real y;
}
model {
    x ~ normal(0, 1);
    if(x > 0) {
        y ~ normal(10, 2);
    } else {
        y ~ gamma(3, 3);
    }
}
"""

Current Output:

See Example 1 in ipython notebook 2. Also refer to that notebook for an alternate formulation, which Stan infers correctly.

Expected Output:

See the graphs in ipython notebook 2 and ipython notebook 1.

Additional Information:

As mentioned in the description, we believe this points to a systemic flaw in Stan's sampler. The paper provides the general solution as well as examples of other problematic models.

Current Version:

v2.9.0

bob-carpenter commented 8 years ago

Thanks for the clear report. This isn't a bug in Stan so much as a defect in our documentation and parser warnings. If anything is at fault, it's the inherent limitations of the Hamiltonian Monte Carlo sampler, so there's nothing to fix in Stan itself, and I'm going to close this issue.

Stan's HMC and NUTS samplers require continuously differentiable density functions because they take the negative log density to be a potential energy function and follow the Hamiltonian dynamics of a fictional particle. When you introduce a jump, as at x = 0 in the Stan program listed in the issue, it defeats the ability to simulate the Hamiltonian. It would fail even more dramatically with a gamma with shape of 1 or less.
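[For concreteness, here is the standard construction being referred to (a textbook summary, not a quote from this comment). HMC introduces an auxiliary momentum p and simulates the dynamics of the Hamiltonian

H(\theta, p) = -\log \pi(\theta) + \frac{1}{2} p^\top M^{-1} p

where M is the mass matrix. The leapfrog integrator that simulates these dynamics repeatedly evaluates the gradient \nabla_\theta \log \pi(\theta), which does not exist where the density jumps, so trajectories that cross the discontinuity are integrated incorrectly and the resulting proposals are unreliable.]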

We've been meaning to add a compiler warning against conditionals that depend on parameters. Other than in spline or other interpolation models, they almost always introduce discontinuities that HMC can't handle.

If you only wanted to recreate the output you're after in Stan, it could be done in the generated quantities block, which really does work as a random number generator.

The model implied by your program is not only unusual in using a continuous variable on which to condition a mixture, but in mixing distributions of different support (gamma requires positive values). Assuming you wanted to focus on positive values, you'd write this mixture prior (even mix of half normal and gamma) in Stan as follows:

transformed data {
  real<lower=0, upper=1> lambda;
  lambda <- 0.5;
}
parameters {
  real<lower=0> theta;
}
model {
  // log p(theta | lambda)
  // = log(lambda * normal(theta | 10, 2)
  //       + (1 - lambda) * gamma(theta | 3, 3))
  increment_log_prob(log_sum_exp(log(lambda) + normal_log(theta, 10, 2),
                                 log1m(lambda) + gamma_log(theta, 3, 3)));
}

But that's going to be problematic to fit due to multimodality with a fairly deep trough between the modes. This makes it hard for the HMC sampler to move around the posterior. If you run the above, you'll find the Markov chains produced by the sampler getting stuck in one of the two modes.
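[To put rough numbers on that trough (back-of-envelope arithmetic, not from the original comment): the target here is

p(\theta) = \frac{1}{2} \mathrm{Normal}(\theta \mid 10, 2) + \frac{1}{2} \mathrm{Gamma}(\theta \mid 3, 3),

with modes near \theta \approx 2/3 (density \approx 0.41) and \theta = 10 (density \approx 0.10), while at \theta = 5, between them, the density is only about 0.004, one to two orders of magnitude lower. A chain sitting in one mode has to cross that low-probability region to reach the other, which happens only rarely.]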

Was there a particular application you had in mind for this kind of mixture?

P.S. I could open the Python notebooks, but not the linked paper. Stan does not follow anything like the approach in the Wingate et al. paper. It uses HMC (specifically NUTS). There are plenty of references in the manual.

syclik commented 8 years ago

Thanks, Bob. I was just about to address this and I'm glad you already did.

@suhailshergill, I tried following the paper, but if you really wanted to "fit" that model, you'd just do:

model {
}
generated quantities {
  real x;
  real y;

  x <- normal_rng(0, 1);
  if (x > 0) {
    y <- normal_rng(10, 2);
  } else {
    y <- gamma_rng(3, 3);
  }
}
suhailshergill commented 8 years ago

thanks @bob-carpenter for the prompt response. i'm far from an expert stan user, so i'm not sure how clear it is to users that such discontinuities are not supported in the model block, but quite a few individuals in our meetup were taken aback by the silent failure. A compiler warning would certainly help greatly in this regard. Is there an issue regarding that?

What doesn't help is the difference in behaviour. Specifically, while the below (silently) fails:

"""
parameters {
    real x;
    real y;
}
transformed parameters {
    real ret;
    ret <- y;
}
model {
    x ~ normal(0, 1);
    if (x > 0) {
        y ~ normal(10, 2);
    } else {
        y ~ gamma(3, 3);
    }
}
"""

the transformed model (which is seemingly equivalent) works:

"""
parameters {
    real x;
    real y;
    real z;
}
transformed parameters {
    real ret;
    if (x > 0) {
      ret <- y;
    } else {
      ret <- z;
    }
}
model {
    x ~ normal(0, 1);
    y ~ normal(10, 2);
    z ~ gamma(3, 3);
}
"""

Perhaps I need to read the documentation regarding the subtle differences between what's supported in the model, transformed parameters, and generated quantities blocks.

tscholak commented 8 years ago

@syclik I am the author of one of the Python notebooks. I tried to run your examples; specifically, I tried to run

model {
}
generated quantities {
    real x;
    real y;
    real ret;
    x <- normal_rng(0, 1);
    if (x > 0) {
        y <- normal_rng(10, 2);
    } else {
        y <- gamma_rng(3, 3);
    }
    ret <- y;
}

(don't mind the ret, it's just for further processing in Python). Since the program doesn't have any parameters, I used the "Fixed_param" algorithm with PyStan. However, this crashes. I'm not sure where; I don't see a stack trace on the command line. I'm on Stan 2.9.0.

syclik commented 8 years ago

The fixed_param sampler doesn't work in PyStan v2.9.0. It does work with CmdStan v2.9.0 and RStan v2.9.0.

If you want to see it running, change it to:

parameters {
  real<lower = 0, upper = 1> theta;
}
model { }

generated quantities {
    real x;
    real y;
    real ret;
    x <- normal_rng(0, 1);
    if (x > 0) {
        y <- normal_rng(10, 2);
    } else {
        y <- gamma_rng(3, 3);
    }
    ret <- y;
}

Regarding the iPython notebook, if you've run only one chain, you might see that sort of behavior. If you ran multiple chains, my guess is that R-hat would signal a problem with convergence. (I don't know if that's how you ran it.) Although there isn't a compile-time error, there should be a clear diagnosis that there isn't convergence and the output shouldn't be trusted. It's one of the reasons why we encourage running multiple chains.


tscholak commented 8 years ago

@syclik Thanks, I'll try that then.

I did see problems with convergence. I ended up running 32 chains to get somewhat consistent results.

syclik commented 8 years ago

When you have convergence problems, merely running more chains won't solve the problem. Did running 32 actually get split R-hat down to 1?

Running multiple chains will hopefully tell you if there exists a problem (when you heed the existing convergence diagnostics).


tscholak commented 8 years ago

I'm getting an R-hat of 1.

example_1 = """
parameters {
    real x;
    real y;
}
transformed parameters {
    real ret;
    ret <- y;
}
model {
    x ~ normal(0, 1);
    if (x > 0) {
        y ~ normal(10, 2);
    } else {
        y ~ gamma(3, 3);
    }
}
"""
sm = pystan.StanModel(model_code=example_1)
fit = sm.sampling(data={},
                  iter=30000,
                  warmup=15000,
                  chains=32)

and the fit gives:

Inference for Stan model: anon_model_8c6f88b7c6812397267935ab183ea884.
32 chains, each with iter=30000; warmup=15000; thin=1; 
post-warmup draws per chain=15000, total post-warmup draws=480000.

       mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
x      0.74    0.01   0.65 -8.0e-3  0.28   0.64   1.12   2.22 2219.0   1.01
y      9.77    0.08   2.43   2.71   8.52   9.93  11.31  13.89  970.0   1.03
ret    9.77    0.08   2.43   2.71   8.52   9.93  11.31  13.89  970.0   1.03
lp__  -1.06    0.02   1.09  -4.02  -1.45  -0.71  -0.29  -0.02 2079.0   1.02

Samples were drawn using NUTS(diag_e) at Mon Jun 13 22:53:29 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

But this is all beside the point. These examples are just pathological models from that Hur et al. paper, designed to make a point about how branches and loops in probabilistic programming languages can cause problems for MCMC-based inference.

bob-carpenter commented 8 years ago

I created an issue for a compiler warning here in case people didn't notice it in the threaded view: https://github.com/stan-dev/stan/issues/1918

@tscholak Thanks for running the above. I think the point is that the mean of x is supposed to be 0. The densities aren't going to be comparable because Stan drops constants, so you can't write a mixture with that kind of conditional. The real problem is that even if you could, it would imply arbitrarily distinct regimes for y based on what x is, so there's no smooth way to transition from drawing from one component of the mixture to the other when x crosses zero.
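[To spell out the dropped-constants point (a supplementary derivation, not part of the original comment): the joint density the program is meant to define is

p(x, y) = \mathrm{Normal}(x \mid 0, 1) \cdot \mathrm{Normal}(y \mid 10, 2)^{[x > 0]} \cdot \mathrm{Gamma}(y \mid 3, 3)^{[x \le 0]}.

Each ~ statement, however, increments the log target only up to an additive constant, and the normal and gamma branches drop different constants. A "constant" that depends on which branch was taken is a function of the parameter x, not an overall constant of the joint density, so on top of the discontinuity itself the two branches end up re-weighted by an arbitrary factor and the sampled target is not the density written down.]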

bob-carpenter commented 8 years ago

There are lots of ways to generate draws from this, the most efficient being the pure Monte Carlo method with generated quantities. You can use an auxiliary variable method that mixes, but there's a huge problem (see below):

parameters {
  real x;
  real y1;
  real<lower=0> y2;
}
transformed parameters {
  real y;
  y <- if_else(x < 0, y1, y2);
}
model {
  x ~ normal(0, 1);
  y1 ~ normal(10, 2);
  y2 ~ gamma(3, 3);
}

which yields completely independent x, y1, and y2 draws and returns y as a transformed parameter. And it all works as expected:

Inference for Stan model: mix.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

      mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
x     0.02    0.02 1.00 -1.92 -0.68  0.01  0.67  2.03  2752    1
y1   10.02    0.04 2.05  6.01  8.68 10.02 11.37 14.09  2874    1
y2    1.00    0.01 0.58  0.21  0.57  0.90  1.30  2.46  2500    1
y     5.47    0.08 4.73  0.27  0.91  3.15  9.97 13.59  3355    1
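[A quick check on why this auxiliary-variable construction targets the right marginal (my arithmetic, not from the original comment): x, y1, and y2 are independent and P(x < 0) = 1/2 under the standard normal, so

p(y) = \frac{1}{2} \mathrm{Normal}(y \mid 10, 2) + \frac{1}{2} \mathrm{Gamma}(y \mid 3, 3).

(The branch orientation is flipped relative to the original program, but with equal weights the marginal is the same.) The fitted mean of y, 5.47 \approx (10 + 1)/2, matches the equal-weight mixture of the component means, the Gamma(3, 3) mean being 3/3 = 1.]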

The huge problem alluded to earlier is that the variable y is useless because the entire model including y is still not differentiable. And by useless, I just mean you can't condition things on it, so I couldn't, for example, use y as a mixture prior.

betanalpha commented 8 years ago

Everyone stand back: computer scientists are trying to do statistics again and it’s not pretty. Firstly, the original link sent earlier in the thread doesn’t seem to work, but you can grab the paper at http://research.microsoft.com/pubs/255950/final.pdf.

Given the typical CS hyperbole, it’s understandable how one could get confused and think that there are deeper issues here. But this is largely a consequence of, to put it delicately, misunderstandings by the authors of that paper.

For example, in the “loop” example of Figure 3 Stan actually is getting the right answer. The problem is that the authors completely misinterpret the Stan language, associating “~” with a sampling operation, so that the loop somehow recursively defines a convolution. But as discussed extensively in the manual, “~” is just syntactic sugar for incrementing the log target density, so

x ~ normal(x, 3);

literally translates to

log_target += normal_log(x, x, 3);

which evaluates to zero given the cancellation of the x’s. So the model actually being defined is just

x ~ normal(0, 1)

which is exactly what Stan reports!
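[Spelling that step out (a supplementary derivation, not part of the original comment):

\mathrm{normal\_log}(x, x, 3) = -\frac{(x - x)^2}{2 \cdot 3^2} - \log 3 - \frac{1}{2} \log 2\pi,

so the quadratic term vanishes identically and what remains is an additive constant with no dependence on x (and one that ~ drops anyway). The statement contributes nothing to the shape of the target, leaving only the N(0, 1) prior.]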

Bob and Dan have already commented on the issues with the second model. It’s effectively trying to fit a discrete model which Stan is not equipped to handle. More importantly, Stan is in no way “silently” failing. Firstly, there should be at least a few divergences when both x < 0 and y < 0 and Stan tries to evaluate the gamma_log density at negative y. Secondly, you’re getting only O(1000) effective samples out of O(500,000) transitions, and for autocorrelations that large you can’t even really trust the effective sample size estimators (there are nasty feedbacks between the higher-order autocorrelations and the estimator variance, etc). This is also why the Rhats are so bad.
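[For reference, the effective sample size in question is the standard autocorrelation-based estimate (the textbook formula, not quoted from the comment):

n_{\mathrm{eff}} = \frac{N}{1 + 2 \sum_{t=1}^{\infty} \rho_t},

where \rho_t is the lag-t autocorrelation. In practice the sum must be truncated and estimated from the chain itself, and when the \rho_t decay slowly that empirical sum is itself noisy, which is the feedback between the higher-order autocorrelations and the estimator variance mentioned above.]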

Ultimately what’s happening is that when Stan makes large jumps in x, it requires huge jumps in y to move between the modes of those two components. Essentially the chain has to reconverge again and again, but it’s not given enough time to do so, inducing the large autocorrelations and untrustworthy fits.

To conclude, the paper is very misleading and completely mischaracterizes Stan (I won’t make any comments about their awesome “solution”), so these aren’t really “pathological” models in the sense that Stan should fit them but can’t. In the loop case Stan is doing the right thing and in the mixture case Stan never claims to be able to fit that well and throws plenty of diagnostics that something is going wrong.

As Bob notes we can have the compiler flag these cases, but it’s a tricky issue in that in some cases parameter-based conditions can be okay. So we’re left with another tough decision between facilitating a few power users and trying to automatically identify potential problems for newer users.
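[As an illustration of the "can be okay" case (a hypothetical sketch in Stan 2.x syntax, not a model from this thread): a conditional on a parameter is harmless when both branches agree at the cut point, for example when guarding a removable singularity.

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real theta;
  real<lower=0> sigma;
}
model {
  real mu;
  // Branch on a parameter, but the target stays smooth: near zero we
  // use the Taylor expansion of sin(theta) / theta, which agrees with
  // the exact expression at the cut point, so no discontinuity is
  // introduced into the log density.
  if (fabs(theta) < 1e-4)
    mu <- 1 - theta * theta / 6;
  else
    mu <- sin(theta) / theta;
  theta ~ normal(0, 1);
  sigma ~ cauchy(0, 2);
  y ~ normal(mu, sigma);
}

A warning that flagged every conditional on a parameter would also flag programs like this one, which is exactly the power-user trade-off described above.]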


TRANTANKHOA commented 8 years ago

I am not a Stan developer. However, I was intrigued by the accusation put forward against Stan, so I read the entire paper mentioned in the original post. I realise now that the authors all come from a technical field that is not computational statistics, most likely computer science/machine learning. Before trying to clear up some of the confusion the authors have about MCMC in general and Stan in particular, I earnestly suggest they get at least one hardcore computational statistician involved on their team.

First of all, Stan obviously did not perform satisfactorily in the examples given by the authors. However, the cause of this disappointment has nothing to do with Stan and everything to do with the way Stan was employed. As a complete piece of software, Stan is capable of much more than just performing HMC. As we have seen from the previous posts, the examples given can easily be solved using generated quantities or transformed parameters, which only invoke the necessary random number generators, not HMC. However, the authors intentionally invoke the HMC algorithm to sample from these models. The first example results in a nonsmooth density. The second example in the same paper results in an 11-dimensional density disguised as a one-dimensional density. Both of these examples constitute intentional misuse of HMC. There is nothing wrong with either the code or the algorithm inside Stan. And it should be obvious at this point that Stan is not fool-proof software, i.e. users are expected to use Stan properly, with the help of the development team.

Second and more importantly, the authors are utterly confused about the theoretical foundation of MH, HMC, and MCMC in general. Using the authors' notation, the program P is supposedly executed in each iteration to generate a new proposal. This is just wrong. That is not what happens in Stan or in any MCMC algorithm. Instead, the program P is necessarily translated into a density function T, which is done automatically in Stan (and manually if we are using other MCMC algorithms that Stan does not support). The state in each iteration is generated from the proposal mechanism Q, which can be a density or something else (in the case of HMC, it is the discretized solution of the Hamiltonian equations). Either way, we do not execute P in any iteration; we only need T and Q. Therefore the whole discussion about how Stan (and R2) go wrong because they are ignorant of the specifics of P is really misleading. From my reading, this seems to be an earnest misunderstanding; or is it?
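[This is visible directly in the Metropolis-Hastings acceptance step (the standard formula, not quoted from the comment): a proposal x' \sim Q(\cdot \mid x) is accepted with probability

\alpha(x, x') = \min\left(1, \frac{T(x')\, Q(x \mid x')}{T(x)\, Q(x' \mid x)}\right),

which requires only pointwise evaluations of the target T and the proposal Q; nowhere does the program P itself need to be re-executed.]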

Third, the proposed solution in the paper essentially boils down to automatically expanding the sampling space to one dimension per random number generation in the program P, so that the original target T is preserved as a marginal density. Conventional Metropolis-Hastings and Reversible Jump (Peter Green 1995) already cover the theoretical support for this solution. I can't comprehend the notation in the theorems/proofs given in the paper. It was truly a nightmare. While the approach is theoretically valid, I fail to see why it is an innovative sampling approach. MH is notoriously known to fail in high dimensions. Also, why would anyone run an MH algorithm when we can draw i.i.d. samples? And how could I use the proposed approach to sample from a likelihood function that is not a standard density? I just don't see how. The simulations given just raise many more suspicions. The only results published were the computing times; no specific models, no graphs, no data were given. How on earth is that publishable?

I really want to ask for my afternoon back! Truly!

suhailshergill commented 8 years ago

thanks @bob-carpenter for creating the compiler warning issue! @TRANTANKHOA the notation in the paper is truly a nightmare, and is a consequence of their approaching this from the PL side (denotational semantics etc). i'm not sure whether the errors in their use of stan were a result of misunderstanding or something more, but since that explanation suffices, i personally am ok giving them the benefit of the doubt. while i can't give you back your afternoon, i would like to thank you for taking the time to go through the paper and bring your thoughts to light.

@betanalpha agreed that CS folks (of which i'm one) are trying to do stats :) probabilistic programming systems seem to give us plenty of rope, and we seem quite equipped to use that rope to hang ourselves. thanks for shedding light on the matter. The issue, in part, seems to be a notational one indeed. Specifically, ~ when used in CS/ML often stands for sampling, and so it is understandable why the authors made that mistake (in fact I did so too, forgetting as you rightly point out that in Stan ~ is just syntactic sugar for incrementing the log target density).

I'm not sure if the examples given in the paper were reduced versions (for didactic purposes) of models that the authors were having trouble with or something else, but it's illuminating to finally understand the details of why it happens (and what that means).

tscholak commented 8 years ago

Well, I am a physicist, and I spent several afternoons on that paper, too.

While I can’t say that I had a jolly good time making it through their notation, I liked the premise enough to warrant the issue some thought.

Btw, exchanging thoughts is really the only reason I am here, and this is getting less and less enjoyable. The tone in this thread has become increasingly heated, and I don’t condone that. In response to @betanalpha’s opening statement, I could equally (and wrongfully) claim that Stan is statisticians trying to do computer science, but these kinds of accusations are not only demeaning, they are also absolutely not pertinent to discussing this issue.

Now, returning to the issue, I think the predominant opinion is that this is really a non-issue, because the original authors of that paper (purposely?) misunderstood what Stan does or what its semantics tell you it is doing (e.g., ~ being syntactic sugar for increment_log_prob). Be that as it may, I personally was somewhat disappointed to learn that Stan cannot be used for what the authors set out to do: inferring the distribution of variables in an imperative probabilistic program with ifs, elses, and loops, which I consider to be an interesting and challenging problem. We don’t need to talk about MH being an abysmal answer to it, I guess that is obvious. However, we should acknowledge that there might be a need for communicating to the user that conditioning on parameters is a bad idea, and I am grateful to @bob-carpenter for taking the initiative on this matter, see https://github.com/stan-dev/stan/issues/1918.

@TRANTANKHOA I don’t think at all that the paper is saying that the program P is executed in each iteration to generate a new proposal. What is executed is their INFER algorithm, which takes a P and uses it to build T by means of EVAL.

TRANTANKHOA commented 8 years ago

Well, here is a quote from the paper: "In order to perform MH sampling on a probabilistic program, we need to run the program P which results in the state being constructed incrementally, as P executes along a path. Along the path, the program encounters several probabilistic assignments to variables, and to generate a new value for each variable, we require the corresponding old value of the variable from the previous run of the program. Such an association between old and new values is easy if each variable is assigned only once, or if the program is a single path without branches or loops." So indeed, that is what they say.

Also, Stan can be used for what the authors set out to do: inferring the distribution of variables in an imperative probabilistic program with ifs, elses, and loops. The authors just did not do it properly.

Also, don't you notice that what the authors set out to do boils down to i.i.d. sampling every time, and that is not interesting at all?

I don't think anyone here lacks respect for computer scientists. However, condemning bad science is necessary too. Maybe Michael was not as politically correct as he should have been, but he is just calling out the hard truth. Given that Michael is one of the core developers of this software, I can totally understand why he is offended by the unfounded accusation that Stan is fundamentally flawed. Asking for help solving a problem is one thing; claiming to point out a problem that is not there is another. So I don't think Michael started the fight. Notice that I'm not a Stan developer myself.

sakrejda commented 8 years ago

@TRANTANKHOA @suhailshergill @tscholak I really appreciate that you guys are willing to be involved in this discussion at such depth. This is one of those corners of Stan that's hard to explain properly in part because of ill-informed attacks such as the one laid out in the above-mentioned paper. I like that we can keep it civil (most of the time?) but could we move further discussion to the users' list where other people might benefit from it and understand/join in?

TRANTANKHOA commented 8 years ago

Hi @sakrejda, as far as contributing to the discussion goes, I think I already made my points. But people are welcome to carry the discussion on to the users' list. Cheers!

tscholak commented 8 years ago

@TRANTANKHOA They write "incrementally", not "repeatedly", which is what you first alluded to.

@sakrejda Absolutely. Can a conversation (and its history) be migrated to the mailing list?

bob-carpenter commented 8 years ago

On Jun 14, 2016, at 8:30 AM, Torsten Scholak notifications@github.com wrote:

Well, I am a physicist, and I spent several afternoons on that paper, too.

While I can’t say that I had a jolly good time making it through their notation, I liked the premise enough to warrant the issue some thought.

Btw, exchanging thoughts is really the only reason I am here, and this is getting less and less enjoyable. The tone in this thread has become increasingly heated, and I don’t condone that.

Me, either. Nothing to be gained that way.

In response to @betanalpha’s opening statement, I could equally (and wrongfully) claim that Stan is statisticians trying to do computer science, but these kinds of accusations are not only demeaning, they are also absolutely not pertinent to discussing this issue.

I completely agree about the demeaning bit. We get it a lot the other way from computer scientists, and it's not pleasant. And completely uncalled for.

For what it's worth, Stan's mostly built by people with CS backgrounds, like me and Matt Hoffman, or physics backgrounds, like Michael. Andrew's the only proper statistician in the group (as in having a Ph.D. in stats), and he's not involved in the code-level development.

Now, returning to the issue, I think the predominant opinion is that this is really a non-issue, because the original authors of that paper (purposely?) misunderstood what Stan does or what its semantics tell you it is doing (e.g., ~ being syntactic sugar for increment_log_prob).

I don't think they misunderstood what Stan was doing on purpose. I think it's up to us to make that clear. There's an entire DARPA (U.S. defense research funding) project aimed at building probabilistic programming languages that can fit the kind of model first posed. Why, we're not so sure. These types of models tend not to come up in our applied work, and they're not the kind of thing most of our users would come up with.

Be that as it may, I personally was somewhat disappointed to learn that Stan cannot be used for what the authors set out to do: inferring the distribution of variables in an imperative probabilistic program with ifs, elses, and loops, which I consider to be an interesting and challenging problem.

Challenging, yes. Not so sure what it's useful for, but we try to stay open-minded about interesting use cases and don't claim that Stan can do everything.

Stan can deal with loops and conditionals, but not if they introduce discontinuities in the densities.

We don’t need to talk about MH being an abysmal answer to it, I guess that is obvious. However, we should acknowledge that there might be a need for communicating to the user that conditioning on parameters is a bad idea, and I am grateful to @bob-carpenter for taking the initiative on this matter, see #1918.

The issue is my way of agreeing. I'll fix that and try to write some clearer doc.

bob-carpenter commented 8 years ago

On Jun 14, 2016, at 7:12 AM, Suhail Shergill notifications@github.com wrote:

... Specifically, ~ when used in CS/ML often stands for sampling,

It usually doesn't mean that anywhere. Even in something like Church, the ~ just specifies the forward model. But you can't literally execute it as a random sample and hope to get anywhere.

That's why we've been considering removing ~ from the language altogether so that it's clear to users that all a Stan program does is define a target density on the log scale up to an additive constant that doesn't depend on parameters.

bob-carpenter commented 8 years ago

@betanalpha: Thanks for clarifying what's going on so clearly.

Please don't stereotype whole groups of scientists as you do in the first sentences of the first two paragraphs. I think everyone's trying to do the right thing here, and comments like this only serve to annoy people. I very much want to maintain our friendly lists, even if someone comes onto our list looking to start a fight.

To mix family and basketball metaphors, it's like we're older siblings---if the younger sibling punches us first, we still get the foul for retaliating.


andrewgelman commented 8 years ago

Hi all. I didn't follow all the details of this discussion, but as we've been talking about the verbose parser that picks up possible errors, it strikes me that this is another thing we can add: certain if/else statements that are bad news for HMC. A


suhailshergill commented 8 years ago

It usually doesn't mean that anywhere. Even in something like Church, the ~ just specifies the forward model.

@bob-carpenter my bad. when i said "sampling" i meant drawing/sampling a value from a distribution in the sense of a generative process (i.e. "forward model").

But you can't literally execute it as a random sample and hope to get anywhere.

agreed.

thank you for your constructive feedback; taking some of my followup thoughts to #1918

bob-carpenter commented 8 years ago

On Jun 14, 2016, at 11:27 PM, Suhail Shergill notifications@github.com wrote:

It usually doesn't mean that anywhere. Even in something like Church, the ~ just specifies the forward model.

@bob-carpenter my bad. when i said "sampling" i meant drawing/sampling a value from a distribution in the sense of a generative process (i.e. "forward model").

This is where we got the notation, but misinterpretation of execution is why we're considering just eliminating ~.

When one writes

y ~ foo(theta)

to specify a graphical/generative model, the implication is that it contributes a term to the joint log density:

log p(y, theta, ...) = log foo(y | theta) + ...

Stan literally executes the increment when it sees the ~ (and drops constants in the evaluation of foo). BUGS or PyMC build up graphical models, which BUGS executes with Gibbs (or non-conjugate generalizations thereof), whereas PyMC3 does what Stan does and converts to a log density for sampling with NUTS. Execution in something like Church/Venture, etc., is more complicated; it defines the program structure, but the program structure isn't just executed by running the forward model.
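[As a concrete illustration of that desugaring (a minimal sketch in Stan 2.x syntax, not a program from this thread):

data {
  real y;
}
parameters {
  real mu;
}
model {
  // The sampling statement
  //   y ~ normal(mu, 1);
  // is sugar for the explicit increment below, except that ~ also
  // drops additive constants that do not depend on parameters.
  increment_log_prob(normal_log(y, mu, 1));
}

Nothing is ever drawn from the normal distribution in either form; both simply add a term to the log target that the sampler then explores.]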