stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org

Get "infinite gradient" error for code with finite gradient #275

Open ksvanhorn opened 8 years ago

ksvanhorn commented 8 years ago

Summary:

I have a small example program that dies with "Rejecting initial value: Gradient evaluated at the initial value is not finite" even though the log probability has a finite gradient everywhere.

Description:

The same example program given in issue #274 illustrates this. Referencing the model code there, I get the error if I keep the assignment to z_lo, but not if I remove it. Yet z_lo never enters into the log probability in any way!

The example may seem contrived, but it is boiled down from a more reasonable example in which z_lo is always computed, but never used in the cases for which lo is infinite.

Reproducible Steps:

See issue #274.

RStan Version:

2.9.0

R Version:

3.2.1

Operating System:

Mac OS X El Capitan

ksvanhorn commented 8 years ago

For easier reference, here is the code again. The error goes away if you do any of the following: 1) assign a finite value to lo, 2) replace "z_lo <- lo * scale;" with just "z_lo <- lo * 3.0;", or 3) remove the assignment to z_lo altogether.

model_code <- '
data {
}
transformed data {
  real lo;
  lo <- negative_infinity();
  print("Finished transforming data.");
}
parameters {
  real scale;
}
model {
  print("Starting model.");
  scale ~ normal(0, 1);
  print("A");
  {
    real z_lo;
    print("B");
    z_lo <- lo * scale;
  }
}
'

sm <- stan_model(model_code = model_code)
fit <- vb(sm, data=list())
bob-carpenter commented 8 years ago

Thanks much for the reproducible issue. I verified it provides the same result with NUTS:

> fit <- stan(model_code=model_code)
Finished transforming data.

SAMPLING FOR MODEL '068ea98339576f574dee034a11fc1104' NOW (CHAIN 1).
  [1] "Error : Rejecting initial value:"                                                                     
  [2] "  Gradient evaluated at the initial value is not finite."                                             
  [3] "  Stan can't start sampling from this initial value."                                                 
  [4] "Rejecting initial value:"                                                                             
  [5] "  Gradient evaluated at the initial value is not finite."                                             
  [6] "  Stan can't start sampling from this initial value."       

...

[301] "Initialization between (--2, -2) failed after 100 attempts. "                                         
[302] " Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model."
error occurred during calling the sampler; sampling not done
bob-carpenter commented 8 years ago

What's happening is that the expression graph has

lp__ -> scale
z_lo -> (lo * scale)
lo -> [data: value -infinity]
scale -> [parameter]

and I think we're naively passing down the adjoint from z_lo to scale, which by the chain rule should be

scale.adjoint += z_lo.adjoint * lo;

and the problem is that z_lo.adjoint is going to be 0, so the product 0 * -infinity evaluates to NaN when lo is infinite.
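
For concreteness, here's a minimal sketch of that failure mode with plain doubles rather than Stan's vari type (the names just mirror the graph above):

#include <iostream>
#include <limits>

int main() {
  double lo = -std::numeric_limits<double>::infinity();  // data value
  double z_lo_adjoint = 0.0;   // z_lo never feeds lp__, so its adjoint stays 0
  double scale_adjoint = 0.0;

  // reverse-pass chain-rule step for z_lo = lo * scale:
  scale_adjoint += z_lo_adjoint * lo;  // 0 * -inf == NaN under IEEE 754

  std::cout << scale_adjoint << "\n";  // prints nan
  return 0;
}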

I'm not sure what the right response is here. It's tempting to call it user error, but it's cruel to our users to make them understand autodiff at this level. It'd be nice to compute the graph closure (prune nodes not connected to the root), but I don't think that's easy to do statically. So that leaves a run-time fix.

One thing we could do is update all the chain rule methods to look like this:

if (z_lo.adjoint != 0)
  scale.adjoint += z_lo.adjoint * lo;

Or we could put that straight into the propagation rule.

But what worries me about this is that it seems numerically wrong to test for a value and then do ordinary arithmetic in one case but not the other.
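
For concreteness, here is a rough sketch of what that per-operation guard might look like, using a made-up, simplified multiply vari (not Stan's actual class):

#include <limits>

// Hypothetical, simplified node for z = x * y; not Stan's actual vari class.
struct multiply_vari {
  double adj_;             // adjoint of z
  double x_val_, y_val_;   // operand values
  double* x_adj_;          // pointer to x's adjoint slot
  double* y_adj_;          // pointer to y's adjoint slot

  void chain() {
    if (adj_ != 0) {       // guard: skip propagation from dead subgraphs
      *x_adj_ += adj_ * y_val_;
      *y_adj_ += adj_ * x_val_;
    }
  }
};

int main() {
  double lo_adj = 0, scale_adj = 0;
  // z_lo = lo * scale with lo = -inf and a finite scale value:
  multiply_vari z_lo{0.0, -std::numeric_limits<double>::infinity(), 0.37,
                     &lo_adj, &scale_adj};
  z_lo.chain();  // adj_ == 0, so scale_adj stays 0 instead of 0 * -inf = NaN
  return 0;
}

Putting it straight into the propagation rule would instead hoist the adj_ != 0 test into the central loop that calls chain().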

betanalpha commented 8 years ago

How could that happen when z_lo isn’t a root in the expression graph? The reverse mode passes should skip z_lo entirely and hence never propagate the NaN to the rest of the adjoints.


bob-carpenter commented 8 years ago

I bet you find the same behavior in Nomad. The problem is that we just stack up the vari* according to the implicit topological sort of the expressions, then do the reverse pass in order down the stack. Here, we get the following stack, with each line being a new vari:

root = sum(lp_accumulator)
z_lo = lo * scale
temp1 = normal(scale | 0, 1)   [ temp1 is new variable, add to lp_accumulator ]
[ jacobian goes onto lp accumulator ]
scale = input
lp = 0

So when we pass from the root down, we pass through z_lo, which pollutes scale's gradient.

To avoid visiting unused segments of the graph, we'd have to do a connectivity analysis of the whole thing from the bottom down. We can do that implicitly as we do the reverse pass by testing for adjoint == 0. I have no idea how much this will slow down reverse mode, if at all.
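
Something like the following, where vari and the stack are simplified stand-ins for the actual autodiff internals:

#include <vector>

// Hypothetical, simplified node type; not Stan's actual vari.
struct vari {
  double adj_ = 0;
  virtual void chain() {}
  virtual ~vari() {}
};

// Central reverse pass: the stack is in topological order, so we walk it
// from the root down. Skipping nodes whose adjoint is still 0 implicitly
// prunes subgraphs that are not connected to the root.
void grad(std::vector<vari*>& var_stack, vari* root) {
  root->adj_ = 1;
  for (auto it = var_stack.rbegin(); it != var_stack.rend(); ++it)
    if ((*it)->adj_ != 0)
      (*it)->chain();
}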


betanalpha commented 8 years ago

Ah, yes, I remember this happening a long time ago. The topological sort of the nodes in the expression graph doesn’t prevent zero * NaN or zero * inf adjoints from polluting the graph.

Honestly, I’m okay with this behavior, as NaN and inf arithmetic is generally ill-defined.


bob-carpenter commented 8 years ago

:-) That's why I said I wasn't sure what the right response was. In some sense, it'd be nice to eliminate the pollution. On the other hand, when it happens, it usually indicates an issue in the user's code.
