stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
2.59k stars 370 forks source link

Incorrect behaviour of CCDFs when there is a negative random variable #1142

Closed randommm closed 9 years ago

randommm commented 9 years ago

The following program:

transformed data {
  int y[2];
  y[1] <- 0;
  y[2] <- 1;
  print(poisson_ccdf_log(0, 5));
  print(poisson_ccdf_log(1, 5));
  print(poisson_ccdf_log(y, 5));
}
parameters {
  real any;
}
model {
  any ~ normal(0, 1);  
}

Outputs:

-0.00676075 -0.0412676 -0.0480283

(the last number is the sum of the first two as expected)

However, the following program:

transformed data {
  int y[2];
  y[1] <- -1;
  y[2] <- 1;
  print(poisson_ccdf_log(-1, 5));
  print(poisson_ccdf_log(1, 5));
  print(poisson_ccdf_log(y, 5));
}
parameters {
  real any;
}
model {
  any ~ normal(0, 1);  
}

Outputs:

0
-0.0412676
0

Which doesn't make sense: the last number should have been -0.0412676 too.

The way the CCDFs are currently designed makes them ignore everything when there is a negative random variable inside the vector of random variables:

  // Explicit return for extreme values
  // The gradients are technically ill-defined, but treated as neg infinity
  for (size_t i = 0; i < stan::length(n); i++) {
    if (value_of(n_vec[i]) < 0) 
      return operands_and_partials.to_var(0.0);
  }
randommm commented 9 years ago

Another closelly related bug, for neg_binomial_ccdf and neg_binomial_cdf, it was used <= instead of <

  for (size_t i = 0; i < stan::length(n); i++) {
    if (value_of(n_vec[i]) <= 0) 
      return operands_and_partials.to_var(0.0);
  }

Because of that:

neg_binomial_ccdf_log(0, 5, 5) == 0 == log(1) but this is incorrect since P(X > 0) != 1

and

neg_binomial_cdf_log(0, 5, 5) == -inf == log(0) but this is incorrect since P(X <= 0) != 0

transformed data {
  print(neg_binomial_ccdf_log(0, 5, 5));
  print(neg_binomial_cdf_log(0, 5, 5));
}
parameters {
  real any;
}
model {
  any ~ normal(0, 1);  
}

outputs:

0
-inf
bob-carpenter commented 9 years ago

We'd had a discussion before about what to do for CDFs and CCDFs when there is a value outside support and also what to do with -inf and +inf.

I can't actually recall what we decided in terms of policy, but I think the right place to record it is in the doc. Maybe a section before all the distribution-related functions are defined.

Also, CCDF on the log scale is going to return -inf, which seems right on the log scale, but is going to be problematic for many computations. So we should also put that in the doc to let people know what to expect.

If the policy gets outlined here in this issue, I can write it into the doc if you put a pointer into the next manual issue.

On Nov 19, 2014, at 9:05 PM, Marco Inacio notifications@github.com wrote:

Another closelly related bug, for neg_binomial_ccdf and neg_binomial_cdf, it was used <= instead of <

for (size_t i = 0; i < stan::length(n); i++) { if (value_of(n_vec[i]) <= 0) return operands_and_partials.to_var(0.0); }

Because of that:

neg_binomial_ccdf_log(0, 5, 5) == 0 == log(1) but this is incorrect since P(X > 0) != 1

and

neg_binomial_cdf_log(0, 5, 5) == -inf == log(0) but this is incorrect since P(X <= 0) != 0

transformed data { print(neg_binomial_ccdf_log(0, 5, 5)); print(neg_binomial_cdf_log(0, 5, 5)); } parameters { real any; } model { any ~ normal(0, 1);
}

outputs:

0 -inf

— Reply to this email directly or view it on GitHub.

randommm commented 9 years ago

I don't this is a matter of documentation, I really think the behaviour is wrong since it calculates the wrong probabilities.

transformed data {
  int vals[2];
  vals[1] <- -1;
  vals[2] <- 5;
  print(poisson_cdf_log(vals, any_parameter));
  print(poisson_ccdf_log(vals, any_parameter));
}

Do you agree with me that in the example above the log_CDF calculates: log(P(X <= -1) P(X <= 5)) and the log_CCDF calculates: log(P(X > -1) P(X > 5)) ? [where X ~ poisson(any_parameter)]

If so, therefore:

CDF: P(X <= -1) P(X <= 5) = 0 * P(X <= 5) = 0 Log_CDF: log(0) = -inf (And that's what Stan will give you. So the CDFs and log_CDFs are OK.)

CCDF ->>>> P(X > -1) P(X > 5) = 1 * P(X > 5) = P(X > 5) Log_CCDF ->>>> log(P(X > 5)) But, the current Stan implementation will give you log(1) = 0 which doesn't make sense and might break programs.

bob-carpenter commented 9 years ago

On Nov 20, 2014, at 12:12 PM, Marco Inacio notifications@github.com wrote:

I don't this is a matter of documentation,

I didn't mean to imply it was. I just think the right thing to do is define in the doc what we want the general behavior to be, then check that all the functions match that.

Is there a spec somewhere or does someone want to chime in on the decision that was made.

Let's say we have a continuous uniform(0,1) density. In particular, I think we need consistent behavior and doc for

I really think the behaviour is wrong since it calculates the wrong probabilities.

Understood.

For a and b in support, we clearly want to obey

a <= b ===> CDF(a, theta) <= CDF(b, theta) CCDF(a, theta) => CCDF(b, theta)

which Marco's example is violating (if I understood properly).

I don't think we can get strict monotonicity because of overflow and underflow.

Then the question's what to do if a and/or b are out of support or on the boundary of support.

transformed data { int vals[2]; vals[1] <- -1; vals[2] <- 5; print(poisson_cdf_log(vals, any_parameter)); print(poisson_ccdf_log(vals, any_parameter)); }

Do you agree with me that in the example above the log_CDF calculates: log(P(X <= -1) P(X <= 5)) and the log_CCDF calculates: log(P(X > -1) P(X > 5)) ? [where X ~ poisson(any_parameter)]

If so, therefore:

CDF: P(X <= -1) P(X <= 5) = 0 * P(X <= 5) = 0 Log_CDF: log(0) = -inf (And that's what Stan will give you. So the CDFs and log_CDFs are OK.)

CCDF ->>>> P(X > -1) P(X > 5) = 1 * P(X > 5) = P(X > 5) Log_CCDF ->>>> log(P(X > 5)) But, the current Stan implementation will give you log(1) = 0 which doesn't make sense and might break programs.

— Reply to this email directly or view it on GitHub.

randommm commented 9 years ago

Agreed about defining the behaviour in the doc.

I haven't find any issues with inf/-inf or boundaries (except, as I said earlier, for the boundary bug in neg_binomial because it was used <= instead of <, but that's simple to fix).

Also, fixing the problem I outlined (in poisson_ccdf) seems straightforward: 7f15b736ab48099f4309accd34d418d11668c227 after such patch, the following example:

transformed data {
  int y[2];
  y[1] <- -1;
  y[2] <- 1;
  print(poisson_ccdf_log(-1, 5));
  print(poisson_ccdf_log(1, 5));
  print(poisson_ccdf_log(y, 5));
}
parameters {
  real any;
}
model {
  any ~ normal(0, 1);  
}

Will output:

0
-0.0412676
-0.0412676

As expected.

randommm commented 9 years ago

What was the conclusion of the meetings and discussions about this? Will it raise an exception or will it return 0 (1 for ccdf) when out of bounds?

bob-carpenter commented 9 years ago

We decided they should raise exceptions. Daniel's going through and fixing all the bounds, I believe.

On Jan 9, 2015, at 8:41 AM, Marco Inacio notifications@github.com wrote:

What was the conclusion of the meetings and discussions about this? Will it raise an exception or will it return 0 (1 for ccdf) when out of bounds?

— Reply to this email directly or view it on GitHub.

rtrangucci commented 9 years ago

@bob-carpenter This seems like the right issue to take care of consistent bounds checking / handling of pdfs because we'll need to make the behavior consistent for the corresponding cdf / cdf_log / ccdf_log

bob-carpenter commented 9 years ago

Let's do the pdf/pmf and cdf in different issues --- they seem like separate decisions to me of what to do with values outside of support.

On Jun 1, 2015, at 5:26 PM, Rob Trangucci notifications@github.com wrote:

@bob-carpenter This seems like the right issue to take care of consistent bounds checking / handling of pdfs because we'll need to make the behavior consistent for the corresponding cdf / cdf_log / ccdf_log

— Reply to this email directly or view it on GitHub.

syclik commented 9 years ago

This issue was moved to stan-dev/math#56