statrs-dev / statrs

Statistical computation library for Rust
https://docs.rs/statrs/latest/statrs/
MIT License

Negative binomial cdf floating point precision not within range of f64 minimum positive. #171

Closed noamteyssier closed 1 year ago

noamteyssier commented 1 year ago

Hey all, I noticed a small discrepancy in the floating point precision of the negative binomial CDF when compared to scipy. For very unlikely observations, statrs seems to round down to zero even though the result should be well within the range of f64. I suspect there's some rounding going on somewhere in the regularized incomplete beta function calculation below some set precision, but I haven't explored too deeply yet.

Here's a minimal reproducible example in Rust, along with the Python code for comparing results.

use statrs::distribution::{DiscreteCDF, NegativeBinomial};

fn main() {
    // Parameters far out in the right tail of the distribution.
    let (x, r, p) = (2321_u64, 269.0, 0.0405);
    let cdf = NegativeBinomial::new(r, p).unwrap().cdf(x);
    println!("{:?}", cdf);
}

which results in 0.0

and the scipy example, which under the hood calls Boost's ibeta

from scipy.stats import nbinom

x, r, p = 2321, 269, 0.0405
cdf = nbinom.cdf(x, r, p)
print(cdf)

which results in 2.803588112796817e-43

For visualization, this is what the parameterized distribution and the associated observation look like - you can see the observation is incredibly unlikely, so the value is incredibly close to zero, but since the smallest positive normal f64 is 2.2250738585072014e-308, it should still be well within the range of f64 precision.

[image: plot of the parameterized distribution with the observation]

WarrenWeckesser commented 1 year ago

The value 2.80358811[...]e-43 can be independently verified with Wolfram Alpha.

The problem is here: https://github.com/statrs-dev/statrs/blob/0c8fb1a18d1ddf95cf40b35ae3968d8745dc4cd8/src/distribution/negative_binomial.rs#L114 beta_reg is the regularized lower incomplete beta function. For the given parameters, the 64-bit floating-point result of that beta_reg call is 1.0, so when it is subtracted from 1.0 the result is 0; the difference we wanted is lost.
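The cancellation can be reproduced with nothing but plain f64 arithmetic; the 2.8e-43 here just stands in for the tail mass from this issue:

```rust
fn main() {
    // Upper-tail mass of roughly the size reported in this issue; it is
    // far below f64 machine epsilon (~2.2e-16) relative to 1.0.
    let tail: f64 = 2.8e-43;

    // The lower-tail mass 1 - tail rounds to exactly 1.0 in binary64...
    let lower = 1.0 - tail;
    assert!(lower == 1.0);

    // ...so the complement computed afterwards as 1.0 - lower is exactly
    // 0.0: the tail mass is unrecoverable once the subtraction happens.
    println!("{}", 1.0 - lower); // prints 0
}
```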

By using the symmetry of the function, I_z(a, b) = 1 - I_{1-z}(b, a) (see, for example, Incomplete beta function - Properties), that expression can be changed to

    beta::beta_reg(self.r, x as f64 + 1.0, self.p)

(which also nicely avoids the other precision-killing subtraction 1.0 - self.p). If I make that change, all the existing tests pass, and the cdf method with x=2321, r=269, p=0.0405 returns 2.803588112801757e-43, which agrees with the true value with a relative error less than 1e-11.

noamteyssier commented 1 year ago

Not sure what the scope of this project is, but this sort of problem makes me think that a dedicated survival function may be a good idea, to avoid precision-killing subtractions like the one above - in practice, users would likely compute 1 - cdf themselves post hoc when using these distributions.

It could potentially be included on the ContinuousCDF and DiscreteCDF traits? I'd be happy to work on this if there is interest.

WarrenWeckesser commented 1 year ago

... having a dedicated survival function may be a good idea to avoid potential precision-killing subtraction

I agree, a dedicated survival function (where possible) is important to have. This is a lesson we learned in SciPy. Using 1 - cdf(x) for the survival function (also known as the complementary CDF) kills all precision in the right tail. A dedicated inverse survival function should also be an option.
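To illustrate the idea, here is a minimal sketch; the trait and the toy geometric distribution below are hypothetical stand-ins, not statrs APIs. A default `sf` falls back to `1 - cdf`, and distributions that can compute the upper tail directly override it:

```rust
// Hypothetical trait sketch (not the statrs DiscreteCDF trait): `sf`
// defaults to the complement, which loses all precision in the right tail.
trait DiscreteCdfSketch {
    fn cdf(&self, x: u64) -> f64;

    // Default survival function: fine in the left tail, lossy in the right.
    fn sf(&self, x: u64) -> f64 {
        1.0 - self.cdf(x)
    }
}

// Toy geometric distribution: P(X <= x) = 1 - (1 - p)^(x + 1).
struct Geometric {
    p: f64,
}

impl DiscreteCdfSketch for Geometric {
    fn cdf(&self, x: u64) -> f64 {
        1.0 - (1.0 - self.p).powi(x as i32 + 1)
    }

    // Override with the direct upper-tail formula, (1 - p)^(x + 1),
    // which keeps full precision with no subtraction from 1.0.
    fn sf(&self, x: u64) -> f64 {
        (1.0 - self.p).powi(x as i32 + 1)
    }
}

fn main() {
    let g = Geometric { p: 0.5 };
    // The default-style complement underflows to exactly 0.0...
    assert_eq!(1.0 - g.cdf(200), 0.0);
    // ...while the direct formula still returns 2^-201 (~3.1e-61).
    assert!(g.sf(200) > 0.0);
    println!("{:e}", g.sf(200));
}
```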