haskell / statistics

A fast, high quality library for computing with statistics in Haskell.
http://hackage.haskell.org/package/statistics
BSD 2-Clause "Simplified" License
300 stars 68 forks source link

Improve accuracy of Geometric Distribution #194

Closed lorinder closed 1 year ago

lorinder commented 1 year ago

The probability mass function of the geometric distribution had some accuracy problems. Example:

ghci> da = (geometric (35 / 65536))
ghci> pa = Statistics.Distribution.probability da 106822
ghci> pa
8.812340833341934e-29

The correct answer in this case is 8.812340833326208e-29.

Problematic expressions are of the form log (1-x) for small x; replace them by log1p (-x).

Also exclude the invalid parameter choice p = 0.

Shimuuar commented 1 year ago

Thank you!

Everything is good except computation of s * (1-s) ** (fromIntegral n - 1). I thrown together quick script to check error of both approaches, images are below. Naive is old one, log1p is proposed. Z axis is relative error divided by floating point ε, so ideally it should be 1 and less.

log1p method clearly outperforms naive for small s but fails for points closer to 1. I didn't figure out why is it so and whether there's nice switchover point

P.S. I hope there's no problem with images

image image

lorinder commented 1 year ago

Thanks for these! They are very helpful. I did indeed miss the regression this causes for larger $s$. I think I've fixed this issue by switching over to the elementary implementation for larger $s$. This improved the accuracy for $s \geq 1/2$ to better than $10^{-14}$, according to my tests. For lower $s$ it is unchanged, and still at $10^{-13}$. I think the problem with the smaller $s$ value is that the need to use the log domain causes it to be a little worse than ideal.

To make collaboration easier here, I've also created the pull request that contains the code that I'm using to test these distributions, here: https://github.com/haskell/statistics/pull/199

I've found that code useful to debug and evaluate these particular distributions.

lorinder commented 1 year ago

After re-reading through it, I have some doubts about the new implementation of cumulative. It seems that using the naive $1 - (1 - s)^k$ formula may not be the best choice even for $s \geq 1/2$. Let me recheck this.

Shimuuar commented 1 year ago

Yeah. For $s \geq 1/2$ $1-s$ is exactly representable so we can use direct computation.

It seems that using the naive $1 - (1 - s)^k$ formula may not be the best choice even for $s \geq 1/2$. Let me recheck this. I'm not sure about this. In that case $1-s$ could be computed exactly. $(1 - s)^k$ will be even smaller so any rounding errors will washed out when it's subtracted from 1.

Maybe there's better way but I don't se any obvious problems

Shimuuar commented 1 year ago

I finally got time to get back to this PR. I rechecked implementation and yes it does improve precision. Thanks!