haskell / statistics

A fast, high quality library for computing with statistics in Haskell.
http://hackage.haskell.org/package/statistics
BSD 2-Clause "Simplified" License

Incorrect histogram_ bounds check #163

Closed GregorySchwartz closed 2 years ago

GregorySchwartz commented 4 years ago

The data set is too large for me to narrow down easily, but I can try. However, I get:

/Data/Vector/Generic/Mutable.hs:697 (read): index out of bounds (10,10)

with 10 bins for data ranging from 0.0 to 747.0564541606117, with the range set explicitly to those values (that is, to the minimum and maximum of the list). Is there a rounding issue here?

GregorySchwartz commented 4 years ago

Using ceiling on the upper bound resolves the issue, so something must be wrong with the calculation of the last bin.

Shimuuar commented 4 years ago

Floating point strikes again. Here is a reproducer:

> (\hi -> histogram_ 10 0 hi (U.fromList [hi::Double]) :: U.Vector Double) 747.0564541606117
*** Exception: ./Data/Vector/Generic/Mutable.hs:697 (read): index out of bounds (10,10)

The problem is that when the upper limit of the histogram is set to the maximum value of the sample, that value can land in bin N+1, which is out of range. I'm not sure how to fix this.
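
To illustrate the failure mode, here is a minimal sketch of a naive bin-index calculation. It is not the library's code: naiveBin and validIndex are hypothetical helpers, they work on plain Double values rather than vectors, and they ignore any epsilon adjustment the real histogram_ may apply to the bin width. The sketch only shows why a sample equal to the upper limit maps to index N, one past the last valid bin.

```haskell
-- Hypothetical helpers, not part of the statistics package.

-- Map a sample x to a bin index, assuming n equal-width bins over [lo, hi].
-- Assumes n > 0 and lo < hi.
naiveBin :: Int -> Double -> Double -> Double -> Int
naiveBin n lo hi x = floor ((x - lo) / d)
  where
    d = (hi - lo) / fromIntegral n  -- width of one bin

-- Valid bin indices are 0 .. n-1.
validIndex :: Int -> Int -> Bool
validIndex n i = i >= 0 && i < n
```

With 4 bins over [0, 8], naiveBin 4 0 8 8 evaluates to 4, which validIndex 4 rejects because the valid indices are 0 to 3.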

GregorySchwartz commented 4 years ago

If it's just a floating point issue, can we assume it's always N+1 in this case? If so, can we just clamp it to the last bin?

Shimuuar commented 4 years ago

Not quite. histogram_ is underspecified for out-of-range inputs. What should it do in the following case?

histogram_ 10 0 1 [2]

In histogram-fill I had special underflow/overflow bins. Here the original semantics should be kept: anything outside the specified range should throw an exception. The only question is how to calculate the bins.

GregorySchwartz commented 4 years ago

How about clamping if the value is within floating point precision error of the bound?

Shimuuar commented 4 years ago

I think that's what has been tried. It didn't quite work out:

https://github.com/bos/statistics/blob/6aedd2dd7c595b308c4a005fec96029fd6df3dbe/Statistics/Sample/Histogram.hs#L78

I think it's viable, but it requires a very careful implementation.

Shimuuar commented 4 years ago

Not to mention that this approach is plain wrong: it pretends to work for any RealFrac, but the constant is for Double.
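
The type problem can be made concrete with a small sketch. epsilonOf below is a hypothetical helper (not from statistics or math-functions) that derives machine epsilon from the RealFloat class in base. It shows that the constant differs between Float and Double. Because it needs RealFloat, it cannot be written for other RealFrac types such as Rational, which have no such constant.

```haskell
-- Hypothetical helper: machine epsilon of a floating-point type, i.e. the
-- gap between 1 and the next larger representable value. The argument is
-- used only to select the type.
epsilonOf :: RealFloat a => a -> a
epsilonOf x = encodeFloat 1 (1 - floatDigits x)

epsDouble :: Double
epsDouble = epsilonOf 1   -- 2^-52 for IEEE 754 double precision

epsFloat :: Float
epsFloat = epsilonOf 1    -- 2^-23 for IEEE 754 single precision
```

A generic function that widens the bin width by a fixed Double epsilon would therefore be too tight for Float and meaningless for exact types.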

GregorySchwartz commented 4 years ago

What about special-casing the final bin when the element is equal to the upper bound?
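
One way to read this suggestion, sketched under assumptions: keep the rule that out-of-range samples are rejected, but treat the range as closed on the right so that a sample exactly equal to the upper bound lands in the last bin. binIndex and histogramList are hypothetical names, the code uses plain lists and Double instead of the library's generic vectors, and it returns Nothing where the discussion above calls for an exception. Clamping in-range samples with min is an extra guard against the quotient rounding up to N; it was not proposed in the thread.

```haskell
-- Hypothetical helpers, not the statistics API. Assume n > 0 and lo < hi.

-- Bin index for x with n equal-width bins over the closed range [lo, hi].
-- Returns Nothing for NaN and for samples outside the range.
binIndex :: Int -> Double -> Double -> Double -> Maybe Int
binIndex n lo hi x
  | isNaN x || x < lo || x > hi = Nothing   -- genuinely out of range
  | x == hi   = Just (n - 1)                -- upper bound goes in the last bin
  | otherwise = Just (min (n - 1) (floor ((x - lo) / d)))
  where
    d = (hi - lo) / fromIntegral n          -- width of one bin

-- Count samples per bin; Nothing if any sample is out of range.
-- Quadratic in the worst case, which is fine for an illustration.
histogramList :: Int -> Double -> Double -> [Double] -> Maybe [Int]
histogramList n lo hi xs = do
  idxs <- traverse (binIndex n lo hi) xs
  pure [ length (filter (== i) idxs) | i <- [0 .. n - 1] ]
```

With this sketch, binIndex 10 0 747.0564541606117 747.0564541606117 is Just 9, and histogramList 10 0 1 [2] is Nothing.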