scalanlp / breeze

Breeze is/was a numerical processing library for Scala.
https://www.scalanlp.org
Apache License 2.0
3.45k stars 692 forks source link

Bug in calculation of histogram binEdges when step size causes omitting endpoint. #725

Closed espears4sq closed 6 years ago

espears4sq commented 6 years ago

In the algorithm for how binEdges is calculated in breeze.stats.hist, there can be cases where end + (end - start) / bins will end up being exactly equal to the final step point, and thus not included since rangeD looks for values strictly less than the endpoint.

Here's an example where using two bins ought to provide 3 locations for binEdges, but instead it only provides 2.

scala> val d = breeze.linalg.DenseVector.rand[Double](100)
d: breeze.linalg.DenseVector[Double] = DenseVector(0.30187445256406753, 0.9527732249892069, 0.9677798086386304, 0.38701921309114473, 0.18722377129426993, 0.3116747887324729, 0.994807582669387, 0.09744514946553129, 0.38201991795205203, 0.6629830324912258, 0.6712055323521247, 0.3814761287904629, 0.35476658327361577, 0.602992933836404, 0.993924039437796, 0.7528416486958354, 0.7342922105688068, 0.5123917593936091, 0.25355346567596504, 0.3870857968630168, 0.26890742042447524, 0.17787949311414009, 0.3070960467024657, 0.9859523090145996, 0.5990790745309076, 0.7118611068501155, 0.902958097469494, 0.3043128766940737, 0.48980703504530143, 0.0333563168434714, 0.37827089740375164, 0.5675383658340518, 0.5183706621489148, 0.7667001141382124, 0.7020764844078033, 0.73229078332...

scala> d.min
res812: Double = 0.014406692517867192

scala> d.max
res813: Double = 0.9968213942427224

scala> d.length
res814: Int = 100

scala> val h = breeze.stats.hist(d, 2)
h: breeze.stats.hist.Histogram[Double] = breeze.stats.hist$HistogramImpl@74a38a95

scala> h.binEdges
res815: breeze.linalg.DenseVector[Double] = DenseVector(0.014406692517867192, 0.5056140433802948)

scala> val h = breeze.stats.hist(d, 3)
h: breeze.stats.hist.Histogram[Double] = breeze.stats.hist$HistogramImpl@395fe94e

scala> h.binEdges
res816: breeze.linalg.DenseVector[Double] = DenseVector(0.014406692517867192, 0.3418782597594856, 0.669349827001104, 0.9968213942427224)

scala> breeze.linalg.DenseVector.rangeD(d.min, d.max + (d.max - d.min) / 2, (d.max - d.min) / 2)
res817: breeze.linalg.DenseVector[Double] = DenseVector(0.014406692517867192, 0.5056140433802948)

I thought it must be something to do with an integer multiple of the stepsize hitting a boundary condition, but it seems more complicated than that.

espears4sq commented 6 years ago

Ah, this appears to be a duplicate of #669 and was fixed, but I don't have the ability to use the updated version of breeze with the fix.

So the culprit is actually in this old implementation of rangeD that used floor instead of ceil with the extra step adjustment.