Binning and histogramming: Revisiting the "open on the right" interval question

SimonHeybrock commented 1 month ago

A long time ago we decided that the bin edges provided to algorithms such as bin and hist should exclude the right bin edge. The argument was that this makes all bins consistent and, first and foremost, ensures that we can self-consistently concat the results of binning data in ranges A, B with B, C.

In practice however it is surprising behavior to users that one cannot use the max of the data as the stop in a linspace to create bin edges. Furthermore, this behavior is different from, say, numpy.histogram.

I am thus wondering if the gained self-consistency may be outweighed by the downsides? The number of times users will run into the consistency problem is likely much lower than the number of times they run into the downsides.

It should be noted that without ensuring the aforementioned consistency, a concat followed by a bin or hist can drop data! This is because the algorithm assumes that the existing bin edge to not "lie", but after concat some data can be on the wrong side of B.

nvaytet commented 3 weeks ago

I think including data on the upper edge would make sense, especially since you can use max() to make bin edges.

a concat followed by a bin or hist can drop data!

Not sure I followed everything. Initially, I thought data may be included twice, instead of being dropped...

SimonHeybrock commented 3 weeks ago

a concat followed by a bin or hist can drop data!

Not sure I followed everything. Initially, I thought data may be included twice, instead of being dropped...

The problem is that bin assumes that outer coords do not lie. It may thus assume that data is in the correct bin already, and not move it to another bin. Or it may consider it as "outside" the (current) bin and thus drop it. Or it could move it to the correct bin. See the warning here: https://scipp.github.io/generated/functions/scipp.bin.html.

scipp / scipp

Binning and histogramming: Revisiting the "open on the right" interval question #3566