scikit-hep / hist

Histogramming for analysis powered by boost-histogram
https://hist.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[BUG] counts() is modified after operating on histogram #531

Closed bfonta closed 1 year ago

bfonta commented 1 year ago

Describe the bug

When applying an operation to a histogram, I would expect .counts() to always return the "number of entries in a bin" (ref).

Steps to reproduce

Using Python 3.11.4 and hist 2.7.1:

import hist
from hist import Hist

h = Hist(hist.axis.Regular(10, -5, 5, name="x"))
h.fill([0.5, 1.2, 1.5, 2.5])
print(h.counts())
print(h.values())
h = h * 2
print()
print(h.counts())
print(h.values())

which outputs:

[0. 0. 0. 0. 0. 1. 2. 1. 0. 0.]
[0. 0. 0. 0. 0. 1. 2. 1. 0. 0.]

[0. 0. 0. 0. 0. 2. 4. 2. 0. 0.]
[0. 0. 0. 0. 0. 2. 4. 2. 0. 0.]

while I would naively expect:

[0. 0. 0. 0. 0. 1. 2. 1. 0. 0.]
[0. 0. 0. 0. 0. 1. 2. 1. 0. 0.]

[0. 0. 0. 0. 0. 1. 2. 1. 0. 0.]
[0. 0. 0. 0. 0. 2. 4. 2. 0. 0.]

henryiii commented 1 year ago

You have a simple storage, so counts and values are exactly the same. You have to use a Mean or WeightedMean storage to have separate counts and values. Those don't actually support multiplication, though: you can manually build the array and set the view with it, but the * operator isn't supported.

bfonta commented 1 year ago

Let us assume I access a histogram that I did not create myself, using a simple storage, and which I know has been scaled. Wouldn't it be useful to know with how many entries the histogram had originally been filled? Given that counts() and values() are already defined, this information could potentially be added without significant changes (as far as I can think of). To me (but I may be wrong), the output of counts() as it is now can be misleading. I suppose one could alternatively drop the support for the * operator, but it seems a nice feature to have.

henryiii commented 1 year ago

You don't have the information with a simple storage. Simple storage means one value is stored per bin. You can't keep separate counts and values without keeping more information. If you keep more information by using more values, then you should use one of the complex storages.
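
The information loss can be sketched without any library at all. The snippet below models a simple storage as a plain list with one float per bin; fill and scale are hypothetical helpers written for this illustration and are not part of hist or boost-histogram.

```python
# Toy model of a simple storage: one float per bin, nothing else.
def fill(bins, index, weight=1.0):
    """Add one entry (optionally weighted) to the given bin."""
    bins[index] += weight


def scale(bins, factor):
    """Return a new list with every bin multiplied by factor."""
    return [b * factor for b in bins]


# History A: a single entry, then the histogram is scaled by 2.
a = [0.0, 0.0, 0.0]
fill(a, 1)
a = scale(a, 2)

# History B: two entries, never scaled.
b = [0.0, 0.0, 0.0]
fill(b, 1)
fill(b, 1)

# Both histories end up with identical stored numbers.
print(a == b)
```

Since the two histories leave the same single number in the bin, nothing stored there can say whether it came from one entry or two.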

You can tell whether something has been scaled or modified in any way, though: if a simple storage's .variances() returns None, then the histogram was scaled, had at least one value set, was loaded from data, or was filled with a weight. .variances() only returns the values array if the histogram has only been filled with simple (unweighted) values and has not been modified.

FYI, counts is mostly there because values and counts are different things for a Mean or WeightedMean storage. There, "values" refers to the mean of the input sample values, while counts is related to the fill.