scikit-hep / uproot3-methods

Pythonic behaviors for non-I/O related ROOT classes.
BSD 3-Clause "New" or "Revised" License
21 stars 28 forks source link

Question on behavior of weights and variances #97

Closed matthewfeickert closed 3 years ago

matthewfeickert commented 3 years ago

From the discussions on Gitter today motivated by https://github.com/matthewfeickert/heputils/issues/24 I think I am confused by the behavior of uproot3-methods treatment of weights and variances. Here is a short example

$ cat requirements.txt 
uproot~=4.0.6
uproot3~=3.14.4
$ pip list | grep "uproot"
uproot            4.0.6
uproot3           3.14.4
uproot3-methods   0.10.0
# issue.py
import numpy as np
import uproot
import uproot3

if __name__ == "__main__":
    bins = np.arange(0, 8)
    counts = np.array([2 ** x for x in range(len(bins[:-1]))])
    with uproot3.recreate("test.root", compression=uproot3.ZLIB(4)) as outfile:
        outfile["data"] = (counts, bins)

    root_file = uproot.open("test.root")
    hist = root_file["data"]

    print(f"hist has weights: {hist.weighted}")
    print(f"hist values: {hist.values()}")
    print("\nvariances are square of uncertainties\n")
    print(f"hist errors: {hist.errors()}")
    print(f"hist variances: {hist.variances()}")
    assert hist.variances().tolist() == np.square(hist.errors()).tolist()
    print("\nbut errors are not sqrt of values\n")

    print(f"expected errors to be: {np.sqrt(hist.values())}")
$ python issue.py 
hist has weights: True
hist values: [ 1  2  4  8 16 32 64]

variances are square of uncertainties

hist errors: [ 1.  2.  4.  8. 16. 32. 64.]
hist variances: [1.000e+00 4.000e+00 1.600e+01 6.400e+01 2.560e+02 1.024e+03 4.096e+03]

but errors are not sqrt of values

expected errors to be: [1.         1.41421356 2.         2.82842712 4.         5.65685425
 8.        ]

As help for .errors gives

Help on method errors in module uproot.behaviors.TH1:

errors(flow=False) method of uproot.dynamic.Model_TH1I_v3 instance
    Args:
        flow (bool): If True, include underflow and overflow bins before and
            after the normal (finite-width) bins.

    Errors (uncertainties) in the :ref:`uproot.behaviors.TH1.Histogram.values`
    as a 1, 2, or 3 dimensional ``numpy.ndarray`` of ``numpy.float64``.

    If ``fSumw2`` (weights) are available, they will be used in the
    calculation of the errors. If not, errors are assumed to be the square
    root of the values.

    Setting ``flow=True`` increases the length of each dimension by two.

I think(?) this is due to the behavior of

https://github.com/scikit-hep/uproot3-methods/blob/b722ee6b872a32587f61b9d97fc3bcb5d3b2353a/uproot3_methods/classes/TH1.py#L347-L353

where it seems that the weights are taken to be the values of the NumPy array — this is not what I would have expected.

What is the proper way to create an uproot3 TH1 with Poisson uncertainties?

jpivarski commented 3 years ago

The confusing thing about this (I believe this is what you're hitting) is that the TH1 is a NumPy array. That array corresponds to the values (bin contents) of the histogram. It might also have a _fSumw2 attribute attached to it, which is another NumPy array, that corresponds to the errors.

The issue is that ROOT's TH1 is a subclass of TArray, and in Uproot 3 I made TArray a NumPy array subclass. This confusion motivated a less direct modeling of C++ classes in Uproot 4, in which Python objects representing C++ objects contain their superclasses instead of inheriting from them.

Not only did this introduce the strange semantics in which the histogram values are not an attribute; they are the object, but it also made other subclasses of TH1, such as TProfile, have incorrect values and errors (because we didn't explicitly implement it, and it inherited TH1's methods).

At least, that's what I think you're running into above.

henryiii commented 3 years ago

The problem is this line:

out._fSumw2 = valuesarray ** 2

Which sets fSumw2 assuming that the values in the ndarray were made with a single fill with a large weight, rather than out._fSumw2 = valuesarray, which would assume all fills were made with weight=1 and would match ROOT's behavior when adding weights. (I don't see an obvious path for if the ndarray already has an fSumw2 attribute attached, but didn't check closely).

matthewfeickert commented 3 years ago

The confusing thing about this (I believe this is what you're hitting) is that the TH1 is a NumPy array. That array corresponds to the values (bin contents) of the histogram. It might also have a _fSumw2 attribute attached to it, which is another NumPy array, that corresponds to the errors.

Hm, it seems like this will always happen though given that it is present here using only uproot3

# uproot3_only.py
import numpy as np
import uproot3

if __name__ == "__main__":
    bins = np.arange(0, 8)
    counts = np.array([2 ** x for x in range(len(bins[:-1]))])
    with uproot3.recreate("test.root", compression=uproot3.ZLIB(4)) as outfile:
        outfile["data"] = (counts, bins)

    root_file = uproot3.open("test.root")
    hist = root_file["data"]

    print(f"hist._fSumw2 ({type(hist._fSumw2)}): {hist._fSumw2}")
    print(f"hist values ({type(hist.values)}): {hist.values}")
    print(f"hist variances ({type(hist.variances)}): {hist.variances}")
$ python uproot3_only.py 
hist._fSumw2 (<class 'uproot3.rootio.TArrayD'>): [0.0, 1.0, 4.0, 16.0, 64.0, 256.0, 1024.0, 4096.0, 0.0]
hist values (<class 'numpy.ndarray'>): [ 1  2  4  8 16 32 64]
hist variances (<class 'numpy.ndarray'>): [1.000e+00 4.000e+00 1.600e+01 6.400e+01 2.560e+02 1.024e+03 4.096e+03]

So won't it always have _fSumw2 from

https://github.com/scikit-hep/uproot3-methods/blob/b722ee6b872a32587f61b9d97fc3bcb5d3b2353a/uproot3_methods/classes/TH1.py#L353

and then the variances are set from it? :/ I may be missing something obvious, but from what Henry has pointed out it seems like this will always be the case if you're writing a histogram from numpy arrays.

kratsg commented 3 years ago

@henryiii to be clear, is this just boiling down to whether one needs to do sum(values)**2 versus sum(x**2 for x in values) ?

matthewfeickert commented 3 years ago

I think I now understand that this is also highlighting a difference in how ROOT and uproot3 write files given that

# example_pyroot.py
import uproot
from ROOT import TH1F, TFile, kFALSE, kTRUE

def write_root_file(filename):
    n_bins = 7
    hist = TH1F("data", "", n_bins, 0, n_bins)
    hist.SetStats(kFALSE)

    for bin_idx in range(0, n_bins):
        hist.SetBinContent(bin_idx + 1, 2 ** bin_idx)

    hist.Sumw2(kTRUE)

    write_file = TFile(filename, "recreate")
    hist.Write()
    write_file.Close()

if __name__ == "__main__":
    write_root_file("pyroot_output.root")

    root_file = uproot.open("pyroot_output.root")
    hist = root_file["data"]

    print(f"hist has weights: {hist.weighted}")
    print(f"hist values: {hist.values()}")
    print(f"hist errors: {hist.errors()}")
    print(f"hist variances: {hist.variances()}")

gives

$ python example_pyroot.py 
hist has weights: True
hist values: [ 1.  2.  4.  8. 16. 32. 64.]
hist errors: [1.         1.41421356 2.         2.82842712 4.         5.65685425
 8.        ]
hist variances: [ 1.  2.  4.  8. 16. 32. 64.]

I need to think on this more, as I obviously don't understand the problem as well as everyone else, but would changing the behavior of TH1's from_numpy be enough and be correct?

henryiii commented 3 years ago

The call hist.Sumw2(kTRUE) sets the variances equal to the entries. Then the error is the square root of the entries. Uproot is setting the variance equal to the square of the entries, making the error equal to the entries. I think it's a simple bug due to the fact "sumw2" has a square in it, so it looks like it should be set with a square.

If you have a bin with 5 entries, the usual thing to do is assume that it was filled 5 times, so sumw2 = 1² + 1² + 1² + 1² + 1² = 5. Then the error is √5. Usually you do not assume that this was filled once with weight 5 (which would be with error 5).

henryiii commented 3 years ago

I think the above answers @kratsg's question - yes, when it says "sumw2", it's each weight that's squared, not the sum of them, so setting this with valuesarray ** 2 is only correct if there was only one weight in the sum.

matthewfeickert commented 3 years ago

@jpivarski If you're happy with the solution Henry poses above I have a branch on uproot3-methods that implements this that I'll make a PR from.

jpivarski commented 3 years ago

Sorry—just catching up. Yes: this sounds like a good solution, though I'll understand it better when I see it as a diff in a PR. Thanks!