scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License
233 stars 73 forks source link

Sometimes inconsistent values in hist conversion from TProfile read via uproot #1175

Open raymondEhlers opened 5 months ago

raymondEhlers commented 5 months ago

When I convert TProfile objects to hist that I've read from a ROOT file with uproot, I sometimes observe inconsistent values. For some input files, they're consistent, but for others, they're inconsistent - It depends on the input file. Unfortunately, I haven't figured out how to reproduce this in a simple way, so I've just attached two files which are produced in the same way and exhibit this behavior: (failing.root.txt (11MB), passing.root.txt (< 1 MB)). Both appear to be valid output files beyond this issue. The code below illustrates the behavior:

def test_uproot_profile_consistency_with_hist() -> None:
    import hist
    import numpy as np
    import uproot

    input_file = Path("failing.root")

    # Open and extract with uproot
    with uproot.open(input_file) as f:
        # Cast to a base class. Actual class is here: https://github.com/alisw/AliPhysics/blob/master/PWG/EMCAL/EMCALbase/AliEmcalList.h
        output_list = f["AliAnalysisTaskTrackSkim_pythia"].bases[0]
        x_sec_uproot = next(h for h in output_list if h.name == "fHistXsection")
        x_sec_uproot_hist = x_sec_uproot.to_hist()

    # Cross check with ROOT
    ROOT = pytest.importorskip("ROOT")
    with ROOT.TFile.Open(str(input_file), "READ") as f_ROOT:
        output_list = f_ROOT.Get("AliAnalysisTaskTrackSkim_pythia")
        # Cast to a base class. Actual class is here: https://github.com/alisw/AliPhysics/blob/master/PWG/EMCAL/EMCALbase/AliEmcalList.h
        output_list = ROOT.bind_object(ROOT.addressof(output_list), "TList")

        x_sec_temp = output_list.FindObject("fHistXsection")
        x_sec_temp.SetDirectory(0)
    x_sec_ROOT_values = np.array([x_sec_temp.GetBinContent(i) for i in range(1, x_sec_temp.GetNbinsX() + 1)], dtype=np.float64)

    print(f"Uproot: {uproot.__version__}, hist: {hist.__version__}")

    # The standard uproot values before conversion appear correct
    np.testing.assert_allclose(x_sec_uproot.values(), x_sec_ROOT_values)
    # Fails here for `failing.root`, but fine for `passing.root`
    np.testing.assert_allclose(x_sec_uproot.values(), x_sec_uproot_hist.values())

It strikes as odd that it depends on the particular file when they were generated the same way (with different inputs). I wonder if it's due to some issue with accessing the base class/the cast that I do. It's required because the profile is stored in a TList derived class that's part of our experimental software stack (note that I don't care about any of the additional info stored in the derived class. It's available here if helpful). I'm concerned this is perhaps corrupting an expected memory layout or otherwise causing an issue. Since this class is unfortunately driven by an experiment software stack constraint, it's difficult for me to avoid. If there's a better way for me to workaround this, I'd be happy to use that instead.

Bottom line: I suspect there might be an uproot issue here, but I'm concerned that it's instead due to an experimental software stack quirk. Help here is greatly appreciated - thanks!

>>> import uproot, hist
... print(f"Uproot: {uproot.__version__}, hist: {hist.__version__}")
Uproot: 5.3.1, hist: 2.7.2
jpivarski commented 5 months ago

You were involved in #908, so you know about #1000 (and #1148 is someone else running into it): I'm not convinced that ROOT's TProfile and boost-histogram's WeightedMean Storage are the same thing.

It might be more helpful—you might see a difference between failing.root and passing.root—if you look at the raw quantities on that are stored in the file, the members of the Model_TProfile_v*.

The TH1D base class, which has values (direct bin contents):

https://github.com/scikit-hep/uproot5/blob/21735bf36a3d8b8deb5d60025a11b23c7a2b0424/src/uproot/models/TH.py#L3881-L3891

fBinEntries, which is another per-bin quantity:

https://github.com/scikit-hep/uproot5/blob/21735bf36a3d8b8deb5d60025a11b23c7a2b0424/src/uproot/models/TH.py#L3892-L3894

and fBinSumw2:

https://github.com/scikit-hep/uproot5/blob/21735bf36a3d8b8deb5d60025a11b23c7a2b0424/src/uproot/models/TH.py#L3902-L3904

The way that the values and errors of the profile are computed from these raw quantities is not at all trivial:

https://github.com/scikit-hep/uproot5/blob/21735bf36a3d8b8deb5d60025a11b23c7a2b0424/src/uproot/behaviors/TProfile.py#L63-L129

raymondEhlers commented 5 months ago

Thanks for the pointers Jim! I had missed #1148 - apologies.

After #1000, I did some validation with multiple files and it seemed to be working on individual TProfiles, so I was a bit surprised that some files work while others don't. However, I now see your point that the calculation is quite involved, which may explain the differences. I'll plan to dig into this further when I have some free moments