scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
314 stars 67 forks source link

inaccurate / inconsistent reading of ROOT file #513

Closed cshimmin closed 4 years ago

cshimmin commented 4 years ago

Hi, I'm trying to use uproot to read a ROOT file created by Delphes, and I am seeing some very strange behavior. I've uploaded a small file which demonstrates the issue: test.root (1.2Mb). Delphes uses ExRootAnalysis under the hood, which in turn is writing TClonesArrays of particle-like objects to the root file. I'm using ROOT version 6.20/06 to create this sample, and I'm reading the file with uproot v3.11.7 on python 3.6.7.

When I read this file in using the ROOT command line interpreter, everything looks as expected, but I see highly inconsistent results when I load the file with uproot. The first thing I noticed is that the field fUniqueID is not read properly by uproot. In this example, each particle's fUniqueID happens to be equal to its array index + 1, as can be verified by opening the file in root:

root [1] Delphes->Scan("Particle.fUniqueID")

***********************************
*    Row   * Instance * Particle. *
***********************************
*        0 *        0 *         1 *
*        0 *        1 *         2 *
*        0 *        2 *         3 *
*        0 *        3 *         4 *
*        0 *        4 *         5 *
*        0 *        5 *         6 *
*        0 *        6 *         7 *
*        0 *        7 *         8 *
*        0 *        8 *         9 *
*        0 *        9 *        10 *
[ .. clipped ...]

However, uproot reports 0 for all values in this branch:

f = uproot.open("test.root")
t = f["Delphes"]
(t.array("Particle.fUniqueID") == 0).all().all()

> True

This led me to realize that uproot seems to somehow be confused about how may particles there are in each event. This file has a field called Particle_size, which is an integer that is set for convenience by the tree writer to indicate the length of the corresponding TClonesArray. However, the values read out by root vs. uproot are wildly different:

root [1] Delphes->Scan("Particle_size")

************************
*    Row   * Particle_ *
************************
*        0 *      2519 *
*        1 *      2580 *
*        2 *      1992 *
*        3 *      1848 *
*        4 *      2376 *
*        5 *      1963 *
*        6 *      2300 *
*        7 *      1949 *
*        8 *      2524 *
*        9 *      2685 *
************************

This result is as expected. But when I try to look at the same branch w/ uproot I get:

f = uproot.open("test.root")
t = f["Delphes"]
t.array("Particle_size")

> array([201, 125, 111, 116, 183, 132, 140, 107, 177, 243], dtype=int32)

Digging deeper, I found something even stranger; these incorrect numbers appear to correspond to the length of some, but not all of the split branches. For instance, the Particle.fUniqueID and Particle.PT branches actually do contain far fewer entries than expected:

pt = t.array("Particle.PT")
list(map(len, pt))

> [201, 125, 111, 116, 183, 132, 140, 107, 177, 243]

However, other branches, such as Particle.PID, contain the full number of expected entries, in agreement with root:

pid =t.array("Particle.PID")
lis(map(len, pid))

> [2519, 2580, 1992, 1848, 2376, 1963, 2300, 1949, 2524, 2685]

It gets weirder. While the fUniqueID field is just filled with zeros, the PT values are nonzero:

pt[0,:10]

> array([26.841331 , 25.875156 , 15.073622 , 13.940983 , 13.5541315,
       13.457675 , 10.314984 ,  9.512343 ,  8.793333 ,  8.770421 ],
      dtype=float32)

and they're sorted in descending order!

(pt[0,:1] > pt[0,1:]).all()

> True

Although there are thousands of particles in the actual event (as when read by root), I did some spot checking and it turns out that at least some of these PT values do actually appear in the event, so they are not just bogus numbers:

root [2] Delphes->Scan("Particle.PT", "Particle.PT>8 && Particle.Status==1")

***********************************
*    Row   * Instance * Particle. *
***********************************
*        0 *     1627 * 13.554131 *
*        0 *     1820 * 8.7933330 *
*        0 *     1823 * 10.314984 *
*        0 *     1840 * 26.841331 *
*        0 *     1842 * 13.457675 *
*        0 *     1844 * 8.3317794 *
*        0 *     2052 * 15.073621 *
*        0 *     2053 * 8.7704210 *
*        0 *     2366 * 25.875156 *
*        0 *     2510 * 13.940982 *
*        0 *     2511 * 9.5123434 *

I was not able to identify an obvious commonality between these particles, except that at least first ten (and possibly all, I didn't check) are particles with status code = 1 (i.e. stable particles); however, this subset still contains fewer than the expected number of stable particles. It is also maybe not too shocking that the pt is ordered, as I believe Delphes applies in-place sorting of particle arrays at various points during its processing.

To summarize, when I try to read this file with uproot, we have some branches appearing with fewer entries than expected, and in some cases missing/incorrect values. In other cases, the values are not missing, but appear to be a (random?) subset of the expected values, and appearing in a completely different order.

I don't know a lot about ROOT I/O but I am wondering if this has to do with multiple versions or "histories" of the file buffer appearing in the ROOT file? So that perhaps e.g. the 201 particles in the first event at some point were added, and then later the branch got updated with more data, before the event was finalized? It is interesting that the integer Particle_size agrees with the size of the incomplete branches. I can point you to the line where this gets set in ExRootAnalysis: the integer that backs the *_size branch is incremented as items are added to the TClonesArray, as opposed to being set by inspecting the length just before Tree.Fill(), so this could explain why the "partial" branches agree with the Particle_size branch.

tamasgal commented 4 years ago

I have not debugged the uproot3 code yet but I had a quick look with uproot4 which Jim is currently heavily working on and the good news is that the readout seems to be correct, so this is an uproot3 issue and not a general problem with the understanding of the ROOT layout -- so to say ;)

Here you can see that the fUniqueID is continuous, the sizes are the same as those returned by ROOT and the particle IDs are valid PDG IDs:

In [1]: import uproot4

In [2]: uproot4.__version__
Out[2]: '0.0.16'

In [3]: f = uproot4.open("/Users/tamasgal/Downloads/test.root")

In [4]: f["Delphes/Particle.fUniqueID"].array()
Out[4]: <Array [[1, 2, 3, 4, ... 2683, 2684, 2685]] type='10 * var * uint32'>

In [5]: f["Delphes/Particle_size"].array()
Out[5]: <Array [2519, 2580, 1992, ... 1949, 2524, 2685] type='10 * int32'>

In [6]: f["Delphes/Particle.PID"].array()
Out[6]: <Array [[2212, 2212, 21, ... 22, 211, -211]] type='10 * var * int32'>

So maybe you can simply switch over to uproot4, even if it's not 1.0 yet, but as far as I understood, the release date is this summer.

It's already on PyPI and you also need awkward1 to access those nice arrays:

pip install uproot4 awkward1
cshimmin commented 4 years ago

Thanks for the quick reply, Tamas! I wasn't aware of uproot4.

I can confirm that the above issues are fixed in uproot4. Unfortunately there seem to be some other issues in handling jagged arrays of TRefArrays, that were not an issue in uproot3. In uproot3 these were simply treated as doubly-jagged arrays of IDs, but in uproot4 it looks like there is a TRefArray class that isn't playing nicely in a jagged array.

Without getting too far off topic (should I open a separate issue on the uproot4 tracker?), this is the problem I encountered:

f = uproot4.open("test.root")
refs = f["Delphes/GenJet.Particles"].array()

produces a "CannotBeAwkward" error, which in turn produces a ValueError:

ValueError: cannot produce Awkward Arrays for interpretation AsObjects(Model_TRefArray) because

    TRefArray

instead, try library="np" instead of library="ak"

The error goes away if I try library="np" as suggested:

refs = f["Delphes/GenJet.Particles"].array(library="np")

But now refs will contain just a single TRefArray per event (when there should be several, one for each jet). It appears to only return the first TRefArray:

list(map(len, refs))

> [45, 23, 22, 13, 11, 13, 24, 8, 48, 46]
root [1] Delphes->Scan("GenJet.Particles.GetLast()")

***********************************
*    Row   * Instance * GenJet.Pa *
***********************************
*        0 *        0 *        44 *      <-----
*        0 *        1 *        19 *
*        0 *        2 *        20 *
*        0 *        3 *        30 *
*        0 *        4 *        25 *
*        0 *        5 *        31 *
*        0 *        6 *        25 *
*        1 *        0 *        22 *      <-----
*        1 *        1 *        23 *
*        1 *        2 *        11 *
*        1 *        3 *        19 *
*        1 *        4 *        26 *
*        1 *        5 *        18 *
*        2 *        0 *        21 *      <-----
*        2 *        1 *        20 *
*        2 *        2 *        23 *
*        2 *        3 *        24 *
*        2 *        4 *        18 *
*        3 *        0 *        12 *      <-----
*        3 *        1 *        34 *
*        3 *        2 *        34 *
*        3 *        3 *        32 *
*        4 *        0 *        10 *      <-----
*        4 *        1 *        65 *
*        4 *        2 *        31 *

I understand uproot4 is still under development, so maybe I just have to wait for that version to stabilize... keep up the great work!

jpivarski commented 4 years ago

You're both too quick for me and now my message below if outdated! Uproot 4 is ready for users, though issues like the one with TRefArray are not unexpected—having your example file will help me through them. I can't remember why I concluded that the TRefArray can't be Awkward (it has a string in it, but that's not a fundamental problem), so I'll review that and get your example working in Uproot 4. It's pretty disturbing that all those errors lurked in Uproot 3, but it validates my decision to give it a complete rerwite, rethinking and cleaning up the deserialization code while porting to Awkward 1.

By "ready for users," I mean it's graduated from alpha to beta. One could do a full physics analysis in it, though there are going to be rough edges like this. I'm not in favor of "big bang" releases, since there's a chicken-and-egg problem with needing users to find bugs and needing to eliminate bugs for users to be able to work.


I'll be taking a look at this file to see all the issues first hand. As for the fUniqueID, in going over the interpretation code to translate from Uproot 3 to Uproot 4, I did see that the Uproot 3 code was wrong, but nobody has ever complained about fUniqueID before. So that should at least be different, if not correct on Uproot 4.

I'll find out when I look at the file, but might there be different cycle numbers for this TTree, if it was written and rewritten? If that's the case, you might be looking at different TTrees in your comparisons between ROOT and Uproot, where I've of an older cycle of the other. That's something I'll check, though.

The particle_size is particularly disturbing because those are different numbers by an order of magnitude, and yet not ridiculously big (as would happen in a typical deserialization error. Again, I'll look into it. Thanks for providing a file!

jpivarski commented 4 years ago

The TRefArray thing is fixed in Uproot4's 0.0.16. The reason for excluding Awkward wasn't a good one, so I worked around that.

The original issue in Uproot3 was because two branches have the name "Particles_size", so you were seeing the other one. I only found out that it's possible for two branches to have the same name after writing Uproot3, so that was written assuming that branch names are unique identifiers. That kind of assumption is hard to remove from an existing codebase, but easy to avoid in a new one. Uproot4 was almost ready for it; I just needed to make cache keys unique.

So this should work for you in Uproot4, but I don't have enough time to backport the corrections into Uproot3.

jpivarski commented 4 years ago

Oh, and if you're using Uproot4 (because Uproot3 won't work for you), this link might be helpful:

https://indico.cern.ch/event/882824/timetable/#29-uproot-awkward-arrays-tutor

This was a tutorial I gave on it last week, which includes both an interactive notebook (Binder; GitHub) and a video of the talk itself (YouTube).

cshimmin commented 4 years ago

Wow, thanks for the super quick resolution! I'll test 0.0.16 this week and post an update to confirm.