scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
315 stars 67 forks source link

Unable to read ntuple with std::vector after doing 'hadd' #38

Closed jiafulow closed 6 years ago

jiafulow commented 6 years ago

I ran into a strange case where I could not read a ntuple with std::vector after doing 'hadd'. I created a test ntuple to reproduce the issue. It contains one event and has several branches:

- v_int16 : std::vector<int16_t>
- v_int32 : std::vector<int32_t>
- v_int64 : std::vector<int64_t>
- v_uint16: std::vector<uint16_t>
- v_uint32: std::vector<uint32_t>
- v_uint64: std::vector<uint64_t>
- v_bool  : std::vector<bool>
- v_float : std::vector<float>
- v_double: std::vector<double>

In uproot 2.5.10, I tried the following and it worked:

>>> f = uproot.open('stlvector.root')
>>> print f.keys()
['ntupler;1']
>>> tree = f['ntupler/tree']
>>> print tree.keys()
['v_int16', 'v_int32', 'v_int64', 'v_uint16', 'v_uint32', 'v_uint64', 'v_bool', 'v_float', 'v_double']
>>> print tree.array('v_int16')
[[1 2 3]]
>>> print tree.array('v_int32')
[[1 2 3]]
>>> print tree.array('v_int64')
[[1 2 3]]
>>> print tree.array('v_uint16')
[[1 2 3]]
>>> print tree.array('v_uint32')
[[1 2 3]]
>>> print tree.array('v_uint64')
[[1 2 3]]
>>> print tree.array('v_bool')
[[False  True]]
>>> print tree.array('v_float')
[[ 999. -999.]]
>>> print tree.array('v_double')
[[ 999. -999.]]

However, if I use hadd [1], I got errors (like [2] for std::vector<int16>) for doing the same thing as above:

>>> f = uproot.open('stlvector_after_hadd.root')
>>> print f.keys()
['ntupler;1']
>>> tree = f['ntupler/tree']
>>> print tree.keys()
['v_int16', 'v_int32', 'v_int64', 'v_uint16', 'v_uint32', 'v_uint64', 'v_bool', 'v_float', 'v_double']
>>> print tree.array('v_int16')
ValueError: cannot interpret branch 'v_int16' as a Python type
>>> print tree.array('v_int32')
ValueError: cannot interpret branch 'v_int32' as a Python type
>>> print tree.array('v_int64')
ValueError: cannot interpret branch 'v_int64' as a Python type
>>> print tree.array('v_uint16')
ValueError: cannot interpret branch 'v_uint16' as a Python type
>>> print tree.array('v_uint32')
ValueError: cannot interpret branch 'v_uint32' as a Python type
>>> print tree.array('v_uint64')
ValueError: cannot interpret branch 'v_uint64' as a Python type
>>> print tree.array('v_bool')
ValueError: cannot interpret branch 'v_bool' as a Python type
>>> print tree.array('v_float')
ValueError: cannot interpret branch 'v_float' as a Python type
>>> print tree.array('v_double')
[[ 999. -999.], [ 999. -999.], [ 999. -999.]]

I also tried

>>> print uproot.interpret(tree['v_int16'])
None
>>> print tree['v_int16']._streamer
None

So it appears that the Streamer info is missing somehow after doing 'hadd'? It's strange... I attach the root files 'stlvector.root' and 'stlvector_after_hadd.root'. I'd appreciate it if you could look into this. Thank you!

Best regards, Jia Fu

[1]

hadd -f stlvector_after_hadd.root  stlvector.root stlvector.root stlvector.root

[2]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/jiafu/muon_work/venv/lib/python2.7/site-packages/uproot/tree.py", line 347, in array
    return self.get(branch).array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=blocking)
  File "/tmp/jiafu/muon_work/venv/lib/python2.7/site-packages/uproot/tree.py", line 933, in array
    interpretation = self._normalize_interpretation(interpretation)
  File "/tmp/jiafu/muon_work/venv/lib/python2.7/site-packages/uproot/tree.py", line 717, in _normalize_interpretation
    raise ValueError("cannot interpret branch {0} as a Python type".format(repr(self.name)))
ValueError: cannot interpret branch 'v_int16' as a Python type

[3]

# My setup
virtualenv venv
source venv/bin/activate
pip install --upgrade pip
pip install uproot
jiafulow commented 6 years ago

The root files (they are zipped because GitHub doesn't allow .root file type):

stlvector.root.zip stlvector_after_hadd.root.zip

jpivarski commented 6 years ago

Thanks! I'll try to solve this before the end of the day tomorrow.

jiafulow commented 6 years ago

It's not urgent. But thank you in advance!

jpivarski commented 6 years ago

Indeed, hadd does drop all STL streamers except for vector<double> and vector<string> (which are built into ROOT, I believe— these are two that you never have to manually build dictionaries for, unlike the others). I'd call that a bug in hadd and maybe it should be raised on the ROOT forum.

Nevertheless, files with this issue exist and we need to be able to read them. We can see the lack of most STL streamers in PyROOT:

>>> import ROOT
>>> tfile = ROOT.TFile("stlvector_after_hadd.root")
>>> tfile.ShowStreamerInfo()
... # (no STL streamers except vector<double> and vector<string>; compare to the other file)

as well as in uproot:

>>> import uproot
>>> f = uproot.open("stlvector_after_hadd.root")
>>> f.showstreamers()
... # (same issue)

The difference is also apparent when you "show" the branch info (new, undocumented method):

>>> t = f["ntupler/tree"]
>>> t.show()
v_int16                    (no streamer)              None
v_int32                    (no streamer)              None
v_int64                    (no streamer)              None
v_uint16                   (no streamer)              None
v_uint32                   (no streamer)              None
v_uint64                   (no streamer)              None
v_bool                     (no streamer)              None
v_float                    (no streamer)              None
v_double                   TStreamerSTL               asstlvector(asdtype('>f8'))

Before this bug-fix, the algorithm for assigning interpretations for branches depended on having an associated streamer, and after hadd, these branches simply don't. (I don't know how to check that in ROOT as none of the method names look like they'll tell me the branch-streamer association.)

Somehow, however, ROOT is still able to interpret these branches:

>>> ttree = tfile.Get("ntupler/tree")     # (back in PyROOT now)
>>> for x in t:
...   print list(x.v_int16)
...
[1, 2, 3]
[1, 2, 3]
[1, 2, 3]

So I searched through 100% of the data available in the file associated with this branch, and the only type-indication that ROOT might be using is the fact that the TBranch.fClassName is "vector<short>". I wasn't willing to use this string at first because if a class is split among many branches, all of the branches would have the name of the top-level class name in this field, rather than the specific data they contain (e.g. they'd all say "Event" or something). To get your TTree to work, which is a requirement because it works in ROOT, I added interpretation rules

# desperation cases, usually because streamer info was dropped by hadd
elif branch.fClassName == b"vector<bool>":
    return asstlvector(asdtype("bool"))
elif branch.fClassName == b"vector<char>":
    return asstlvector(asdtype("i1"))
elif branch.fClassName == b"vector<unsigned char>":
    return asstlvector(asdtype("u1"))
elif branch.fClassName == b"vector<short>":
    return asstlvector(asdtype("i2"))
elif branch.fClassName == b"vector<unsigned short>":
    return asstlvector(asdtype("u2"))
elif branch.fClassName == b"vector<int>":
    return asstlvector(asdtype("i4"))
elif branch.fClassName == b"vector<unsigned int>":
    return asstlvector(asdtype("u4"))
elif branch.fClassName == b"vector<long>":
    return asstlvector(asdtype("i8"))
elif branch.fClassName == b"vector<unsigned long>":
    return asstlvector(asdtype("u8"))
elif branch.fClassName == b"vector<float>":
    return asstlvector(asdtype("f4"))
elif branch.fClassName == b"vector<double>":
    return asstlvector(asdtype("f8"))

but for the reason given above, they should only work if the branch is manually declared to have type vector<short>, etc., not if the branches are created by splitting a class that contains a vector<short>.

Ordinarily, I'd be against adding code that makes a codebase more complex while only stamping out an individual case, moreover, a case generated by another bug elsewhere. (I can't believe that behavior in hadd is intentional.) However, this one is okay. The interpretation rules are supposed to be an ever-growing list, borne out by examples of ROOT files found in the wild, and besides, they're only defaults that can be manually overridden. Moreover, the effect of this hack/fix is not hidden: it can be seen in tree.show():

>>> t.show()
v_int16                    (no streamer)              asstlvector(asdtype('>i2'))
v_int32                    (no streamer)              asstlvector(asdtype('>i4'))
v_int64                    (no streamer)              asstlvector(asdtype('>i8'))
v_uint16                   (no streamer)              asstlvector(asdtype('>u2'))
v_uint32                   (no streamer)              asstlvector(asdtype('>u4'))
v_uint64                   (no streamer)              asstlvector(asdtype('>u8'))
v_bool                     (no streamer)              asstlvector(asdtype('bool'))
v_float                    (no streamer)              asstlvector(asdtype('>f4'))
v_double                   TStreamerSTL               asstlvector(asdtype('>f8'))

They don't claim (wrongly) to have a streamer, but they nevertheless have an interpretation (right column) for reasons unrelated to streamers. It's something that I'll be able to ask users to print out to diagnose future bugs.

jpivarski commented 6 years ago

The fix is released in GitHub and PyPI (pip install) as 2.5.11.

jiafulow commented 6 years ago

Thanks for the detailed explanations and fixing it so quickly! :open_mouth:

I can confirm uproot 2.5.11 is able to read my analysis ntuple (after hadd) correctly. And it's reeaaally fast compared to rootpy, which is what I have been using.

It does seems like hadd is optimizing away the streamer info for other types of std::vector , because they are not needed by ROOT. In any case, I'm happy with your solution.

jpivarski commented 6 years ago

I'm glad it's working for you! Large performance improvements with respect to PyROOT or rootpy are expected, so that's good (I'd worry if the difference wasn't evident).

Labeling something as a bug or a feature is subjective, but I'd say this behavior in hadd is a bug. Since the streamer info is small compared to and doesn't grow with event data, dropping parts of this information is not really an optimization, especially since it must be ambiguous in the case of vectors in a split class (what I described above).

In fact, hadd has a difficult job: it has to merge two potentially different sets of streamer infos (no duplicates) and putting them into mutual dependency order. It looks like they got that wrong when STL collections are concerned.

jpivarski commented 6 years ago

Oh, and since you're working with STL vectors, consider installing Numba if you haven't already. When we read STL vectors, we have to touch each event individually (to remove a header), which is done in a Python for loop if you don't have Numba and in compiled code if you do. For large ntuples (where the time spent processing the ntuple exceeds the half second or so spent compiling the code), you should see an additional speedup.

But if it's fast enough and installing Numba is a bother, don't bother.

jiafulow commented 6 years ago

Ok, I'll make a bug report regarding hadd to the ROOT team and see what they would say. But I won't insist that they fix it.

Regarding Numba, what do I have to do to activate it? Do I simply put import numba at the beginning and that's it? I'm doing this on the LPC machine, and it seems Numba has already been installed.

>>> import numba
>>> print numba.__version__
0.33.0

My next question is off-topic. But I'm trying to port my rootpy script to uproot. I'm analyzing a list of hits, and each hit has its variables like endcap, station, ring. phi, theta, etc. The way I store them in my ntuple is the following:

- vh_endcap    : std::vector<short>
- vh_station   : std::vector<short>
- vh_ring      : std::vector<short>
- vh_sim_phi   : std::vector<float>
- vh_sim_theta : std::vector<float>
etc

In the example you gave at LPC about nested structures, you did the following to analyze a list of electrons:

    for event in events:
        mht_px = 0.0
        mht_py = 0.0
        mht_pz = 0.0
        for electron in event.Electron:
            mht_px += electron.pt * cosh(electron.eta) * sin(electron.phi)
            mht_py += electron.pt * cosh(electron.eta) * cos(electron.phi)
            mht_pz += electron.pt * sinh(electron.eta)

I would like to do the same, but unfortunately I'm using std::vector instead of TClonesArray, so I don't know how to "zip" and loop through all the hit variables together (unless I explicitly call zip() on all of my std::vector's).

What is the closest thing I can do? In rootpy, I use this nice function tree.define_collection which allows me to do:

tree.define_collection(name='hits', prefix='vh_', size='vh_size')

for evt in tree:
  for hit in evt.hits:
    print hit.endcap, hit.station, hit.ring, hit.sim_phi, hit.sim_theta

I would appreciate any suggestion, thanks!

jpivarski commented 6 years ago

Regarding Numba, there's no explicit activation. If import numba does not result in an ImportError, then uproot is already using it to accelerate the branch-reading. So that's contributing to the speedup you already see.

Actually, it's CMSSW that's bundled with Numba; it's not installed by default on CMSLPC.

jpivarski commented 6 years ago

The nested structures thing wasn't a difference between TClonesArray and vector. Both of these types of branches are interpreted as JaggedArrays, so they behave exactly the same way.

The nested structures talk was actually using two codebases together: uproot to read ROOT data and oamap (object-array mapping). Since then, I've rewritten uproot (so that you can use it with vector now) and I'm in the process of rewriting oamap as well. There isn't a combination of versions of the two packages that (a) interface with one another and (b) give you the vector that you want.

rootpy's define_collection function looks nice; I may implement something similar for uproot (but without modifying the tree object). For now, you'd have to create a big zip of them all.

However, that doesn't have to be a chore: it looks like you want to combine all the branches that start with "vh_". Okay, there's at least one difference between TClonesArray and vector: a collection of vector branches do not have an external size branch like "vh_size", since they can in principle have different sizes. In fact, you didn't have to make a TCloneArray to get branches with a common size branch: you can make them like

tree->Branch("vh_size", &vh_size, "vh_size/I");
tree->Branch("vh_endcap", &vh_endcap, "vh_endcap[vh_size]/F");
// etc...

But that's a digression. You could get a bunch of branch names starting with "vh_" with something like

hitnames = tree.allkeys(filtername=lambda x: x.startswith("vh_"))

And then extract only these arrays with

arrays = tree.arrays(hitnames, outputtype=tuple)

where the outputtype=tuple ensures that arrays is a tuple in the same order as hitnames. Then to zip over all of them, do

zip(*arrays)

which effectively makes each array in arrays an argument in zip. Like this:

>>> for event in zip(*arrays):
...   print "new event"
...   for name, data in zip(hitnames, event):
...     print name, data
... 
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False  True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False  True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False  True]
v_float [ 999. -999.]
v_double [ 999. -999.]

In the future, let's move these discussions to the uproot users Google Group.

jpivarski commented 6 years ago

Later, when you need more speed (because you've got your process set up— all exploration done), you'll probably want to put this loop inside of a compiled Numba function, which might complain about the use of zip(*arrays). At that point, you'll have to switch it to longhand, but one-line Python commands can spit out the Python code you'd want to copy-paste into such a script.

jiafulow commented 6 years ago

Thanks a lot for answering all my questions. I'll use the Google Group in the future.

I thought the fact that you could loop over for electron in event.Electron and access electron.pt, electron.eta, etc was because of TClonesArray. But apparently it was something else (oamap). I think that would be a really nice feature to have.

Meanwhile, the snippet that you posted will do the job. That's much better than me writing a really long zip statement. So, thanks again for all your help!