Closed jiafulow closed 6 years ago
The root files (they are zipped because GitHub doesn't allow .root file type):
Thanks! I'll try to solve this before the end of the day tomorrow.
It's not urgent. But thank you in advance!
Indeed, hadd
does drop all STL streamers except for vector<double>
and vector<string>
(which are built into ROOT, I believe— these are two that you never have to manually build dictionaries for, unlike the others). I'd call that a bug in hadd
and maybe it should be raised on the ROOT forum.
Nevertheless, files with this issue exist and we need to be able to read them. We can see the lack of most STL streamers in PyROOT:
>>> import ROOT
>>> tfile = ROOT.TFile("stlvector_after_hadd.root")
>>> tfile.ShowStreamerInfo()
... # (no STL streamers except vector<double> and vector<string>; compare to the other file)
as well as in uproot:
>>> import uproot
>>> f = uproot.open("stlvector_after_hadd.root")
>>> f.showstreamers()
... # (same issue)
The difference is also apparent when you "show" the branch info (new, undocumented method):
>>> t = f["ntupler/tree"]
>>> t.show()
v_int16 (no streamer) None
v_int32 (no streamer) None
v_int64 (no streamer) None
v_uint16 (no streamer) None
v_uint32 (no streamer) None
v_uint64 (no streamer) None
v_bool (no streamer) None
v_float (no streamer) None
v_double TStreamerSTL asstlvector(asdtype('>f8'))
Before this bug-fix, the algorithm for assigning interpretations for branches depended on having an associated streamer, and after hadd
, these branches simply don't. (I don't know how to check that in ROOT as none of the method names look like they'll tell me the branch-streamer association.)
Somehow, however, ROOT is still able to interpret these branches:
>>> ttree = tfile.Get("ntupler/tree") # (back in PyROOT now)
>>> for x in t:
... print list(x.v_int16)
...
[1, 2, 3]
[1, 2, 3]
[1, 2, 3]
So I searched through 100% of the data available in the file associated with this branch, and the only type-indication that ROOT might be using is the fact that the TBranch.fClassName
is "vector<short>"
. I wasn't willing to use this string at first because if a class is split among many branches, all of the branches would have the name of the top-level class name in this field, rather than the specific data they contain (e.g. they'd all say "Event"
or something). To get your TTree to work, which is a requirement because it works in ROOT, I added interpretation rules
# desperation cases, usually because streamer info was dropped by hadd
elif branch.fClassName == b"vector<bool>":
return asstlvector(asdtype("bool"))
elif branch.fClassName == b"vector<char>":
return asstlvector(asdtype("i1"))
elif branch.fClassName == b"vector<unsigned char>":
return asstlvector(asdtype("u1"))
elif branch.fClassName == b"vector<short>":
return asstlvector(asdtype("i2"))
elif branch.fClassName == b"vector<unsigned short>":
return asstlvector(asdtype("u2"))
elif branch.fClassName == b"vector<int>":
return asstlvector(asdtype("i4"))
elif branch.fClassName == b"vector<unsigned int>":
return asstlvector(asdtype("u4"))
elif branch.fClassName == b"vector<long>":
return asstlvector(asdtype("i8"))
elif branch.fClassName == b"vector<unsigned long>":
return asstlvector(asdtype("u8"))
elif branch.fClassName == b"vector<float>":
return asstlvector(asdtype("f4"))
elif branch.fClassName == b"vector<double>":
return asstlvector(asdtype("f8"))
but for the reason given above, they should only work if the branch is manually declared to have type vector<short>
, etc., not if the branches are created by splitting a class that contains a vector<short>
.
Ordinarily, I'd be against adding code that makes a codebase more complex while only stamping out an individual case, moreover, a case generated by another bug elsewhere. (I can't believe that behavior in hadd
is intentional.) However, this one is okay. The interpretation rules are supposed to be an ever-growing list, borne out by examples of ROOT files found in the wild, and besides, they're only defaults that can be manually overridden. Moreover, the effect of this hack/fix is not hidden: it can be seen in tree.show()
:
>>> t.show()
v_int16 (no streamer) asstlvector(asdtype('>i2'))
v_int32 (no streamer) asstlvector(asdtype('>i4'))
v_int64 (no streamer) asstlvector(asdtype('>i8'))
v_uint16 (no streamer) asstlvector(asdtype('>u2'))
v_uint32 (no streamer) asstlvector(asdtype('>u4'))
v_uint64 (no streamer) asstlvector(asdtype('>u8'))
v_bool (no streamer) asstlvector(asdtype('bool'))
v_float (no streamer) asstlvector(asdtype('>f4'))
v_double TStreamerSTL asstlvector(asdtype('>f8'))
They don't claim (wrongly) to have a streamer, but they nevertheless have an interpretation (right column) for reasons unrelated to streamers. It's something that I'll be able to ask users to print out to diagnose future bugs.
The fix is released in GitHub and PyPI (pip install
) as 2.5.11.
Thanks for the detailed explanations and fixing it so quickly! :open_mouth:
I can confirm uproot 2.5.11 is able to read my analysis ntuple (after hadd) correctly. And it's reeaaally fast compared to rootpy, which is what I have been using.
It does seems like hadd is optimizing away the streamer info for other types of std::vector , because they are not needed by ROOT. In any case, I'm happy with your solution.
I'm glad it's working for you! Large performance improvements with respect to PyROOT or rootpy are expected, so that's good (I'd worry if the difference wasn't evident).
Labeling something as a bug or a feature is subjective, but I'd say this behavior in hadd
is a bug. Since the streamer info is small compared to and doesn't grow with event data, dropping parts of this information is not really an optimization, especially since it must be ambiguous in the case of vectors in a split class (what I described above).
In fact, hadd
has a difficult job: it has to merge two potentially different sets of streamer infos (no duplicates) and putting them into mutual dependency order. It looks like they got that wrong when STL collections are concerned.
Oh, and since you're working with STL vectors, consider installing Numba if you haven't already. When we read STL vectors, we have to touch each event individually (to remove a header), which is done in a Python for loop if you don't have Numba and in compiled code if you do. For large ntuples (where the time spent processing the ntuple exceeds the half second or so spent compiling the code), you should see an additional speedup.
But if it's fast enough and installing Numba is a bother, don't bother.
Ok, I'll make a bug report regarding hadd to the ROOT team and see what they would say. But I won't insist that they fix it.
Regarding Numba, what do I have to do to activate it? Do I simply put import numba
at the beginning and that's it? I'm doing this on the LPC machine, and it seems Numba has already been installed.
>>> import numba
>>> print numba.__version__
0.33.0
My next question is off-topic. But I'm trying to port my rootpy script to uproot. I'm analyzing a list of hits, and each hit has its variables like endcap, station, ring. phi, theta, etc. The way I store them in my ntuple is the following:
- vh_endcap : std::vector<short>
- vh_station : std::vector<short>
- vh_ring : std::vector<short>
- vh_sim_phi : std::vector<float>
- vh_sim_theta : std::vector<float>
etc
In the example you gave at LPC about nested structures, you did the following to analyze a list of electrons:
for event in events:
mht_px = 0.0
mht_py = 0.0
mht_pz = 0.0
for electron in event.Electron:
mht_px += electron.pt * cosh(electron.eta) * sin(electron.phi)
mht_py += electron.pt * cosh(electron.eta) * cos(electron.phi)
mht_pz += electron.pt * sinh(electron.eta)
I would like to do the same, but unfortunately I'm using std::vector
instead of TClonesArray
, so I don't know how to "zip" and loop through all the hit variables together (unless I explicitly call zip() on all of my std::vector
's).
What is the closest thing I can do? In rootpy, I use this nice function tree.define_collection
which allows me to do:
tree.define_collection(name='hits', prefix='vh_', size='vh_size')
for evt in tree:
for hit in evt.hits:
print hit.endcap, hit.station, hit.ring, hit.sim_phi, hit.sim_theta
I would appreciate any suggestion, thanks!
Regarding Numba, there's no explicit activation. If import numba
does not result in an ImportError
, then uproot is already using it to accelerate the branch-reading. So that's contributing to the speedup you already see.
Actually, it's CMSSW that's bundled with Numba; it's not installed by default on CMSLPC.
The nested structures thing wasn't a difference between TClonesArray and vector
The nested structures talk was actually using two codebases together: uproot to read ROOT data and oamap (object-array mapping). Since then, I've rewritten uproot (so that you can use it with vector
rootpy's define_collection
function looks nice; I may implement something similar for uproot (but without modifying the tree
object). For now, you'd have to create a big zip
of them all.
However, that doesn't have to be a chore: it looks like you want to combine all the branches that start with "vh_"
. Okay, there's at least one difference between TClonesArray and vector"vh_size"
, since they can in principle have different sizes. In fact, you didn't have to make a TCloneArray to get branches with a common size branch: you can make them like
tree->Branch("vh_size", &vh_size, "vh_size/I");
tree->Branch("vh_endcap", &vh_endcap, "vh_endcap[vh_size]/F");
// etc...
But that's a digression. You could get a bunch of branch names starting with "vh_"
with something like
hitnames = tree.allkeys(filtername=lambda x: x.startswith("vh_"))
And then extract only these arrays with
arrays = tree.arrays(hitnames, outputtype=tuple)
where the outputtype=tuple
ensures that arrays
is a tuple in the same order as hitnames
. Then to zip over all of them, do
zip(*arrays)
which effectively makes each array in arrays
an argument in zip
. Like this:
>>> for event in zip(*arrays):
... print "new event"
... for name, data in zip(hitnames, event):
... print name, data
...
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.]
new event
v_int16 [1 2 3]
v_int32 [1 2 3]
v_int64 [1 2 3]
v_uint16 [1 2 3]
v_uint32 [1 2 3]
v_uint64 [1 2 3]
v_bool [False True]
v_float [ 999. -999.]
v_double [ 999. -999.]
In the future, let's move these discussions to the uproot users Google Group.
Later, when you need more speed (because you've got your process set up— all exploration done), you'll probably want to put this loop inside of a compiled Numba function, which might complain about the use of zip(*arrays)
. At that point, you'll have to switch it to longhand, but one-line Python commands can spit out the Python code you'd want to copy-paste into such a script.
Thanks a lot for answering all my questions. I'll use the Google Group in the future.
I thought the fact that you could loop over for electron in event.Electron
and access electron.pt
, electron.eta
, etc was because of TClonesArray. But apparently it was something else (oamap). I think that would be a really nice feature to have.
Meanwhile, the snippet that you posted will do the job. That's much better than me writing a really long zip
statement. So, thanks again for all your help!
I ran into a strange case where I could not read a ntuple with
std::vector
after doing 'hadd'. I created a test ntuple to reproduce the issue. It contains one event and has several branches:In uproot 2.5.10, I tried the following and it worked:
However, if I use hadd [1], I got errors (like [2] for
std::vector<int16>
) for doing the same thing as above:I also tried
So it appears that the Streamer info is missing somehow after doing 'hadd'? It's strange... I attach the root files 'stlvector.root' and 'stlvector_after_hadd.root'. I'd appreciate it if you could look into this. Thank you!
Best regards, Jia Fu
[1]
[2]
[3]