Closed bglenardo closed 4 years ago
`asgenobj` means that the objects can't be deserialized directly into NumPy arrays; iteration in Python is necessary. As in #371, it would be easier to read this out if the file were generated with splitting turned on, both in the sense of "easier because Python doesn't need to iterate over the data, and therefore faster" and in the sense of "the file uses custom constructs that maybe Uproot doesn't know how to deserialize." Both for faster deserialization and greater certainty that the case is covered by Uproot, turn splitting on when writing the file.
I was looking at the file to fix Uproot for this case (to get it to deserialize slowly, rather than not at all), and it doesn't seem to be ROOT-serialized. That's the source of the `IndexError`: while trying to deserialize these objects, following the prescription set by the TStreamerInfo, the file pointer gets sent to a crazy position (268436067). The reading is not simply offset by a few bytes by an earlier mistake; there are no offsets that give you data like
| field name | data type (big endian) |
|---|---|
| fTileId | int32 |
| fxTile | float32 |
| fyTile | float32 |
| fXPosition | float32 |
| fYPosition | float32 |
| fChannelLocalId | int32 |
| fChannelCharge | float32 |
| fChannelInductionAmplitude | float32 |
| fChannelFirstTime | float32 |
| fChannelLatestTime | float32 |
| fChannelTime | float32 |
| fChannelNTE | int32 |
| fChannelNoiseTag | int32 |
| fInductionAmplitude | float32 |
| fNoiseOn | bool |
| fWFLen | u4 |
| fWFChannelCharge | float32 |
| fWFAmplitude | array of int16 |
| fNoiseWF | array of int16 |
Instead, there are batches of int32s, followed by batches of float32s, like:
```
0, 0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 7,
0, 0, 0, 8, 0, 0, 0, 20, 0, 0, 0, 21, 0, 0, 0, 22, 0, 0, 0, 24, 0, 0, 0, 25
```

and

```
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 16, 0, 0, 195, 16, 0, 0, 195, 16, 0, 0, 67, 168, 0, 0, 67, 168, 0, 0,
```

(which is -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -144., -144., -144., 336., 336. when converted to floats).
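The byte patterns above can be checked with NumPy's big-endian dtypes. This is just an illustrative sketch (the byte values are copied from the dumps above, not read from the file):

```python
import numpy as np

# A few of the big-endian int32s from the first batch:
ints = np.frombuffer(bytes([0, 0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 3]), dtype=">i4")
assert ints.tolist() == [5, 2, 3]

# Big-endian float32 patterns from the second batch:
# 195,168,0,0 is 0xC3A80000 = -336.0; 195,16,0,0 is -144.0; 67,168,0,0 is +336.0
floats = np.frombuffer(bytes([195, 168, 0, 0, 195, 16, 0, 0, 67, 168, 0, 0]), dtype=">f4")
assert floats.tolist() == [-336.0, -144.0, 336.0]
```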
ROOT serialization doesn't put the variables of the same type in batches within a TBranch like this (one might call it "splitting within a branch"). Within a TBranch, we should see the integers and floats interleaved in the order described by the TStreamerInfo (the table above).
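The distinction can be sketched in a few lines with Python's `struct` module; the two-field object here is hypothetical, not the actual `ElecChannel` layout:

```python
import struct

# A toy object with one int32 field and one float32 field; three objects.
objs = [(1, 1.5), (2, 2.5), (3, 3.5)]

# Objectwise (what the TStreamerInfo prescribes): fields interleaved per object.
objectwise = b"".join(struct.pack(">if", i, f) for i, f in objs)

# "Splitting within a branch": all the ints batched, then all the floats.
memberwise = (b"".join(struct.pack(">i", i) for i, _ in objs) +
              b"".join(struct.pack(">f", f) for _, f in objs))

# Same bytes overall, completely different layout:
assert len(objectwise) == len(memberwise)
assert objectwise != memberwise
assert struct.unpack(">3i", memberwise[:12]) == (1, 2, 3)  # a batch of int32s first
```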
I've seen this before, in #403, in which the data serialization ignored the TStreamers and serialized with Boost. Is your file similar to the one described in that issue?
Deserializing Boost-in-ROOT is beyond Uproot's scope. In principle, you need the original C++ methods to do that, since the Boost serialization is described in code.
Alternatively, if you can just write the data with ROOT's splitLevel turned on, each field would be in a separate TBranch. Then it wouldn't just be possible to deserialize; it would also use NumPy rather than falling back to `asgenobj`, the slow-Python mode.
Thanks very much for your quick reply, and the helpful explanation! I'm not sure how the data were being serialized, but after some digging through our software framework, we found that the splitlevel was being set to 1 when branches were created. Changing this to 99 allows the splitting to happen, and enables me to access the `ElecChannel` branch members with uproot.
However, this seems to cause another problem in one of our other data structures. We have another class called `SimEvent` in a different tree in the output files, and when we change the branch splitting to 99, the uproot conversion to a `pandas.DataFrame` always fails with a `ValueError` at a specific place in the output file. What I mean is:
```python
TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4560)
```
works fine, but if I run
```python
TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4561)
```
I get the following error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-7916aa49e7de> in <module>
      1 TFile = up.open('g4only_5000evts_xe127_seed_1_branching_2.root')
      2 events = TFile['Event/Sim/SimEvent']
----> 3 df = events.arrays('*',entrystart=0,entrystop=4561)
      4 #df = events.pandas.df()
      5
~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in arrays(self, branches, outputtype, namedecode, entrystart, entrystop, flatten, flatname, awkwardlib, cache, basketcache, keycache, executor, blocking)
    515
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518
    519         # make functions that wait for the filling job to be done and return the right outputtype
~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in <listcomp>(.0)
    515
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518
    519         # make functions that wait for the filling job to be done and return the right outputtype
~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in array(self, interpretation, entrystart, entrystop, flatten, awkwardlib, cache, basketcache, keycache, executor, blocking)
   1420         if executor is None:
   1421             for j in range(basketstop - basketstart):
-> 1422                 _delayedraise(fill(j))
   1423             excinfos = ()
   1424         else:
~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in _delayedraise(excinfo)
     56             exec("raise cls, err, trc")
     57     else:
---> 58         raise err.with_traceback(trc)
     59
     60 def _filename_explode(x):
~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in fill(j)
   1413                     basket_itemoffset[j + 1],
   1414                     basket_entryoffset[j],
-> 1415                     basket_entryoffset[j + 1])
   1416
   1417             except Exception:
~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/numerical.py in fill(self, source, destination, itemstart, itemstop, entrystart, entrystop)
     65
     66     def fill(self, source, destination, itemstart, itemstop, entrystart, entrystop):
---> 67         destination.reshape(-1)[itemstart:itemstop] = source.reshape(-1)
     68
     69     def clip(self, destination, itemstart, itemstop, entrystart, entrystop):
ValueError: could not broadcast input array from shape (2280) into shape (3420)
```
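The failing assignment can be reproduced in isolation with plain NumPy, using the two sizes from the error message (this is only an illustration of the broadcast failure, not of the file reading itself):

```python
import numpy as np

destination = np.empty(3420, dtype=np.uint32)  # size Uproot preallocated
source = np.empty(2280, dtype=np.uint32)       # size the TBasket actually held

try:
    destination.reshape(-1)[0:3420] = source.reshape(-1)
except ValueError as err:
    message = str(err)

assert "2280" in message and "3420" in message
```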
I get the same failure, always at event 4561, no matter what the simulation inputs are (for example, same error if I change the random seed). The failure is not present when the branch splitlevel is 1. Is this also an issue with serialization / how we're writing the data to disk?
I've uploaded an example file here: https://drive.google.com/file/d/1eLCWBRifcZ6tDcagPM42E08GMmCzHsxM/view?usp=sharing
Thanks again!
The short, and probably good, news is that it only affects a TBranch you don't care about. The one TBranch that is giving you this error is named `"fBits"`, and it's one of the two metadata TBranches that TObjects carry with them (along with `"fUniqueID"`). You can ignore these by replacing the `"*"` with

```python
[x for x in events.allkeys() if events[x].interpretation is not None and x != b"fBits"]
```
The reason you saw the error at a particular event number didn't have anything to do with that event; it was the threshold where you read out more than one TBasket of `"fBits"`. This TBasket was coming out with the wrong size.
Specifically, these are the TBasket data sizes that Uproot predicts:

```python
>>> [events["fBits"].basket_uncompressedbytes(i) for i in range(events["fBits"].numbaskets)]
[22808, 22808, 4408]
```

and these are the sizes that actually come out:

```python
>>> [events["fBits"].basket(i).nbytes for i in range(events["fBits"].numbaskets)]
[9120, 9120, 1760]
```
The prediction is exactly `actual * 2.5 + 8`. Uproot's prediction is just reading `fObjlen` from each TBasket TKey, which ought to be, and always has been, the "length of the uncompressed object." In this case, it's wrong.
Was this file actually produced by Geant? Geant has its own ROOT file writer, and it's possible that it gets some things wrong, such as this `fObjlen`, which isn't strictly needed in event-at-a-time mode (how ROOT is usually tested), only in Uproot's array-at-a-time mode (because we have to preallocate the array before filling it). It's possible that the error was never "stumbled over," so to speak.
I could put in specialized logic to predict the size of non-jagged numerical types as the number of entries times the item size, but that would assume that we trust the number of entries more than `fObjlen` when they're in conflict, and it won't work for non-numerical or jagged data. Maybe I will if I see the same error again from another user, indicating that it's common.
I don't believe this file was produced directly by Geant4. We are using a software framework called SNiPER (developed for the JUNO experiment), and the framework appears to handle all the ROOT I/O. I myself am just getting started with it, so I don't know the intricate details. But I will forward this information to more knowledgeable people.
In the meantime, I think we can work around this using the prescription you suggest. Thanks very much!
I'm guessing I can close this? Let me know if I'm wrong.
As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above), it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot4#38. (This comment is a form message I'm writing on all 6 issues.)
As of PR scikit-hep/uproot4#87, we can now detect such cases, so at least we'll raise a `NotImplementedError` instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot4#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)
Hi,
I have a use case in which we have a branch that is a vector of custom class objects (in our case, the class is called `ElecChannel`), and I would like to access individual members of these objects (which include some `vector<short>` objects). If I try following the instructions from Issue #371, I get an `IndexError`.
Using `show` tells me that the fElecChannels branch is being interpreted as a generic object, though I'm not sure exactly what that means. Is there a way I can read this out into arrays?
An example file can be found here
Thanks so much!