scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

IndexError while reading a vector of custom class objects from tree #475

Closed bglenardo closed 4 years ago

bglenardo commented 4 years ago

Hi,

I have a use case in which we have a branch that is a vector of custom class objects (in our case, the class is called ElecChannel), and I would like to access individual members of these objects (which include some vector<short> objects). If I try following the instructions from Issue #371, I get an IndexError:

elec = TFile['Event/Elec/ElecEvent']
obj = elec['fElecChannels'].array()[0]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-17-4e71d5389b06> in <module>
      1 elec = TFile['Event/Elec/ElecEvent']
      2 print(elec.show())
----> 3 obj = elec['fElecChannels'].array()[0]
      4 
      5 #branch = elec['ElecEvent']['fElecChannels']

~/localpythonpackages/lib/python3.7/site-packages/awkward/array/objects.py in __getitem__(self, where)
    191         if self._util_isinteger(head):
    192             if isinstance(tail, tuple) and tail == ():
--> 193                 return self.generator(content, *self._args, **self._kwargs)
    194             else:
    195                 return self.generator(content, *self._args, **self._kwargs)[tail]

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/objects.py in __call__(self, arg)
    377             source = uproot.source.source.Source(bytes)
    378             cursor = uproot.source.cursor.Cursor(0, origin=origin)
--> 379             return self.cls.read(source, cursor, self.context, None)
    380         def __repr__(self):
    381             if isinstance(self.cls, type):

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/objects.py in read(self, source, cursor, context, parent)
     68             out = [None] * numitems
     69             for i in range(numitems):
---> 70                 out[i] = self.cls.read(source, cursor, context, parent)
     71             return out
     72 

~/localpythonpackages/lib/python3.7/site-packages/uproot/rootio.py in read(cls, source, cursor, context, parent)
    963             context = context.copy()
    964         out = cls.__new__(cls)
--> 965         out = cls._readinto(out, source, cursor, context, parent)
    966         out._postprocess(source, cursor, context, parent)
    967         return out

~/localpythonpackages/lib/python3.7/site-packages/uproot/rootio.py in _readinto(cls, self, source, cursor, context, parent, asclass)

~/localpythonpackages/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, source, length, dtype)
     59         start = self.index
     60         stop = self.index = start + length*dtype.itemsize
---> 61         return source.data(start, stop, dtype)
     62 
     63     def string(self, source):

~/localpythonpackages/lib/python3.7/site-packages/uproot/source/source.py in data(self, start, stop, dtype)
     39 
     40         if stop > len(self._source):
---> 41             raise IndexError("indexes {0}:{1} are beyond the end of data source of length {2}".format(start, stop, len(self._source)))
     42 
     43         if dtype is None:

IndexError: indexes 611:268436067 are beyond the end of data source of length 4787

Using show() tells me that the fElecChannels branch is being interpreted as a generic object, though I'm not sure exactly what that means:

ElecEvent                  TStreamerInfo              None
nEXO::EventObject          TStreamerInfo              asgenobj(nEXO_3a3a_EventObject)
fElecChannels              TStreamerSTL               asgenobj(STLVector(nEXO_3a3a_ElecChannel))
fNTE                       TStreamerBasicType         asdtype('>i4')
fEnergy                    TStreamerBasicType         asdtype('>f4')
fmc_charge                 TStreamerSTL               asjagged(asdtype('>f4'), 10)
fmc_tepos                  TStreamerSTL               None

Is there a way I can read this out into arrays?

An example file can be found here

Thanks so much!

jpivarski commented 4 years ago

asgenobj means that the objects can't be deserialized directly into NumPy arrays and have to be iterated over in Python. As in #371, this would be easier to read if the file could be generated with splitting turned on: "easier" both in the sense that Python wouldn't need to iterate over the data (and would therefore be faster) and in the sense that the unsplit form uses custom constructs that Uproot might not know how to deserialize. So, both for faster deserialization and for greater certainty that your case is covered by Uproot, turn splitting on when writing the file.
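A quick way to see which branches fall into which category is to print each branch's interpretation; a minimal sketch, assuming elec is the TTree from the snippet above:

for name in elec.allkeys():
    # None means Uproot has no interpretation at all, asgenobj means slow
    # Python-object deserialization, asdtype/asjagged mean fast NumPy reading
    print(name.decode(), elec[name].interpretation)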

I was looking at the file to fix Uproot for this case (to get it to deserialize slowly, rather than not at all), and it doesn't seem to be ROOT-serialized. That's the source of the IndexError: while trying to deserialize these objects, following the prescription set by the TStreamerInfo, the file pointer gets sent to a crazy position (268436067). The reading isn't simply offset by a few bytes from an earlier mistake; there are no offsets at which you find data like

field name                    data type (big endian)
fTileId                       int32
fxTile                        float32
fyTile                        float32
fXPosition                    float32
fYPosition                    float32
fChannelLocalId               int32
fChannelCharge                float32
fChannelInductionAmplitude    float32
fChannelFirstTime             float32
fChannelLatestTime            float32
fChannelTime                  float32
fChannelNTE                   int32
fChannelNoiseTag              int32
fInductionAmplitude           float32
fNoiseOn                      bool
fWFLen                        u4
fWFChannelCharge              float32
fWFAmplitude                  array of int16
fNoiseWF                      array of int16

Instead, there are batches of int32s, followed by batches of float32s, like:

0, 0, 0, 5, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0,  6, 0, 0, 0, 7,
0, 0, 0, 8, 0, 0, 0, 20, 0, 0, 0, 21, 0, 0, 0, 22, 0, 0, 0, 24, 0, 0, 0, 25

and

195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195,  16, 0, 0, 195,  16, 0, 0, 195,  16, 0, 0,  67, 168, 0, 0,  67, 168, 0, 0,

(which is -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -144., -144., -144., 336., 336 when converted to floats).
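(As a quick check, converting one of those four-byte groups with NumPy reproduces the value; a minimal sketch:)

import numpy as np
np.frombuffer(bytes([195, 168, 0, 0]), dtype=">f4")   # array([-336.], dtype=float32)
np.frombuffer(bytes([67, 168, 0, 0]), dtype=">f4")    # array([336.], dtype=float32)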

ROOT serialization doesn't put the variables of the same type in batches within a TBranch like this (one might call it "splitting within a branch"). Within a TBranch, we should see the integers and floats interleaved in the order described by the TStreamerInfo (the table above).
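For illustration, an objectwise layout would look roughly like one fixed-size record per channel (ignoring the variable-length waveform arrays at the end); here is a sketch of the record implied by the table above, expressed as a NumPy dtype, not anything Uproot actually produces:

import numpy as np
# hypothetical per-channel record implied by the TStreamerInfo (big-endian),
# truncated before the variable-length fWFAmplitude/fNoiseWF arrays
elecchannel_head = np.dtype([
    ("fTileId", ">i4"), ("fxTile", ">f4"), ("fyTile", ">f4"),
    ("fXPosition", ">f4"), ("fYPosition", ">f4"),
    ("fChannelLocalId", ">i4"), ("fChannelCharge", ">f4"),
    ("fChannelInductionAmplitude", ">f4"), ("fChannelFirstTime", ">f4"),
    ("fChannelLatestTime", ">f4"), ("fChannelTime", ">f4"),
    ("fChannelNTE", ">i4"), ("fChannelNoiseTag", ">i4"),
    ("fInductionAmplitude", ">f4"), ("fNoiseOn", "u1"),
    ("fWFLen", ">u4"), ("fWFChannelCharge", ">f4"),
])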

I've seen this before, in #403, in which the data serialization ignored the TStreamers and serialized with Boost. Is your file similar to the one described in that issue?

Deserializing Boost-in-ROOT is beyond Uproot's scope. In principle, you need the original C++ methods to do that, since the Boost serialization is described in code.

Alternatively, if you can just write the data with ROOT's splitLevel turned on, each field would be in a separate TBranch. Then the data wouldn't just become deserializable; it would also be read with NumPy rather than falling back to asgenobj, the slow Python mode.
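For concreteness, here is a minimal PyROOT sketch of what "splitting turned on" means when the branch is created; the file name, tree name, and the availability of a dictionary for the ElecChannel class are assumptions, not something taken from your file:

import ROOT

f = ROOT.TFile("split_example.root", "RECREATE")       # hypothetical output file
tree = ROOT.TTree("ElecEvent", "ElecEvent")            # hypothetical tree name
channels = ROOT.std.vector("nEXO::ElecChannel")()      # requires the class dictionary
tree.Branch("fElecChannels", channels, 32000, 99)      # bufsize=32000, splitLevel=99
# ... fill `channels`, call tree.Fill() per event, then tree.Write() and f.Close()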

bglenardo commented 4 years ago

Thanks very much for your quick reply, and the helpful explanation! I'm not sure how the data were being serialized, but after some digging through our software framework, we found that the splitlevel was being set to 1 when branches were created. Changing this to 99 allows the splitting to happen and enables me to access the ElecChannel branch members with uproot.
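For example, something like the following works now (a sketch; the dotted sub-branch names are ROOT's usual convention for split branches, and the exact names come from show()):

import uproot as up

f = up.open('elecevent_split_example.root')            # hypothetical split file
elec = f['Event/Elec/ElecEvent']
# with splitLevel=99, each ElecChannel member is its own (jagged) branch:
charge = elec['fElecChannels.fChannelCharge'].array()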

However, this seems to cause another problem in one of our other data structures. We have another class called SimEvent in a different tree in the output files, and when we change the branch splitting to 99, the uproot conversion to a pandas.DataFrame always fails with a ValueError at a specific place in the output file. What I mean is:

import uproot as up
import pandas as pd

TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4560)

works fine, but if I run

TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4561)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-7916aa49e7de> in <module>
      1 TFile = up.open('g4only_5000evts_xe127_seed_1_branching_2.root')
      2 events = TFile['Event/Sim/SimEvent']
----> 3 df = events.arrays('*',entrystart=0,entrystop=4561)
      4 #df = events.pandas.df()
      5 

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in arrays(self, branches, outputtype, namedecode, entrystart, entrystop, flatten, flatname, awkwardlib, cache, basketcache, keycache, executor, blocking)
    515 
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518 
    519         # make functions that wait for the filling job to be done and return the right outputtype

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in <listcomp>(.0)
    515 
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518 
    519         # make functions that wait for the filling job to be done and return the right outputtype

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in array(self, interpretation, entrystart, entrystop, flatten, awkwardlib, cache, basketcache, keycache, executor, blocking)
   1420         if executor is None:
   1421             for j in range(basketstop - basketstart):
-> 1422                 _delayedraise(fill(j))
   1423             excinfos = ()
   1424         else:

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in _delayedraise(excinfo)
     56             exec("raise cls, err, trc")
     57         else:
---> 58             raise err.with_traceback(trc)
     59 
     60 def _filename_explode(x):

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in fill(j)
   1413                                     basket_itemoffset[j + 1],
   1414                                     basket_entryoffset[j],
-> 1415                                     basket_entryoffset[j + 1])
   1416 
   1417             except Exception:

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/numerical.py in fill(self, source, destination, itemstart, itemstop, entrystart, entrystop)
     65 
     66     def fill(self, source, destination, itemstart, itemstop, entrystart, entrystop):
---> 67         destination.reshape(-1)[itemstart:itemstop] = source.reshape(-1)
     68 
     69     def clip(self, destination, itemstart, itemstop, entrystart, entrystop):

ValueError: could not broadcast input array from shape (2280) into shape (3420)

I get the same failure, always at event 4561, no matter what the simulation inputs are (for example, same error if I change the random seed). The failure is not present when the branch splitlevel is 1. Is this also an issue with serialization / how we're writing the data to disk?

I've uploaded an example file here: https://drive.google.com/file/d/1eLCWBRifcZ6tDcagPM42E08GMmCzHsxM/view?usp=sharing

Thanks again!

jpivarski commented 4 years ago

The short, and probably good, news is that it only affects a TBranch you don't care about. The one TBranch that is giving you this error is named "fBits" and it's one of the two metadata TBranches that TObjects carry with them (along with "fUniqueID"). You can ignore these by replacing the "*" with

[x for x in events.allkeys() if events[x].interpretation is not None and x != b"fBits"]
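
For example, passing that list to arrays instead of "*" (a sketch based on the call above):

branches = [x for x in events.allkeys() if events[x].interpretation is not None and x != b"fBits"]
df = events.arrays(branches, outputtype=pd.DataFrame, entrystart=0, entrystop=4561)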

The reason you saw the error at a particular event number didn't have anything to do with that event; it was the threshold where you read out more than one TBasket of "fBits". This TBasket was coming out with the wrong size.

Specifically, these are the TBasket data sizes that Uproot predicts:

[events["fBits"].basket_uncompressedbytes(i) for i in range(events["fBits"].numbaskets)]
[22808, 22808, 4408]

and these are the sizes that come out:

[events["fBits"].basket(i).nbytes for i in range(events["fBits"].numbaskets)]
[9120, 9120, 1760]

The prediction is exactly actual * 2.5 + 8. Uproot's prediction just comes from reading fObjlen from each TBasket TKey, which ought to be (and always has been) the "length of the uncompressed object." In this case, it's wrong.
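(Checking that factor against the numbers above:)

for actual, predicted in zip([9120, 9120, 1760], [22808, 22808, 4408]):
    assert predicted == actual * 2.5 + 8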

Was this file actually produced by Geant? Geant has its own ROOT file writer, and it's possible that it gets some things wrong, such as this fObjlen, which isn't strictly needed in event-at-a-time mode (how ROOT is usually tested), only in Uproot's array-at-a-time mode (because we have to preallocate the array before filling it). It's possible that the error was never "stumbled over," so to speak.

I could put in specialized logic to predict the size of non-jagged numerical types as the number of entries times the item size, but that would assume that we trust the number of entries more than fObjlen when they're in conflict, and it won't work for non-numerical or jagged data. Maybe I'll do that if I see the same error again from another user, indicating that it's common.
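A minimal sketch of what that fallback could look like (hypothetical; not code that exists in Uproot):

def predicted_basket_bytes(numentries, itemsize):
    # hypothetical fallback for flat, non-jagged numerical branches: trust the
    # entry count rather than the TKey's fObjlen when the two are in conflict
    return numentries * itemsize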

bglenardo commented 4 years ago

I don't believe this file was produced directly by Geant4. We are using a software framework called SNiPER (developed for the JUNO experiment), and the framework appears to handle all the ROOT I/O. I myself am just getting started with it, so I don't know the intricate details. But I will forward this information to more knowledgeable people.

In the meantime, I think we can work around this using the prescription you suggest. Thanks very much!

jpivarski commented 4 years ago

I'm guessing I can close this? Let me know if I'm wrong.

jpivarski commented 4 years ago

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above) it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot4#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot4#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot4#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)