scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

IndexError while reading a vector of custom class objects from tree #475

Closed bglenardo closed 4 years ago

bglenardo commented 4 years ago

Hi,

I have a use case in which we have a branch that is a vector of custom class objects (in our case, the class is called ElecChannel), and I would like to access individual members of these objects (which include some vector<short> objects). If I try following the instructions from Issue #371, I get an IndexError:

elec = TFile['Event/Elec/ElecEvent']
obj = elec['fElecChannels'].array()[0]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-17-4e71d5389b06> in <module>
      1 elec = TFile['Event/Elec/ElecEvent']
      2 print(elec.show())
----> 3 obj = elec['fElecChannels'].array()[0]
      4 
      5 #branch = elec['ElecEvent']['fElecChannels']

~/localpythonpackages/lib/python3.7/site-packages/awkward/array/objects.py in __getitem__(self, where)
    191         if self._util_isinteger(head):
    192             if isinstance(tail, tuple) and tail == ():
--> 193                 return self.generator(content, *self._args, **self._kwargs)
    194             else:
    195                 return self.generator(content, *self._args, **self._kwargs)[tail]

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/objects.py in __call__(self, arg)
    377             source = uproot.source.source.Source(bytes)
    378             cursor = uproot.source.cursor.Cursor(0, origin=origin)
--> 379             return self.cls.read(source, cursor, self.context, None)
    380         def __repr__(self):
    381             if isinstance(self.cls, type):

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/objects.py in read(self, source, cursor, context, parent)
     68             out = [None] * numitems
     69             for i in range(numitems):
---> 70                 out[i] = self.cls.read(source, cursor, context, parent)
     71             return out
     72 

~/localpythonpackages/lib/python3.7/site-packages/uproot/rootio.py in read(cls, source, cursor, context, parent)
    963             context = context.copy()
    964         out = cls.__new__(cls)
--> 965         out = cls._readinto(out, source, cursor, context, parent)
    966         out._postprocess(source, cursor, context, parent)
    967         return out

~/localpythonpackages/lib/python3.7/site-packages/uproot/rootio.py in _readinto(cls, self, source, cursor, context, parent, asclass)

~/localpythonpackages/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, source, length, dtype)
     59         start = self.index
     60         stop = self.index = start + length*dtype.itemsize
---> 61         return source.data(start, stop, dtype)
     62 
     63     def string(self, source):

~/localpythonpackages/lib/python3.7/site-packages/uproot/source/source.py in data(self, start, stop, dtype)
     39 
     40         if stop > len(self._source):
---> 41             raise IndexError("indexes {0}:{1} are beyond the end of data source of length {2}".format(start, stop, len(self._source)))
     42 
     43         if dtype is None:

IndexError: indexes 611:268436067 are beyond the end of data source of length 4787

Using show() tells me that the fElecChannels branch is being interpreted as a generic object, though I'm not sure exactly what that means:

ElecEvent                  TStreamerInfo              None
nEXO::EventObject          TStreamerInfo              asgenobj(nEXO_3a3a_EventObject)
fElecChannels              TStreamerSTL               asgenobj(STLVector(nEXO_3a3a_ElecChannel))
fNTE                       TStreamerBasicType         asdtype('>i4')
fEnergy                    TStreamerBasicType         asdtype('>f4')
fmc_charge                 TStreamerSTL               asjagged(asdtype('>f4'), 10)
fmc_tepos                  TStreamerSTL               None

Is there a way I can read this out into arrays?

An example file can be found here

Thanks so much!

jpivarski commented 4 years ago

asgenobj means that the objects can't be deserialized directly into NumPy arrays and have to be iterated over in Python. As in #371, this would be easier to read if the file could be generated with splitting turned on: "easier" both in the sense that Python wouldn't need to iterate over the data (and would therefore be faster) and in the sense that the unsplit form uses custom constructs that Uproot might not know how to deserialize. So, both for faster deserialization and for greater certainty that your case is covered by Uproot, turn splitting on when writing the file.
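A quick way to see which branches fall into which category is to print each branch's interpretation; a minimal sketch, assuming elec is the TTree from the snippet above:

for name in elec.allkeys():
    # None means Uproot has no interpretation at all, asgenobj means slow
    # Python-object deserialization, asdtype/asjagged mean fast NumPy reading
    print(name.decode(), elec[name].interpretation)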

I was looking at the file to fix Uproot for this case (to get it to deserialize slowly, rather than not at all), and it doesn't seem to be ROOT-serialized. That's the source of the IndexError: while trying to deserialize these objects, following the prescription set by the TStreamerInfo, the file pointer gets sent to a crazy position (268436067). The reading isn't simply offset by a few bytes from an earlier mistake; there are no offsets at which you find data like

field name                    data type (big endian)
fTileId                       int32
fxTile                        float32
fyTile                        float32
fXPosition                    float32
fYPosition                    float32
fChannelLocalId               int32
fChannelCharge                float32
fChannelInductionAmplitude    float32
fChannelFirstTime             float32
fChannelLatestTime            float32
fChannelTime                  float32
fChannelNTE                   int32
fChannelNoiseTag              int32
fInductionAmplitude           float32
fNoiseOn                      bool
fWFLen                        u4
fWFChannelCharge              float32
fWFAmplitude                  array of int16
fNoiseWF                      array of int16

Instead, there are batches of int32s, followed by batches of float32s, like:

0, 0, 0, 5, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0,  6, 0, 0, 0, 7,
0, 0, 0, 8, 0, 0, 0, 20, 0, 0, 0, 21, 0, 0, 0, 22, 0, 0, 0, 24, 0, 0, 0, 25

and

195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0, 195, 168, 0, 0,
195,  16, 0, 0, 195,  16, 0, 0, 195,  16, 0, 0,  67, 168, 0, 0,  67, 168, 0, 0,

(which is -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -336., -144., -144., -144., 336., 336 when converted to floats).
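(As a quick check, converting one of those four-byte groups with NumPy reproduces the value; a minimal sketch:)

import numpy as np
np.frombuffer(bytes([195, 168, 0, 0]), dtype=">f4")   # array([-336.], dtype=float32)
np.frombuffer(bytes([67, 168, 0, 0]), dtype=">f4")    # array([336.], dtype=float32)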

ROOT serialization doesn't put the variables of the same type in batches within a TBranch like this (one might call it "splitting within a branch"). Within a TBranch, we should see the integers and floats interleaved in the order described by the TStreamerInfo (the table above).
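For illustration, an objectwise layout would look roughly like one fixed-size record per channel (ignoring the variable-length waveform arrays at the end); here is a sketch of the record implied by the table above, expressed as a NumPy dtype, not anything Uproot actually produces:

import numpy as np
# hypothetical per-channel record implied by the TStreamerInfo (big-endian),
# truncated before the variable-length fWFAmplitude/fNoiseWF arrays
elecchannel_head = np.dtype([
    ("fTileId", ">i4"), ("fxTile", ">f4"), ("fyTile", ">f4"),
    ("fXPosition", ">f4"), ("fYPosition", ">f4"),
    ("fChannelLocalId", ">i4"), ("fChannelCharge", ">f4"),
    ("fChannelInductionAmplitude", ">f4"), ("fChannelFirstTime", ">f4"),
    ("fChannelLatestTime", ">f4"), ("fChannelTime", ">f4"),
    ("fChannelNTE", ">i4"), ("fChannelNoiseTag", ">i4"),
    ("fInductionAmplitude", ">f4"), ("fNoiseOn", "u1"),
    ("fWFLen", ">u4"), ("fWFChannelCharge", ">f4"),
])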

I've seen this before, in #403, in which the data serialization ignored the TStreamers and serialized with Boost. Is your file similar to the one described in that issue?

Deserializing Boost-in-ROOT is beyond Uproot's scope. In principle, you need the original C++ methods to do that, since the Boost serialization is described in code.

Alternatively, if you can just write the data with ROOT's splitLevel turned on, each field would be in a separate TBranch. Then the data wouldn't just become deserializable; it would also be read with NumPy rather than falling back to asgenobj, the slow Python mode.
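For concreteness, here is a minimal PyROOT sketch of what "splitting turned on" means when the branch is created; the file name, tree name, and the availability of a dictionary for the ElecChannel class are assumptions, not something taken from your file:

import ROOT

f = ROOT.TFile("split_example.root", "RECREATE")       # hypothetical output file
tree = ROOT.TTree("ElecEvent", "ElecEvent")            # hypothetical tree name
channels = ROOT.std.vector("nEXO::ElecChannel")()      # requires the class dictionary
tree.Branch("fElecChannels", channels, 32000, 99)      # bufsize=32000, splitLevel=99
# ... fill `channels`, call tree.Fill() per event, then tree.Write() and f.Close()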

bglenardo commented 4 years ago

Thanks very much for your quick reply, and the helpful explanation! I'm not sure how the data were being serialized, but after some digging through our software framework, we found that the splitlevel was being set to 1 when branches were created. Changing this to 99 allows the splitting to happen and enables me to access the ElecChannel branch members with uproot.
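For example, something like the following works now (a sketch; the dotted sub-branch names are ROOT's usual convention for split branches, and the exact names come from show()):

import uproot as up

f = up.open('elecevent_split_example.root')            # hypothetical split file
elec = f['Event/Elec/ElecEvent']
# with splitLevel=99, each ElecChannel member is its own (jagged) branch:
charge = elec['fElecChannels.fChannelCharge'].array()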

However, this seems to cause another problem in one of our other data structures. We have another class called SimEvent in a different tree in the output files, and when we change the branch splitting to 99, the uproot conversion to a pandas.DataFrame always fails with a ValueError at a specific place in the output file. What I mean is:

import uproot as up
import pandas as pd

TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4560)

works fine, but if I run

TFile = up.open('g4only_5000evts_xe127_seed_1.root')
events = TFile['Event/Sim/SimEvent']
df = events.arrays('*',outputtype=pd.DataFrame,entrystart=0,entrystop=4561)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-7916aa49e7de> in <module>
      1 TFile = up.open('g4only_5000evts_xe127_seed_1_branching_2.root')
      2 events = TFile['Event/Sim/SimEvent']
----> 3 df = events.arrays('*',entrystart=0,entrystop=4561)
      4 #df = events.pandas.df()
      5 

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in arrays(self, branches, outputtype, namedecode, entrystart, entrystop, flatten, flatname, awkwardlib, cache, basketcache, keycache, executor, blocking)
    515 
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518 
    519         # make functions that wait for the filling job to be done and return the right outputtype

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in <listcomp>(.0)
    515 
    516         # start the job of filling the arrays
--> 517         futures = [(branch.name if namedecode is None else branch.name.decode(namedecode), interpretation, branch.array(interpretation=interpretation, entrystart=entrystart, entrystop=entrystop, flatten=(flatten and not ispandas), awkwardlib=awkward, cache=cache, basketcache=basketcache, keycache=keycache, executor=executor, blocking=False)) for branch, interpretation in branches]
    518 
    519         # make functions that wait for the filling job to be done and return the right outputtype

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in array(self, interpretation, entrystart, entrystop, flatten, awkwardlib, cache, basketcache, keycache, executor, blocking)
   1420         if executor is None:
   1421             for j in range(basketstop - basketstart):
-> 1422                 _delayedraise(fill(j))
   1423             excinfos = ()
   1424         else:

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in _delayedraise(excinfo)
     56             exec("raise cls, err, trc")
     57         else:
---> 58             raise err.with_traceback(trc)
     59 
     60 def _filename_explode(x):

~/localpythonpackages/lib/python3.7/site-packages/uproot/tree.py in fill(j)
   1413                                     basket_itemoffset[j + 1],
   1414                                     basket_entryoffset[j],
-> 1415                                     basket_entryoffset[j + 1])
   1416 
   1417             except Exception:

~/localpythonpackages/lib/python3.7/site-packages/uproot/interp/numerical.py in fill(self, source, destination, itemstart, itemstop, entrystart, entrystop)
     65 
     66     def fill(self, source, destination, itemstart, itemstop, entrystart, entrystop):
---> 67         destination.reshape(-1)[itemstart:itemstop] = source.reshape(-1)
     68 
     69     def clip(self, destination, itemstart, itemstop, entrystart, entrystop):

ValueError: could not broadcast input array from shape (2280) into shape (3420)

I get the same failure, always at event 4561, no matter what the simulation inputs are (for example, same error if I change the random seed). The failure is not present when the branch splitlevel is 1. Is this also an issue with serialization / how we're writing the data to disk?

I've uploaded an example file here: https://drive.google.com/file/d/1eLCWBRifcZ6tDcagPM42E08GMmCzHsxM/view?usp=sharing

Thanks again!

jpivarski commented 4 years ago

The short, and probably good, news is that it only affects a TBranch you don't care about. The one TBranch that is giving you this error is named "fBits" and it's one of the two metadata TBranches that TObjects carry with them (along with "fUniqueID"). You can ignore these by replacing the "*" with

[x for x in events.allkeys() if events[x].interpretation is not None and x != b"fBits"]
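
For example, passing that list to arrays instead of "*" (a sketch based on the call above):

branches = [x for x in events.allkeys() if events[x].interpretation is not None and x != b"fBits"]
df = events.arrays(branches, outputtype=pd.DataFrame, entrystart=0, entrystop=4561)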

The reason you saw the error at a particular event number didn't have anything to do with that event; it was the threshold where you read out more than one TBasket of "fBits". This TBasket was coming out with the wrong size.

Specifically, these are the TBasket data sizes that Uproot predicts:

[events["fBits"].basket_uncompressedbytes(i) for i in range(events["fBits"].numbaskets)]
[22808, 22808, 4408]

and these are the sizes that come out:

[events["fBits"].basket(i).nbytes for i in range(events["fBits"].numbaskets)]
[9120, 9120, 1760]

The prediction is exactly actual * 2.5 + 8. Uproot's prediction just comes from reading fObjlen from each TBasket TKey, which ought to be (and always has been) the "length of the uncompressed object." In this case, it's wrong.
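(Checking that factor against the numbers above:)

for actual, predicted in zip([9120, 9120, 1760], [22808, 22808, 4408]):
    assert predicted == actual * 2.5 + 8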

Was this file actually produced by Geant? Geant has its own ROOT file writer, and it's possible that it gets some things wrong, such as this fObjlen, which isn't strictly needed in event-at-a-time mode (how ROOT is usually tested), only in Uproot's array-at-a-time mode (because we have to preallocate the array before filling it). It's possible that the error was never "stumbled over," so to speak.

I could put in specialized logic to predict the size of non-jagged numerical types as the number of entries times the item size, but that would assume that we trust the number of entries more than fObjlen when they're in conflict, and it won't work for non-numerical or jagged data. Maybe I'll do that if I see the same error again from another user, indicating that it's common.
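A minimal sketch of what that fallback could look like (hypothetical; not code that exists in Uproot):

def predicted_basket_bytes(numentries, itemsize):
    # hypothetical fallback for flat, non-jagged numerical branches: trust the
    # entry count rather than the TKey's fObjlen when the two are in conflict
    return numentries * itemsize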

bglenardo commented 4 years ago

I don't believe this file was produced directly by Geant4. We are using a software framework called SNiPER (developed for the JUNO experiment), and the framework appears to handle all the ROOT I/O. I myself am just getting started with it, so I don't know the intricate details. But I will forward this information to more knowledgeable people.

In the meantime, I think we can work around this using the prescription you suggest. Thanks very much!

jpivarski commented 4 years ago

I'm guessing I can close this? Let me know if I'm wrong.

jpivarski commented 4 years ago

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above) it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot4#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot4#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot4#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)