Deserialization error in AsStridedObjects but not AsObjects for an example with split level 0.

scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.

BSD 3-Clause "New" or "Revised" License

>>> import uproot
>>> uproot.__version__
'4.0.4'
>>> f = uproot.open("split_1.objectwise.root")
>>> f["T/whatever"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsGroup(<TBranchElement 'whate
TObject              | unknown                  | <UnknownInterpretation 'non...
a                    | double                   | AsDtype('>f8')
b                    | int32_t                  | AsDtype('>i4')
>>> f["T/whatever/a"].array()[:10]
<Array [0, 10.1, 20.2, ... 70.7, 80.8, 90.9] type='10 * float64'>
>>> f = uproot.open("nosplit.objectwise.root")
>>> f["T/whatever"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsStridedObjects(Model_TWhatev
>>> f["T/whatever"].array()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-7ac5bffa76fd> in <module>
----> 1 f["T/whatever"].array()

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   2070                         ranges_or_baskets.append((branch, basket_num, range_or_basket))
   2071 
-> 2072         _ranges_or_baskets_to_arrays(
   2073             self,
   2074             ranges_or_baskets,

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
   3458 
   3459         elif isinstance(obj, tuple) and len(obj) == 3:
-> 3460             uproot.source.futures.delayed_raise(*obj)
   3461 
   3462         else:

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
     44         exec("raise exception_class, exception_value, traceback")
     45     else:
---> 46         raise exception_value.with_traceback(traceback)
     47 
     48 

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in basket_to_array(basket)
   3415                 )
   3416             if basket.num_entries != len(basket_arrays[basket.basket_num]):
-> 3417                 raise ValueError(
   3418                     """basket {0} in tree/branch {1} has the wrong number of entries """
   3419                     """(expected {2}, obtained {3}) when interpreted as {4}

ValueError: basket 0 in tree/branch /T;1:whatever has the wrong number of entries (expected 1064, obtained 836) when interpreted as AsStridedObjects(Model_TWhatever_v5)
in file nosplit.objectwise.root

One suggestive thing is that the split version has a whatever/TObject that couldn't be interpreted. I wonder if that's some kind of header.

>>> tree = uproot.open("split_1.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsGroup(<TBranchElement 'wh...
whatever/TObject     | unknown                  | <UnknownInterpretation 'non...
whatever/a           | double                   | AsDtype('>f8')
whatever/b           | int32_t                  | AsDtype('>i4')

Another hint is that the unsplit version can be interpreted AsObjects but not AsStridedObjects. The auto-determined interpretation is AsStridedObjects:

>>> tree = uproot.open("nosplit.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsStridedObjects(Model_TWha...

which we can reproduce explicitly by calling uproot.interpretation.identify.interpretation_of:

>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {})
AsStridedObjects(Model_TWhatever_v5)

(I'm passing {} as the context because this object isn't deep; it probably doesn't need all the information about how we got to this point in deserialization. Oh, I could have just passed TBranch.context. That would have been better, but this is okay.)

Now let's remove the simplification step that replaces AsObjects with AsStridedObjects.

>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False)
AsObjects(Model_TWhatever)
>>> tree["whatever"].array(uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False))
<Array [{a: 0, b: 0}, ... b: -5000}] type='5001 * TWhatever["a": float64, "b": i...'>

Aha! We can deserialize it! It's slow (there's a noticeable lag with 5001 elements), but it does bracket the error between AsObjects and AsStridedObjects.

The difference between AsObjects and AsStridedObjects is that AsObjects walks through Python loops, element by element, byte by byte, whereas AsStridedObjects casts the buffer as a NumPy structured array and then pulls each field out by field access. Objects whose fields all have fixed sizes, even if those sizes differ (some bools, some 32-bit integers, some 64-bit floats), can be interpreted by striding, and we use that to read them much more quickly. Objects with variable-length fields, such as strings or std::vector, can't, so we have to fall back to AsObjects. This might become moot when AwkwardForth is introduced (AwkwardForth can deal with variable-length data and might be as fast as striding), but it isn't yet.
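To make the two reading strategies concrete, here is a minimal sketch in plain NumPy. It is not Uproot's implementation: the record layout (a big-endian float64 `a` followed by a big-endian int32 `b`, with no header bytes) and the names `record`, `read_by_loop`, and `read_by_striding` are assumptions made up for illustration.

```python
import struct

import numpy as np

# Hypothetical fixed-size record: big-endian float64 "a", then big-endian int32 "b".
# ">" disables padding, so each record occupies exactly 12 bytes.
record = struct.Struct(">di")

# Build a small raw buffer of five records (illustrative data, not read from a file).
raw = b"".join(record.pack(i * 10.1, -i) for i in range(5))


def read_by_loop(buffer):
    """Object-by-object reading: one Python-loop iteration per record."""
    out = []
    for offset in range(0, len(buffer), record.size):
        a, b = record.unpack_from(buffer, offset)
        out.append({"a": a, "b": b})
    return out


def read_by_striding(buffer):
    """Reinterpret the whole buffer at once as a NumPy structured array."""
    dtype = np.dtype([("a", ">f8"), ("b", ">i4")])
    return np.frombuffer(buffer, dtype=dtype)


looped = read_by_loop(raw)
strided = read_by_striding(raw)

# Field access on the structured array returns a strided view of each column.
print(strided["a"])
print(strided["b"])
```

The striding version only works because every record has the same size, so the offset of record `i` is simply `i * itemsize`. If the assumed `dtype` has the wrong item size (for instance, because of header bytes it does not account for), the buffer is cut into the wrong number of records.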

It could be that, because of the TObject header that we don't know how to interpret when split, it is incorrect to simplify this particular AsObjects to AsStridedObjects. In other words, the bug could be in the rules that decide whether to simplify it.

On the other hand, it could be that we can read this by striding, but are currently doing it incorrectly. That would require more research.
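As a starting point for that research, the entry counts in the error message can be checked with simple arithmetic. The sketch below only assumes that the strided reader divides a fixed number of bytes by a per-object size; the byte sizes in it are hypothetical, chosen because they have the right ratio, and were not measured from the file.

```python
from fractions import Fraction

expected_entries = 1064  # from the error message
obtained_entries = 836   # from the error message

# The ratio reduces to 14/11.
ratio = Fraction(expected_entries, obtained_entries)
print(ratio)


def entries_seen(total_bytes, assumed_itemsize):
    """Number of whole records a strided reader would find in total_bytes."""
    return total_bytes // assumed_itemsize


# HYPOTHETICAL sizes: any pair of item sizes in the ratio 11:14 gives the same
# counts. 22 and 28 bytes are just one such pair, not values taken from the file.
true_itemsize = 22
assumed_itemsize = 28
total_bytes = expected_entries * true_itemsize

print(entries_seen(total_bytes, true_itemsize))
print(entries_seen(total_bytes, assumed_itemsize))
```

If a pattern like this holds, the strided interpretation is assuming more bytes per object than the basket actually contains, which would be consistent with either hypothesis above: a simplification rule that should not have fired, or a strided layout that counts header bytes incorrectly.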

Deserialization error in AsStridedObjects but not AsObjects for an example with split level 0. #275