scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.
https://uproot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Deserialization error in AsStridedObjects but not AsObjects for an example with split level 0. #275

Open tamasgal opened 3 years ago

tamasgal commented 3 years ago

I was not sure whether I should post this in https://github.com/scikit-hep/uproot4/issues/38 but at least it's not directly related to memberwise splitting, so I guess a new issue is fine ;).

I decided to spend some time today on the memberwise-mystery and discovered that uproot chokes on split level 0 with a simple class.

A dummy project where I started to explore the memberwise splitting can be used to reproduce the ROOT files and of course it also includes the class definition and tree configuration: https://github.com/tamasgal/root_splitting

So, back on track: I attached two files containing the same data, one created with split level 0 and the other with split level 1. The latter works fine, but the former (split level 0) causes problems due to some misinterpretation of the number of entries. Although split level 0 is very uncommon, maybe this sheds light on some not yet understood aspects of the serialisation. I have not looked closer, but I wanted to dump my findings...

>>> import uproot

>>> uproot.__version__
'4.0.4'

>>> f = uproot.open("split_1.objectwise.root")

>>> f["T/whatever"].show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsGroup(<TBranchElement 'whate
TObject              | unknown                  | <UnknownInterpretation 'non...
a                    | double                   | AsDtype('>f8')
b                    | int32_t                  | AsDtype('>i4')

>>> f["T/whatever/a"].array()[:10]
<Array [0, 10.1, 20.2, ... 70.7, 80.8, 90.9] type='10 * float64'>

>>> f = uproot.open("nosplit.objectwise.root")

>>> f["T/whatever"].show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsStridedObjects(Model_TWhatev

>>> f["T/whatever"].array()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-7ac5bffa76fd> in <module>
----> 1 f["T/whatever"].array()

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in array(self, interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library)
   2070                         ranges_or_baskets.append((branch, basket_num, range_or_basket))
   2071
-> 2072         _ranges_or_baskets_to_arrays(
   2073             self,
   2074             ranges_or_baskets,

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in _ranges_or_baskets_to_arrays(hasbranches, ranges_or_baskets, branchid_interpretation, entry_start, entry_stop, decompression_executor, interpretation_executor, library, arrays)
   3458
   3459         elif isinstance(obj, tuple) and len(obj) == 3:
-> 3460             uproot.source.futures.delayed_raise(*obj)
   3461
   3462         else:

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/source/futures.py in delayed_raise(exception_class, exception_value, traceback)
     44         exec("raise exception_class, exception_value, traceback")
     45     else:
---> 46         raise exception_value.with_traceback(traceback)
     47
     48

~/Dev/km3pipe/venv/lib/python3.8/site-packages/uproot/behaviors/TBranch.py in basket_to_array(basket)
   3415             )
   3416             if basket.num_entries != len(basket_arrays[basket.basket_num]):
-> 3417                 raise ValueError(
   3418                     """basket {0} in tree/branch {1} has the wrong number of entries """
   3419                     """(expected {2}, obtained {3}) when interpreted as {4}

ValueError: basket 0 in tree/branch /T;1:whatever has the wrong number of entries (expected 1064, obtained 836) when interpreted as AsStridedObjects(Model_TWhatever_v5)
    in file nosplit.objectwise.root

files.zip

jpivarski commented 3 years ago

One suggestive thing is that the split version has a whatever/TObject that couldn't be interpreted. I wonder if that's some kind of header.

>>> tree = uproot.open("split_1.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsGroup(<TBranchElement 'wh...
whatever/TObject     | unknown                  | <UnknownInterpretation 'non...
whatever/a           | double                   | AsDtype('>f8')
whatever/b           | int32_t                  | AsDtype('>i4')

Another hint is that the unsplit version can be interpreted AsObjects but not AsStridedObjects. The auto-determined interpretation is AsStridedObjects:

>>> tree = uproot.open("nosplit.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsStridedObjects(Model_TWha...

We can reproduce that choice explicitly using uproot.interpretation.identify.interpretation_of.

>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {})
AsStridedObjects(Model_TWhatever_v5)

(I'm passing {} as the context because this object isn't deep; it probably doesn't need all the information about how we got to this point in deserialization. Oh, I could have just passed TBranch.context. That would have been better, but this is okay.)

Now let's remove the simplification step that replaces AsObjects with AsStridedObjects.

>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False)
AsObjects(Model_TWhatever)
>>> tree["whatever"].array(uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False))
<Array [{a: 0, b: 0}, ... b: -5000}] type='5001 * TWhatever["a": float64, "b": i...'>

Aha! We can deserialize it! It's slow (there's a noticeable lag with 5001 elements), but it does bracket the error between AsObjects and AsStridedObjects.

The difference between AsObjects and AsStridedObjects is that AsObjects walks through Python loops, element by element, byte by byte, whereas AsStridedObjects casts the buffer as a NumPy structured array and then pulls each field out by field access. Objects whose fields all have fixed widths, even if those widths differ (some bools, some 32-bit integers, some 64-bit floats), can be interpreted by this striding, and we use that to read them much more quickly. Objects with variable-length fields, such as strings or std::vector, can't, and for those we have to fall back to AsObjects. This might become moot when AwkwardForth is introduced (AwkwardForth can deal with variable-length data and might be as fast as striding), but it isn't yet.
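
To make the distinction concrete, here is a minimal sketch of the two reading strategies. It is not uproot's actual code: it assumes a hypothetical 12-byte record holding only a big-endian float64 `a` and a big-endian int32 `b` (no TObject or version headers), and the helper names `read_objectwise` and `read_strided` are made up for illustration.

```python
import struct

import numpy as np

# Hypothetical fixed-width records: (float64 a, int32 b), big-endian, no headers.
records = [(0.0, 0), (10.1, -1), (20.2, -2)]
buffer = b"".join(struct.pack(">di", a, b) for a, b in records)


def read_objectwise(buf):
    """Walk the buffer in a Python loop, one object at a time (AsObjects-like)."""
    size = struct.calcsize(">di")  # 12 bytes: 8 for the double, 4 for the int
    out = []
    pos = 0
    while pos < len(buf):
        out.append(struct.unpack_from(">di", buf, pos))
        pos += size
    return out


def read_strided(buf):
    """Cast the whole buffer as a structured array (AsStridedObjects-like)."""
    dtype = np.dtype([("a", ">f8"), ("b", ">i4")])
    return np.frombuffer(buf, dtype=dtype)


looped = read_objectwise(buffer)
strided = read_strided(buffer)

# Field access on the structured array pulls out one column without a Python loop.
column_a = strided["a"]
column_b = strided["b"]
```

The strided version only works because every record has the same byte length, which is why the number of entries it reports depends entirely on the dtype's itemsize matching the real record size.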

It could be that, because of the TObject header that we don't know how to interpret when split, it is incorrect to simplify this particular AsObjects to AsStridedObjects. In other words, the bug could be in the rules that decide whether to simplify it.

On the other hand, it could be that we can read this by striding, but are currently doing it incorrectly. That would require more research.
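
A cheap sanity check on that second possibility is arithmetic on the numbers in the error message. The sketch below assumes that striding simply divides the basket's byte length by the structured dtype's itemsize. The function `entries_seen` and the byte sizes 22 and 28 are hypothetical: they were chosen only because they reproduce the observed counts and were not read from the file.

```python
from fractions import Fraction

# Numbers taken from the ValueError in the report above.
expected, obtained = 1064, 836

# How much too wide would the assumed record have to be?
ratio = Fraction(obtained, expected)


def entries_seen(n_entries, true_size, assumed_size):
    """Entries obtained when a buffer of n_entries * true_size bytes
    is cast with a record size of assumed_size bytes (floor division)."""
    return (n_entries * true_size) // assumed_size


# Hypothetical sizes in the same 11:14 proportion; not measured from the file.
check = entries_seen(expected, true_size=22, assumed_size=28)
```

The ratio 836/1064 reduces to 11/14, so a strided dtype that is wider than the real records by that factor would be consistent with the reported count. This says nothing about which header bytes are being miscounted; that still needs a look at the actual basket contents.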