Open tamasgal opened 3 years ago
One suggestive thing is that the split version has a whatever/TObject
that couldn't be interpreted. I wonder if that's some kind of header.
>>> tree = uproot.open("split_1.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsGroup(<TBranchElement 'wh...
whatever/TObject     | unknown                  | <UnknownInterpretation 'non...
whatever/a           | double                   | AsDtype('>f8')
whatever/b           | int32_t                  | AsDtype('>i4')
Another hint is that the unsplit version can be interpreted AsObjects but not AsStridedObjects. The auto-determined interpretation is AsStridedObjects:
>>> tree = uproot.open("nosplit.objectwise.root:T")
>>> tree.show()
name                 | typename                 | interpretation
---------------------+--------------------------+-------------------------------
whatever             | TWhatever                | AsStridedObjects(Model_TWha...
We can get the same result explicitly using uproot.interpretation.identify.interpretation_of:
>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {})
AsStridedObjects(Model_TWhatever_v5)
(I'm passing {} as the context because this object isn't deep; it probably doesn't need all the information about how we got to this point in deserialization. Oh, I could have just passed TBranch.context. That would have been better, but this is okay.)
Now let's remove the simplification step that replaces AsObjects with AsStridedObjects.
>>> uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False)
AsObjects(Model_TWhatever)
>>> tree["whatever"].array(uproot.interpretation.identify.interpretation_of(tree["whatever"], {}, simplify=False))
<Array [{a: 0, b: 0}, ... b: -5000}] type='5001 * TWhatever["a": float64, "b": i...'>
Aha! We can deserialize it! It's slow (there's a noticeable lag with 5001 elements), but it does bracket the error between AsObjects and AsStridedObjects.
The difference between AsObjects and AsStridedObjects is that AsObjects walks through Python loops, element by element, byte by byte, whereas AsStridedObjects casts the buffer as a NumPy structured array and then pulls each field out using field access. Objects whose fields have different but fixed sizes, like some bools, some 32-bit integers, and some 64-bit floats, can be interpreted by this striding, and we use that to read them much more quickly. Objects with variable-length fields, such as strings or std::vector, can't, and we have to fall back to AsObjects. This might become moot when AwkwardForth is introduced (AwkwardForth can deal with variable-length data and might be as fast as striding), but it isn't yet.
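To make the distinction concrete, here is a minimal sketch of the two reading strategies in plain NumPy. It does not use uproot's actual classes: the record layout (a big-endian float64 `a` followed by a big-endian int32 `b`, matching the dtypes shown by tree.show() above) and the helper names `make_buffer`, `read_objectwise`, and `read_strided` are assumptions made up for illustration.

```python
import struct

import numpy as np

# Hypothetical fixed-size record: big-endian float64 "a", then big-endian int32 "b".
# With ">" there is no alignment padding, so each record is 12 bytes.
RECORD = struct.Struct(">di")


def make_buffer(n):
    """Serialize n records (a=i, b=-i) back to back into one bytes object."""
    return b"".join(RECORD.pack(float(i), -i) for i in range(n))


def read_objectwise(buf):
    """Python loop, one record at a time (the slow, AsObjects-like path)."""
    out = []
    offset = 0
    while offset < len(buf):
        a, b = RECORD.unpack_from(buf, offset)
        out.append({"a": a, "b": b})
        offset += RECORD.size
    return out


def read_strided(buf):
    """View the whole buffer as a structured array (the AsStridedObjects-like path)."""
    dtype = np.dtype([("a", ">f8"), ("b", ">i4")])
    records = np.frombuffer(buf, dtype=dtype)
    # Field access returns strided views; no Python-level loop is involved.
    return records["a"], records["b"]


buf = make_buffer(5)
slow = read_objectwise(buf)
a, b = read_strided(buf)
print(slow[-1], a.tolist(), b.tolist())
```

Both paths return the same values here. The strided version only works because every record has the same byte length, which is exactly what variable-length fields break.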
It could be that, because of the TObject header that we don't know how to interpret when split, it is incorrect to simplify this particular AsObjects to AsStridedObjects. In other words, the bug could be in the rules that decide whether to simplify it.
On the other hand, it could be that we can read this by striding, but are currently doing it incorrectly. That would require more research.
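As a purely illustrative sketch of how a striding mistake could surface as a wrong entry count: if each serialized object carried a header that the structured dtype did not account for, the buffer would be cut into the wrong number of records. The 10-byte header size, the zero-filled header contents, and all names below are assumptions for the example, not a statement about what uproot or this file actually does.

```python
import struct

import numpy as np

HEADER_SIZE = 10  # hypothetical per-object header length in bytes (an assumption)

# Hypothetical record: opaque header, big-endian float64 "a", big-endian int32 "b".
WITH_HEADER = struct.Struct(f">{HEADER_SIZE}sdi")  # 10 + 8 + 4 = 22 bytes


def make_buffer_with_header(n):
    """Serialize n records, each prefixed by a zero-filled header."""
    header = b"\x00" * HEADER_SIZE
    return b"".join(WITH_HEADER.pack(header, float(i), -i) for i in range(n))


# A dtype that ignores the header: 12 bytes per record.
naive_dtype = np.dtype([("a", ">f8"), ("b", ">i4")])

# A dtype that skips over the header bytes: 22 bytes per record.
header_dtype = np.dtype([("header", f"V{HEADER_SIZE}"), ("a", ">f8"), ("b", ">i4")])

buf = make_buffer_with_header(6)  # 6 records, 132 bytes in total
naive = np.frombuffer(buf, dtype=naive_dtype)
correct = np.frombuffer(buf, dtype=header_dtype)

# The naive dtype reports the wrong number of records; the other recovers all six.
print(len(naive), len(correct))
print(correct["a"].tolist(), correct["b"].tolist())
```

If something like this were the cause, the fix would be in how the strided dtype is built rather than in the rule that decides whether to simplify, but that still needs the research mentioned above.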
I was not sure whether I should post this in https://github.com/scikit-hep/uproot4/issues/38, but at least it's not directly related to memberwise splitting, so I guess a new issue is fine ;)
I decided to spend some time today on the memberwise mystery and discovered that uproot chokes on split level 0 with a simple class.
A dummy project where I started to explore memberwise splitting can be used to reproduce the ROOT files; it also includes the class definition and tree configuration: https://github.com/tamasgal/root_splitting
So, back on track: I attached two files containing the same data, one created with split level 0 and the other with split level 1. The latter works fine, but the former causes problems due to some misinterpretation of the number of entries. Although split level 0 is very uncommon, maybe this sheds light on some not-yet-understood aspects of the serialisation. I have not looked closer, but I wanted to dump my findings...
files.zip