Closed: tamasgal closed this issue 4 years ago.
This is just an example of how the dtype could look; note that I am clueless about the buffer part, although I know it's an array of `[("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")]` structs.
If it's a variable-length list of some structure, then you can't use `uproot.asdtype` or `uproot.astable`. It has to be an `uproot.asgenobj`, filling Python classes on demand. The Python class has a `_readinto` method that sets all the internal variables by walking through the fields, deserializing them one at a time. `uproot.interp.objects.STLVector` and friends are examples of that.
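To make that concrete, here is a minimal sketch of such a "fill on demand" class for the hit struct from this issue; the class name and the `_readinto` signature are illustrative only, not uproot's exact internal API.

```python
import struct

class JDAQHit(object):
    """Illustrative sketch: one [pmt (u1), tdc (u4), tot (u1)] hit, big-endian."""

    _format = struct.Struct(">BIB")  # u1, u4, u1 with no padding

    def _readinto(self, data, offset):
        # Walk the fields one at a time: the "slow Python" path described above.
        self.pmt, self.tdc, self.tot = self._format.unpack_from(data, offset)
        return offset + self._format.size
```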
In principle, all non-numerical, not-simply-jagged data could be (and once were) interpreted as `uproot.asgenobj`, meaning that the user got back Python objects and had to wait for the Python interpreter. Some objects, however, consist only of fixed-width fields (no matter how nested, as long as everything is a fixed number of bytes). An important example of that is `TLorentzVector`, which consists of `fE` and a nested `TVector3`, which is `fX`, `fY`, `fZ`. By instead interpreting these with `uproot.asdtype`, we can avoid the slow Python, casting the arrays en masse without having to walk over them.
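As a rough illustration of that en-masse cast (the real on-disk record also includes object headers, which are omitted here), a nested fixed-width numpy dtype can reinterpret a whole buffer of bytes in one call:

```python
import numpy as np

# Sketch only: fixed-width, big-endian TLorentzVector payload, i.e. the nested
# TVector3 (fX, fY, fZ) plus fE. Real baskets also carry headers that a full
# interpretation has to skip or include in the dtype.
vector3 = np.dtype([("fX", ">f8"), ("fY", ">f8"), ("fZ", ">f8")])
lorentz = np.dtype([("fP", vector3), ("fE", ">f8")])

raw = np.zeros(10 * lorentz.itemsize, dtype=np.uint8)  # stand-in for raw basket bytes
vectors = raw.view(lorentz)  # one cast for the whole array, no per-object Python loop
```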
However, if your case really has a variable number of `[("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")]` items per object, there is no choice: it must be walked over. This is why high splitting levels ("columnar data structures") are so valuable.
Thanks for the quick reply. I was already afraid that this is the only way to go. :(
I started to parse the whole binary blob, but it seemed very awkward and it was hard to understand the way ROOT stores the data. So do I understand correctly that I need to parse the complete branch blob by myself, or is there a way to get uproot to do part of the job? Reverse engineering the raw structure is a nightmare, but whom am I telling this 😅
That's correct. That's what "not splitting" means: the entire object's data are serialized sequentially and they have to be deserialized field by field. "Splitting" gives us a nice simplification—usually only the numbers we want (though occasionally with a byte or two extra!)—but "not splitting" is the long haul.
It's made somewhat easier by the fact that ROOT follows certain patterns, and these patterns have been encapsulated in the `Cursor` object. The `Source` is just a bucket of bytes, queried with a `data(start, stop)` method, but `Cursor` queries it with a number of predefined patterns. For instance, `cursor.string(source)` reads a byte for the string size, reads another 4 bytes if that byte was `255`, then reads that many bytes as the string. `cursor.fields(struct.Struct(...))` uses Python's `struct` module to read some fields. (Hint: they're all big-endian.)
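In the same spirit, and without relying on uproot internals, a hand-rolled walk over the raw bytes with Python's `struct` module could look like the following; only the per-hit layout comes from this issue, and the framing around it is an assumption.

```python
import struct

# Per-hit layout from this issue: pmt (u1), tdc (u4), tot (u1), all big-endian.
HIT = struct.Struct(">BIB")

def read_hits(blob, offset, n_hits):
    """Deserialize n_hits consecutive hits starting at offset; return (hits, new offset)."""
    hits = []
    for _ in range(n_hits):
        hits.append(HIT.unpack_from(blob, offset))
        offset += HIT.size
    return hits, offset
```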
Also, the `uproot.interp.auto.interpret` function is supposed to figure out what that sequence of `Cursor` calls should be, based on the branch's streamer. Is that not working? If so, where does it fail?
Yeah, I see. Thanks so far!
Well, to be honest, I have not used the `Cursor` nor `auto.interpret` yet. I just tried to parse the `.tostring()` binary blob manually to find a way to read it with awkward and lazyarrays, using the dtype structure I find.
I'll read about the `Cursor` and the interpret function; if I understood correctly, that's the way to get uproot on my side for this task 😉
`auto.interpret` is what gets called to fill in the default `interpretation` of each branch. If the `interpretation` is `None`, then it failed at some point. It might be failing just before having solved most of the problem for you.
This function is a growing list of rules learned from examples, so there will always be cases it doesn't handle.
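One way to see where it gives up is to run the automatic interpretation over every branch yourself. This is a hedged sketch against the uproot 3 API of that era; the file name and tree path are placeholders, not taken from the attached files.

```python
import uproot
from uproot.interp.auto import interpret  # the function named above

# Placeholders: substitute the actual file and tree from the attached files.
tree = uproot.open("timeslices.root")["KM3NET_TIMESLICE"]

for branch in tree.allvalues():            # assumes uproot 3's recursive branch iteration
    print(branch.name, interpret(branch))  # None means the automatic rules gave up here
```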
I see, I will debug that by hand and see what happens. Pretty sure it has choked on `buffer`, which has variable length...
Is this a resolved thing? Should it be something I look into during the Uproot4 development? Thanks!
I'll continue investigating this with the uproot4 tools; for now, the problem is solved by re-splitting the files ;)
I have trouble with one specific data stream in our ROOT files, which is the KM3NET_TIMESLICE format. This particular data has a specific structure and is written to different branches (where the data is filtered differently, but the structure is the same) with differing "TBranch split levels". I can read the data which has split level 4 with uproot easily, but two other branches have split levels 2 and 0, which I could not parse yet. The biggest (and probably the only) problem is the `buffer` field, which has its own "sub-dtype".

To be specific, `KM3NET_TIMESLICE_L0` has split level 2 and the other one, `KM3NET_TIMESLICE_L2`, has split level 4.

In case of the split level 4 branch (`KM3NET_TIMESLICE_L2`) I can parse the `buffer` (which is a list of hits). Here is the output of `show()`. This works fine because I have direct access to the `buffer` via `superframes[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer']`.
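For reference, a hedged sketch of that split-level-4 access (uproot 3 era); the file name and tree path are placeholders, and only the branch key below appears in this issue.

```python
import uproot

# Placeholders for file name and tree path; with a high split level the hit buffer
# is its own branch and comes back as a jagged array.
tree = uproot.open("km3net_file.root")["KM3NET_TIMESLICE_L2"]
superframes = tree.arrays()  # dict keyed by bytes branch names (uproot 3 default)
hits = superframes[b"vector<KM3NETDAQ::JDAQSuperFrame>.buffer"]
```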
For lower split levels, however, I cannot access it, only the whole blob in one go. Here is the output of `show()`. I tried to parse it, but I need to define a nested `dtype`, which, as far as I understood, is not possible yet. This is just an example of how the `dtype` could look; note that I am clueless about the `buffer` part, although I know it's an array of `[("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")]` structs.
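For illustration only (this is not the example from the original report), here is a sketch of what such a nested numpy dtype would have to look like, and why it falls short: numpy structured dtypes only allow fixed-shape subarrays, while each object carries a variable number of hits.

```python
import numpy as np

# Per-hit layout, big-endian on disk.
hit = np.dtype([("pmt", "u1"), ("tdc", ">u4"), ("tot", "u1")])

# Hypothetical surrounding record; every field except "buffer" is a placeholder.
# The fixed subarray length (10 here) is exactly what does not hold in the real data,
# because numpy has no variable-length field type.
frame = np.dtype([
    ("frame_index", ">u4"),   # placeholder header field
    ("n_hits", ">u4"),        # placeholder hit count
    ("buffer", hit, (10,)),   # fixed length only
])
```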
Here are two files, one which contains an L0 timeslice branch with split level 2 and one with an L1 timeslice and split level 4:
files.zip
Many thanks for any input in advance!