scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

TBranch split level #433

Closed: tamasgal closed this issue 4 years ago

tamasgal commented 4 years ago

I have trouble with one specific data stream in our ROOT files, which is the KM3NET_TIMESLICE format. This particular data has a specific structure and is written in different branches (where the data is filtered differently, but the structure is the same) with differing "TBranch split levels".

I can easily read the data with split level 4 using uproot, but two other branches have split levels 2 and 0, which I have not been able to parse yet.

The biggest (and probably the only) problem is the buffer field, which has its own "subdtype".

To be specific, the KM3NET_TIMESLICE_L0 has split level 2 and the other one KM3NET_TIMESLICE_L2 split level 4:

[screenshot: timeslice_branches]

In case of the split level 4 branch (KM3NET_TIMESLICE_L2) I can parse the buffer (which is a list of hits).

Here is the output of show():

>>> f['KM3NET_TIMESLICE_L2']['vector<KM3NETDAQ::JDAQSuperFrame>'].show()
vector<KM3NETDAQ::JDAQSuperFrame>
                           TStreamerSTL               asdtype('>i4')
vector<KM3NETDAQ::JDAQSuperFrame>.length
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.type
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.fUniqueID
                           TStreamerBasicType         asjagged(asdtype('>u4'))
vector<KM3NETDAQ::JDAQSuperFrame>.fBits
                           TStreamerBasicType         asjagged(asdtype('>u4'))
vector<KM3NETDAQ::JDAQSuperFrame>.detector_id
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.run
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.frame_index
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.timeslice_start.UTC_seconds
                           TStreamerBasicType         asjagged(asdtype('>u4'))
vector<KM3NETDAQ::JDAQSuperFrame>.timeslice_start.UTC_16nanosecondcycles
                           TStreamerBasicType         asjagged(asdtype('>u4'))
vector<KM3NETDAQ::JDAQSuperFrame>.id
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.daq
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.status
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.fifo
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.status_3
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.status_4
                           TStreamerBasicType         asjagged(asdtype('>i4'))
vector<KM3NETDAQ::JDAQSuperFrame>.numberOfHits
                           TStreamerBasicType         asjagged(asdtype('>u4'))
vector<KM3NETDAQ::JDAQSuperFrame>.buffer
                           TStreamerLoop              None
f = uproot.open("file.root")
tree = f[b'KM3NET_TIMESLICE_L2'][b'KM3NETDAQ::JDAQTimeslice']
superframes = tree[b'vector<KM3NETDAQ::JDAQSuperFrame>']
hits_buffer = superframes[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer'].lazyarray(
    uproot.asjagged(
        uproot.astable(
            uproot.asdtype([("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")])),
        skipbytes=6),
    basketcache=uproot.cache.ThreadSafeArrayCache(23 * 1024**2))

This works fine because I have direct access to the buffer via superframes[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer'].

For lower split levels, however, I cannot access it directly; I can only get the whole blob in one go. Here is the output of show():

>>> f['KM3NET_TIMESLICE_L0']['vector<KM3NETDAQ::JDAQSuperFrame>'].show()
vector<KM3NETDAQ::JDAQSuperFrame>
                           TStreamerSTL               asjagged(astable(asdtype("[('length', '>i4'), ('type', '>i4'), (' fBits', '>u8'), (' fUniqueID', '>u8'), ('detector_id', '>i4'), ('run', '>i4'), ('frame_index', '>i4'), ('UTC_seconds', '>u4'), ('UTC_16nanosecondcycles', '>u4'), ('id', '>i4'), ('daq', '>i4'), ('status', '>i4'), ('fifo', '>i4'), ('status_3', '>i4'), ('status_4', '>i4'), ('numberOfHits', '>u4')]", "[('length', '<i4'), ('type', '<i4'), ('detector_id', '<i4'), ('run', '<i4'), ('frame_index', '<i4'), ('UTC_seconds', '<u4'), ('UTC_16nanosecondcycles', '<u4'), ('id', '<i4'), ('daq', '<i4'), ('status', '<i4'), ('fifo', '<i4'), ('status_3', '<i4'), ('status_4', '<i4'), ('numberOfHits', '<u4')]")), 10)

I tried to parse it, but I would need to define a nested dtype, which, as far as I understand, is not possible yet. Below is just an example of how the dtype could look; note that I am clueless about the buffer part, although I know it is an array of [("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")] structs.

dtype = uproot.asjagged(
    uproot.astable(
        uproot.asdtype(
            [('length', '>i4'),
             ('type', '>i4'),
             ('fBits', '>u8'),
             ('fUniqueID', '>u8'),
             ('detector_id', '>i4'),
             ('run', '>i4'),
             ('frame_index', '>i4'),
             ('UTC_seconds', '>u4'),
             ('UTC_16nanosecondcycles', '>u4'),
             ('id', '>i4'),
             ('daq', '>i4'),
             ('status', '>i4'),
             ('fifo', '>i4'),
             ('status_3', '>i4'),
             ('status_4', '>i4'),
             ('numberOfHits', '>u4'),
             ('buffer', '...')   # ??? variable-length, unclear how to express
            ])),
    skipbytes=10)

Here are two files, one which contains an L0 timeslice branch with split level 2 and one with L1 timeslice and split level 4:

files.zip

Many thanks for any input in advance!

jpivarski commented 4 years ago

> This is just an example of how the dtype could look; note that I am clueless about the buffer part, although I know it is an array of [("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")] structs.

If it's a variable-length list of some structure, then you can't use uproot.asdtype or uproot.astable. It has to be an uproot.asgenobj, filling Python classes on demand. The Python class has a _readinto method that sets all the internal variables by walking through the fields, deserializing them one at a time. uproot.interp.objects.STLVector and friends are examples of that.

In principle, all non-numerical, not-simply-jagged data could be (and once were) interpreted as uproot.asgenobj, meaning that the user got back Python objects and had to wait for the Python interpreter. Some objects, however, consist only of fixed-width fields (no matter how nested, as long as everything's a fixed number of bytes). An important example of that is TLorentzVector, which consists of fE and a nested TVector3, which is fX, fY, fZ. By instead interpreting these with uproot.asdtype, we can avoid the slow Python, casting the arrays en masse, without having to walk over them.
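The en-masse casting can be illustrated outside of ROOT: with a structured NumPy dtype, an entire byte buffer is reinterpreted in one call, with no per-object Python loop. This is only a sketch with made-up data; the field names merely mimic a split TLorentzVector, and this is not uproot's actual code path:

```python
import numpy as np

# Hypothetical fixed-width layout resembling a TLorentzVector:
# fE (float64) followed by a nested TVector3 (fX, fY, fZ).
# Because every field has a fixed byte width, the whole buffer can be
# reinterpreted at once with a structured dtype.
dtype = np.dtype([("fE", ">f8"), ("fX", ">f8"), ("fY", ">f8"), ("fZ", ">f8")])

# Fake big-endian buffer of two records, for demonstration only.
raw = np.array([(10.0, 1.0, 2.0, 3.0),
                (20.0, 4.0, 5.0, 6.0)], dtype=dtype).tobytes()

# One cast for all records -- no walking over objects in Python.
records = np.frombuffer(raw, dtype=dtype)
print(records["fE"])   # [10. 20.]
```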

However, if your case really has a variable number of [("pmt", "u1"), ("tdc", "u4"), ("tot", "u1")] items per object, there is no choice: it must be walked over. This is why high splitting levels ("columnar data structures") are so valuable.
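By contrast, a variable-length list per object forces a sequential walk. Here is a minimal stand-alone sketch in plain Python, using an invented framing (a 4-byte big-endian count followed by the 6-byte hit structs mentioned above) rather than ROOT's actual serialization:

```python
import struct

# Each hit is (pmt: u1, tdc: u4, tot: u1), big-endian, 6 bytes total.
hit = struct.Struct(">BIB")

def walk(blob):
    """Deserialize variable-length hit lists; each object must be
    visited in turn because its size is only known after reading it."""
    pos, objects = 0, []
    while pos < len(blob):
        (n,) = struct.unpack_from(">i", blob, pos)   # number of hits
        pos += 4
        hits = [hit.unpack_from(blob, pos + i * hit.size) for i in range(n)]
        pos += n * hit.size
        objects.append(hits)
    return objects

# Two objects: one with 2 hits, one with 1 hit.
blob = (struct.pack(">i", 2) + hit.pack(1, 100, 5) + hit.pack(2, 200, 6)
        + struct.pack(">i", 1) + hit.pack(3, 300, 7))
print(walk(blob))  # [[(1, 100, 5), (2, 200, 6)], [(3, 300, 7)]]
```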

tamasgal commented 4 years ago

Thanks for the quick reply. I already feared that this is the only way to go. :(

I started to parse the whole binary blob, but it seemed very awkward, and the way ROOT stores these is hard to understand. So do I understand correctly that I need to parse the complete branch blob myself, or is there a way to get uproot to do part of the job? Reverse engineering the raw structure is a nightmare, but whom am I telling this 😅


jpivarski commented 4 years ago

That's correct. That's what "not splitting" means: the entire object's data are serialized sequentially and they have to be deserialized field by field. "Splitting" gives us a nice simplification—usually only the numbers we want (though occasionally with a byte or two extra!)—but "not splitting" is the long haul.

It's made somewhat easier by the fact that ROOT follows certain patterns, and these patterns have been encapsulated in the Cursor object. The Source is just a bucket of bytes, queried with a data(start, stop) method, but Cursor queries it with a number of patterns predefined. For instance, cursor.string(source) reads a byte for the string size, reads another 4 bytes if that was 255, then reads the number of bytes into a string. cursor.fields(struct.Struct(...)) uses Python's struct module to read some fields. (Hint: they're all big-endian.)
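The string pattern described above can be sketched in plain Python; this illustrates the convention only and is not Cursor's actual implementation:

```python
import struct

def read_root_string(data, pos):
    """One size byte; if it is 255, the real size follows as a
    big-endian 4-byte integer, then that many bytes of payload."""
    size = data[pos]
    pos += 1
    if size == 255:
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
    return data[pos:pos + size].decode("ascii"), pos + size

short = bytes([5]) + b"hello"                          # short form
long_ = bytes([255]) + struct.pack(">I", 3) + b"abc"   # escaped form
print(read_root_string(short, 0))  # ('hello', 6)
print(read_root_string(long_, 0))  # ('abc', 8)
```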

Also, the uproot.interp.auto.interpret function is supposed to figure out what that sequence of Cursor calls is supposed to be, based on the branch's streamer. Is that not working? If so, where does it fail?

tamasgal commented 4 years ago

Yeah, I see. Thanks so far!

Well, to be honest, I have not used the Cursor or auto.interpret yet. I just tried to parse the .tostring() binary blob manually to find a way to read it with awkward and lazyarrays using the dtype structure I find. I'll read about the Cursor and the interpret function; if I understood correctly, that's the way to get uproot on my side for this task 😉

jpivarski commented 4 years ago

auto.interpret is what gets called to fill in the default interpretation of each branch. If the interpretation is None, then it failed at some point. It might be failing just before having solved most of the problem for you.

This function is a growing list of rules learned from examples, so there will always be cases it doesn't handle.

tamasgal commented 4 years ago

I see, I will debug that by hand and see what happens. Pretty sure it choked on buffer, which has variable length...

jpivarski commented 4 years ago

Is this a resolved thing? Should it be something I look into during the Uproot4 development? Thanks!

tamasgal commented 4 years ago

I'll continue investigating this with the uproot4 tools; for now, the problem is solved by re-splitting the files ;)