scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
314 stars 67 forks source link

Cracking the TSreamerLoop #408

Closed tamasgal closed 5 years ago

tamasgal commented 5 years ago

This is a follow-up on https://github.com/scikit-hep/uproot/issues/407

The main problem is a branch which cannot be parsed automatically and has a streamer called TStreamerLoop.

I managed to access all the data I need, but it's far from optimal from the users point of few: I need uproot.asdebug and additional information about the underlying data structure which needs to be hardcoded.

Here is a complete script which is running uproot v3.10.12 and parses a ROOT file from https://github.com/KM3NeT/km3io/blob/master/tests/samples/jpp_v12.0.0.root

A bit of background information: the data of interest is in the vector<KM3NETDAQ::JDAQSuperFrame>.buffer branch and consists of a ragged array of JDAQHits which is a simple struct:

    typedef  unsigned char             JPMT_t;    //!< PMT channel in FPGA
    typedef  unsigned int              JTDC_t;    //!< leading edge [ns]
    typedef  unsigned char             JTOT_t;    //!< time over threshold [ns]

The actual script with explanations:

import uproot

uproot.__version__   # 3.10.12

f = uproot.open("jpp_v12.0.0.root")

tree = f["KM3NET_TIMESLICE_L1"]

tree.keys() # [b'km3net_timeslice_L1']

frames = tree[b'km3net_timeslice_L1'][b'KM3NETDAQ::JDAQTimeslice'][b'vector<KM3NETDAQ::JDAQSuperFrame>']

frames.show()
# vector<KM3NETDAQ::JDAQSuperFrame>
#                           TStreamerSTL               asdtype('>i4')
#vector<KM3NETDAQ::JDAQSuperFrame>.length
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.type
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.fUniqueID
#                           TStreamerBasicType         asjagged(asdtype('>u4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.fBits
#                           TStreamerBasicType         asjagged(asdtype('>u4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.detector_id
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.run
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.frame_index
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.timeslice_start.UTC_seconds
#                           TStreamerBasicType         asjagged(asdtype('>u4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.timeslice_start.UTC_16nanosecondcycles
#                           TStreamerBasicType         asjagged(asdtype('>u4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.id
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.daq
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.status
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.fifo
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.status_3
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.status_4
#                           TStreamerBasicType         asjagged(asdtype('>i4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.numberOfHits
#                           TStreamerBasicType         asjagged(asdtype('>u4'))
#vector<KM3NETDAQ::JDAQSuperFrame>.buffer
                           TStreamerLoop              None

As seen, all branches are parsed correctly but the the .buffer, which has a TStreamerLoop streamer.

frames[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer'].show()
# vector<KM3NETDAQ::JDAQSuperFrame>.buffer
#                           TStreamerLoop              None

To explore the buffer:

```python
buffer = frames[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer']

buffer.array(uproot.asdebug)
# <JaggedArray [[64 4 130 ... 237 5 30] [64 4 125 ... 245 5 17] [64 4 128 ... 245 5 26]] at 0x0001139f1610>

# The very first byte of 64 is kind of a ROOT header as you (Jim) mentioned in https://github.com/scikit-hep/uproot/issues/407?

# The file contains 3 super frames (timeslices), and each of them consists of 69 frames
# the 69 however is not fixed, it can vary from slice to slice! Each frame
# consists of hits (vector of JDAQHit<6bytes>) and corresponds to an optical module
# which registers those. It is already a good sign that the jagged arrays has 3 sub-arrays, since
# we have 3 super frames!

# The first timeslice which contains exactly 49243 hits from various different
# optical modules
ts = buffer.array(uproot.asdebug)[0]

ts
# array([ 64,   4, 130, ..., 237,   5,  30], dtype=uint8)

len(ts)
# 295464

295464 bytes - 6 bytes offset (don't yet understand why 6, but it works) give 295458 bytes of raw JDAQHit data, which makes sense, since we know that we have 49243 hits and 49243*6 == 295458

Now to parse the data, I extract a flat array of hits and manually slice it into a dictionary

flat_array_of_hits = frames[b'vector<KM3NETDAQ::JDAQSuperFrame>.buffer'].array(
                uproot.asjagged(uproot.astable(
                    uproot.asdtype([("pmt", "u1"),
                                    ("tdc", "u4"),
                                    ("tot", "u1")])),
                                skipbytes=6))

ts_hits = flat_array_of_hits[0]
len(ts_hits)
# 49243

ts_hits["tot"]
# array([27, 24,  5, ..., 44, 14, 30], dtype=uint8)

So the problem is now, that it's a flat array of hits, however, they should be grouped the module ID .numberOfHits and .id tells us more:

n_hits = frames[b'vector<KM3NETDAQ::JDAQSuperFrame>.numberOfHits'].array()[0]
n_hits
# array([ 984, 1004,  251,  729,  787,  243,  223,  891,  693, 1138,  524,
#        1089,  734,  819,  927,  849,  281,  711,  913,  714, 1046,  681,
#         515,  708,  762,  570,  433, 1106,  987,   94,  798,  312,  776,
#         224,  612,  443,  803,  774,  719,  792,  625,  229,  454,  634,
#         533, 1103,  693,  776,  815,  442,  671,  939,  876,  803,  643,
#        1058,  501, 1126, 1028,  442,  919,  670,  339, 1057,  800, 1010,
#         863,  809,  726], dtype=uint32)

module_ids = frames[b'vector<KM3NETDAQ::JDAQSuperFrame>.id'].array()[0]
module_ids
# array([806451572, 806455814, 806465101, 806483369, 806487219, 806487226,
#        806487231, 808432835, 808435278, 808447180, 808447186, 808451904,
#        808451907, 808469129, 808472260, 808472265, 808488895, 808488990,
#        808489014, 808489117, 808493910, 808946818, 808949744, 808951460,
#        808956908, 808959411, 808961448, 808961480, 808961504, 808961655,
#        808964815, 808964852, 808964883, 808964908, 808969848, 808969857,
#        808972593, 808972598, 808972698, 808974758, 808974773, 808974811,
#        808974972, 808976377, 808979567, 808979721, 808979729, 808981510,
#        808981523, 808981672, 808981812, 808981864, 808982005, 808982018,
#        808982041, 808982066, 808982077, 808982547, 808984711, 808996773,
#        808997793, 809006037, 809007627, 809503416, 809521500, 809524432,
#        809526097, 809544058, 809544061], dtype=int32)

Now the fun part: the first set of hits belong to module (.id) == 806451572, and consists of 984 hits

Here is an inefficient loop to fill the hits:

hits = {}
idx = 0
for module_id, n_hits in zip(module_ids, n_hits):
    hits[module_id] = ts_hits[idx:idx+n_hits]
    idx += n_hits

# quick check against the PyROOT framework:
# [ins] In [15]: ts_hits[ts_hits.dom_id == 808972593]
# Out[15]: TimesliceHits <class 'km3pipe.dataclasses.Table'> (rows: 803)

len(hits[808972593])
# 803, fine!

# another check to see if the data is correct, expecting:
# [ins] In [16]: ts_hits[ts_hits.dom_id ==808972593].tot[:10]
# Out[16]: array([25, 31, 43, 21, 25, 24, 31, 25, 30, 23], dtype=uint8)
hits[808972593]["tot"][:10]
# array([25, 31, 43, 21, 25, 24, 31, 25, 30, 23], dtype=uint8)

Perfect! But a bit "whacky" and inefficient ;) I am not so familiar with [jr]agged arrays yet, but I am sure one can somehow pass this this structure to the flat array and get it in one shot or something like this? The module IDs are then just encoded by the index and one need another lookup in .id or so...

jpivarski commented 5 years ago

Discussed offline—the conclusion was that the above is a good way of dealing with complex C++ hierarchies that can't be easily (or can't be?) derived from streamers. We didn't find, for instance, anything in TStreamerLoop that indicated the data should have "pmt", "tdc", and "tot" fields.

So not wacky an inefficient! Cheers!

tamasgal commented 5 years ago

Thanks Jim!

Btw. during the live terminal session you could actually retrieve that superframes has frames and frames have numberOfHits JDAQHits. So I think that maybe providing some kind of message that "JDAQHits structure is required to parse ..." could be helpful in future. But that's just a comment if we pick this issue up again ;)