scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
314 stars 67 forks source link

Dimensional issue of subclassed object in nested branches #465

Closed tamasgal closed 4 years ago

tamasgal commented 4 years ago

Sorry for the weird issue title, we can change it later, I could not find a better one 😉

I need some help parsing a fairly simple data structure. It derives from a class which has:

struct AAObject : public TObject
{
  std::vector<double>      usr;              ///< user data
  std::vector<std::string> usr_names;        ///< user keys

These two vectors are used to store arbitrary data, so that e.g. usr_names = ["bx", "by", "ichan", "cc"] and the corresponding values are at the specific indices. So if you need to look up the value for "by", you look up the index of it and then access usr[idx].

So far so good, it works for "one dimensional" (flat) branches. Here, the event of class Evt derives from that AAObject and the the branch Evt simply contains a flat vector of Evt instances (3 of them) and each event contains 17 "usr entries":

[ins] In [1]: import uproot

[ins] In [2]: f = uproot.open("usr-flat.root")

[ins] In [3]: f['E']['Evt']['usr_names'].array()
Out[3]: <ObjectArray [[b'RecoQuality', b'RecoNDF', b'CoC', b'ToT', b'ChargeAbove', b'ChargeBelow', b'ChargeRatio', b'DeltaPosZ', b'FirstPartPosZ', b'LastPartPosZ', b'NSnapHits', b'NTrigHits', b'NTrigDOMs', b'NTrigLines', b'NSpeedVetoHits', b'NGeometryVetoHits', b'ClassficationScore'] [b'RecoQuality', b'RecoNDF', b'CoC', b'ToT', b'ChargeAbove', b'ChargeBelow', b'ChargeRatio', b'DeltaPosZ', b'FirstPartPosZ', b'LastPartPosZ', b'NSnapHits', b'NTrigHits', b'NTrigDOMs', b'NTrigLines', b'NSpeedVetoHits', b'NGeometryVetoHits', b'ClassficationScore'] [b'RecoQuality', b'RecoNDF', b'CoC', b'ToT', b'ChargeAbove', b'ChargeBelow', b'ChargeRatio', b'DeltaPosZ', b'FirstPartPosZ', b'LastPartPosZ', b'NSnapHits', b'NTrigHits', b'NTrigDOMs', b'NTrigLines', b'NSpeedVetoHits', b'NGeometryVetoHits', b'ClassficationScore']] at 0x7f1b8fa2a880>

[ins] In [4]: f['E']['Evt']['usr_names'].array()[0]
Out[4]:
[b'RecoQuality',
 b'RecoNDF',
 b'CoC',
 b'ToT',
 b'ChargeAbove',
 b'ChargeBelow',
 b'ChargeRatio',
 b'DeltaPosZ',
 b'FirstPartPosZ',
 b'LastPartPosZ',
 b'NSnapHits',
 b'NTrigHits',
 b'NTrigDOMs',
 b'NTrigLines',
 b'NSpeedVetoHits',
 b'NGeometryVetoHits',
 b'ClassficationScore']

[ins] In [5]: f['E']['Evt']['usr'].array()[0]
Out[5]:
array([8.54595724e+01, 3.70000000e+01, 1.18630282e+02, 8.25000000e+02,
       1.76000000e+02, 6.49000000e+02, 2.13333333e-01, 3.75196777e+01,
       1.35294997e+02, 9.77753193e+01, 5.10000000e+01, 3.00000000e+01,
       7.00000000e+00, 6.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.68633822e-01])

[ins] In [6]: len(f['E']['Evt'])
Out[6]: 3

The problem appears with classes which have instances in nested branches. This means that for example the Trk class, which also derives from AAObject and is part of the Evt branch. Each Evt entry has a variable length of Trks, as seen here (just the relevant parts):

struct Evt: public AAObject
{
...
  std::vector<Hit> hits;        ///< list of hits
  std::vector<Trk> trks;        ///< list of reconstructed tracks (can be several because of prefits,showers, etc).   
...

The Trk class itself is also quite straight forward and consists of some attributes:

struct Trk: public AAObject
{
  int    id;                          ///< track identifier
  Vec    pos;                         ///< postion of the track at time t
  Vec    dir;                         ///< track direction
  double t;                           ///< track time (when the particle is at pos )
...

The file usr-nested contains a few events and every event contains multiple Trk instances. Only the first two Trk entries should have some entries in the usr* attributes, the first one by, bx, ichan and cc, the second one only energy_lost_in_can. The structure of the arrays however I get back from uproot are all one dimensional per event. It seems that only the first entry is extracted and also the length of the arrays seems a bit "random".

This is what I get:

[ins] In [34]: import uproot

[ins] In [35]: f = uproot.open("usr-nested.root")

[ins] In [36]: f['E']['Evt']['mc_trks']['mc_trks.usr_names'].array()
Out[36]: <ObjectArray [[b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] ... [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc'] [b'bx', b'by', b'ichan', b'cc']] at 0x7ff1e6587040>

[ins] In [37]: f['E']['Evt']['mc_trks']['mc_trks.usr_names'].array()[0]
Out[37]: [b'bx', b'by', b'ichan', b'cc']

[ins] In [38]: f['E']['Evt']['mc_trks']['mc_trks.usr'].array()[0]
Out[38]:
array([ 4.86920000e-002,  5.88460000e-002,  3.00000000e+000,
        2.00000000e+000,  2.65507278e-314, -9.07373616e+207,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000])

[ins] In [39]: f['E']['Evt']['mc_trks']['mc_trks.pos.x'].array()[0]
Out[39]:
array([32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263,
       32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263, 32.263,
       32.263, 32.263, 32.263, 32.263, 32.263])

[ins] In [40]: f['E']['Evt']['mc_trks']['mc_trks.pos.x'].array()[0].shape
Out[40]: (21,)

What I expect, is that the following line returns a nested list, where the first nested list has a length of 4, the second 1 and all others are empty. I however get 15 single entries and the first 4 entries correspond to the first track (and the values are OK):

[ins] In [38]: f['E']['Evt']['mc_trks']['mc_trks.usr'].array()[0]
Out[38]:
array([ 4.86920000e-002,  5.88460000e-002,  3.00000000e+000,
        2.00000000e+000,  2.65507278e-314, -9.07373616e+207,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
        0.00000000e+000,  0.00000000e+000,  0.00000000e+000])

The array elements apart from the first 4 values seem to be random memory bits.

I tried to figure out why the data is parsed incorrectly but I failed so far. I also did not found the word energy_lost_in_can in the usr_fields (converted to string etc.), so I guess it is lost somewhere in the low level parsing in uproot.

This is what the ROOT based library spits out for the first event (all 21 tracks, the first one with 4 usr-entries, the second with 1 and every other with no entries):

Trk: id=0 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.669588 -0.539513 0.510467 t=0 E=67.213 pdg-type=14
0    bx :    0.048692
1    by :    0.058846
2    ichan :     3
3    cc :    2
Trk: id=1 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.663003 -0.546697 0.51142 t=0 E=63.347 pdg-type=-13
0    energy_lost_in_can :    63.2413
Trk: id=2 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.19609 -0.948629 -0.248297 t=0 E=1.2184 pdg-type=2212
Trk: id=3 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.126922 -0.514083 0.848298 t=0 E=0.9756 pdg-type=2212
Trk: id=4 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.201009 0.875771 0.438886 t=0 E=1.264 pdg-type=2212
Trk: id=5 pos=Vec:32.263 10.292 117.624 dir=Vec:0.03499 -0.730037 -0.682511 t=0 E=1.1212 pdg-type=2212
Trk: id=6 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.013096 0.280756 0.95969 t=0 E=1.0763 pdg-type=2212
Trk: id=7 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.549502 0.013973 0.835376 t=0 E=1.0252 pdg-type=2212
Trk: id=8 pos=Vec:32.263 10.292 117.624 dir=Vec:0.882835 -0.235994 0.406091 t=0 E=1.0421 pdg-type=2212
Trk: id=9 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.944463 -0.174758 -0.278297 t=0 E=0.99328 pdg-type=2112
Trk: id=10 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.801781 -0.360305 0.476789 t=0 E=1.9418 pdg-type=2112
Trk: id=11 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.488549 -0.831056 0.26583 t=0 E=1.3408 pdg-type=2212
Trk: id=12 pos=Vec:32.263 10.292 117.624 dir=Vec:0.221472 -0.012973 0.97508 t=0 E=1.0457 pdg-type=2112
Trk: id=13 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.7151 0.448516 -0.536158 t=0 E=1.0784 pdg-type=2112
Trk: id=14 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.837188 0.317147 -0.445571 t=0 E=1.0119 pdg-type=2112
Trk: id=15 pos=Vec:32.263 10.292 117.624 dir=Vec:0.960773 0.064413 0.269754 t=0 E=1.054 pdg-type=2112
Trk: id=16 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.878487 0.279885 -0.387201 t=0 E=0.15467 pdg-type=22
Trk: id=17 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.463415 0.821182 -0.333027 t=0 E=0.16669 pdg-type=11
Trk: id=18 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.468878 0.822227 -0.32264 t=0 E=0.085231 pdg-type=-11
Trk: id=19 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.870512 -0.390102 -0.30005 t=0 E=0.25603 pdg-type=22
Trk: id=20 pos=Vec:32.263 10.292 117.624 dir=Vec:-0.531486 -0.838616 0.119353 t=0 E=0.14464 pdg-type=22

Do you have any idea, or is this a known issue with nested vectors?

I attached both files in case you want to have a look.

usr.zip

jpivarski commented 4 years ago

My understanding of your problem is that (a) you expect multidimensional arrays for f['E']['Evt']['mc_trks']['mc_trks.usr'] in each event but get a one-dimensional array in each event and (b) the lengths of these arrays are too long.

On (a), I would expect a one-dimensional array in each event. The Evt struct contains a single std::vector<Trk> trks and that Trk is flat: just an int, a Vec (which gets split into .x, .y, .z branches) and a double. I only count one dimension there—maybe I've misunderstood you?

On (b), I can see the problem: the extra values are physically in the ROOT file, but the length of your std::vector is specified to be shorter than what has been written. You can see it with the uproot.asdebug interpretation:

>>> f['E']['Evt']['mc_trks']['mc_trks.usr'].array(uproot.asdebug)[0]
array([ 64,   0,   0, 126,   0,   9,   0,   0,   0,   4,  63, 168, 238,
        40, 103,  39,  86, 134,  63, 174,  33,  16,  27,   0,  54, 135,
        64,   8,   0,   0,   0,   0,   0,   0,  64,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   1,  64,  79, 158, 226, 235,  28,
        67,  45,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
      dtype=uint8)

The first 6 bytes (64, 0, 0, 126, 0, 9) are STL header, the next 4 bytes (0, 0, 0, 4) is the length of the array, and the rest is both real data and not-real data. The interpretation for this branch,

>>> f['E']['Evt']['mc_trks']['mc_trks.usr'].interpretation
asjagged(asdtype('>f8'), 10)

just skips over the first 10 bytes, expecting the rest to be meaningful data. In all the cases I've seen up to this point, it has been. In your case, there's junk padding after the meaningful data.

So ROOT has more degrees of freedom than previously thought and we have to actually check the 4 byte size in every STL vector jagged array. I'm going to see about adding a parameter for that.