scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License
314 stars 67 forks source link

Type error when slicing vector<vector<float>> ObjectArray branch #519

Open asopio opened 4 years ago

asopio commented 4 years ago

Hi, I encountered some unexpected behaviour when trying to slice a branch of type vector<vector<float>>, containing vectors of variable length. Getting the array returns the correct result:


t_test = uproot.open("testfile.root")["testtree"]
tnew['testbranch'].array()[:,0]

<ObjectArray [[[0.12897478, 0.33032617, 0.0053174845, 0.0062657734, 0.16532688, 0.07413943, 0.25900677, 0.084950365], [0.35509595]] [[0.15334472, 0.10603446, 0.004459681, 0.0222604, 0.38413692, 0.39519757], [0.006101171, 0.014523825, 0.036121503, 0.039234158, 0.22525033, 0.34617442]] [[0.02478936, 0.06844998, 0.027409986, 0.0065677217, 0.18938226, 0.0432718, 0.010767343, 0.18636794, 0.4008715, 0.41020113], [0.37996235, 0.34230024]] ... [[0.2614188, 0.20125817, 0.046599977, 0.05807799, 0.010248912, 0.020693984, 0.05169987, 0.10514213, 0.43268573], [0.29535455]] [[0.23408501, 0.14874455, 0.016859913, 0.008109958, 0.010894945, 0.020351907, 0.38372123, 0.09444051]] [[0.070886604, 0.4732762, 0.024249015, 0.04118293, 0.49230808, 0.14104557, 0.17984901, 0.40091348], [0.48583144, 0.030749094, 0.015568347, 0.18870124, 0.12960392, 0.16022786, 0.076917425, 0.2069286, 0.052053016, 0.33399573]]] at 0x7f4e0ab48ad0>

... but slicing seems to convert the entries to uints:

tnew['testbranch'].array()[:,0]

[array([  0,   0,   0,   2,   0,   0,   0,   8,  62,   4,  17, 247,  62,
        169,  32, 131,  59, 174,  62,  75,  59, 205,  81,  30,  62,  41,
         75, 115,  61, 151, 214, 106,  62, 132, 156, 137,  61, 173, 250,
        117,   0,   0,   0,   1,  62, 181, 207,  35], dtype=uint8),
 array([  0,   0,   0,   2,   0,   0,   0,   6,  62,  29,   6, 102,  61,
        217,  40, 152,  59, 146,  34, 132,  60, 182,  91, 113,  62, 196,
        173, 152,  62, 202,  87,  86,   0,   0,   0,   6,  59, 199, 236,
         85,  60, 109, 245,  86,  61,  19, 244,  36,  61,  32, 179, 255,
         62, 102, 168,   6,  62, 177,  61, 198], dtype=uint8),
 array([  0,   0,   0,   2,   0,   0,   0,  10,  60, 203,  19,  14,  61,
        140,  47, 129,  60, 224, 138, 232,  59, 215,  54,  11,  62,  65,
        237, 108,  61,  49,  61, 197,  60,  48, 105, 131,  62,  62, 215,
         61,  62, 205,  63,   7,  62, 210,   5, 226,   0,   0,   0,   2,
         62, 194, 138, 109,  62, 175,  65, 250], dtype=uint8),
 ...]

Surely, this is not intentional.

cheers, Alex

arraybug.zip

jpivarski commented 4 years ago

This is a fundamental issue with Awkward 0's ObjectArray: the elements are Python objects. You're right that the above behavior is not intended, but no behavior is intended for tnew['testbranch'].array()[:, 0] where the first axis is the ObjectArray. ("The behavior is undefined.") Multidimensional slices can't cross the boundary from outside an ObjectArray into an ObjectArray—the fact that anything happened at all is an accident.

In most cases like this, you encounter some error. Somehow, this example happened to slip through code it wasn't intended for and yield a result—a bizarre one—through sheer happenstance. Perhaps the right thing to do would have been to detect these multidimensional slices in ObjectArray and refuse it upfront: define the behavior to be an error. Awkward 1 is more aggressive about identifying and refusing undefined behavior (because half of it is written in C++, where undefined behavior would be more dangerous).

This is the reason that Awkward 1 has no ObjectArray and Uproot 4 is more of a rewrite than an upgrade to take advantage of it. In Uproot 4, std::vector<float> → ListArray(NumpyArray) and std::vector<std::vector<float>> → ListArray(ListArray(NumpyArray)), etc. to any depth, despite the fact that the ROOT serialization for these cases are very different. (The motivation for Uproot 3 creating ObjectArrays was to mirror the way it's serialized in ROOT, but we now know that it's better to hide that difference.)

Your example works fine in Awkward 1/Uproot 4:

>>> import uproot4
>>> t = uproot4.open("issue519.root")["testtree"]
>>> a = t["testbranch"].array()

>>> a
<Array [[[0.129, 0.33, ... 0.0869, 0.197]]] type='10 * var * var * float64'>
>>> a[:, 0]
<Array [[0.129, 0.33, ... 0.0869, 0.197]] type='10 * var * float64'>

>>> import awkward1 as ak
>>> ak.num(a)     # most of these sublists have only one element
<Array [2, 2, 2, 2, 1, 1, 1, 1, 1, 1] type='10 * int64'>

>>> a[:, -1]      # so hey, let's look at the last in each
<Array [[0.355], ... 0.458, 0.0869, 0.197]] type='10 * var * float64'>

It's already the case that Awkward 1 is more mature than Awkward 0, and it's better documented. I've been recommending Awkward 1 since March with ak.from_awkward0/ak.to_awkward0 for brief forays into the new library/new interface, so that it can be minimally adopted without breaking existing scripts.

Uproot 4 is lacking documentation and the file-writing feature, so I've been cautiously recommending it on a trial basis. I'm working on the documentation now, and when that's done, I'll be recommending Uproot 4 more forcefully. When the file-writing feature is also done, Uproot 4 will have feature parity with Uproot 3 and I'll actually move the GitHub and PyPI names:

so that pip install awkward uproot installs the new ones and users will either have to change their scripts or import uproot3 as uproot. That will happen before the end of this year.

On the point of ROOT deserialization, Uproot 4 is already more capable than Uproot 3, so unless you're interested in file-writing or don't want to change interface yet, you'll probably prefer Uproot 4.