scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

Unintuitive slicing behaviour when slicing with Arrays #370

Closed drahnreb closed 4 years ago

drahnreb commented 4 years ago

Not sure if bug or feature.

Slicing by an array of indices would be very handy but currently fails (or is unreliable).

Minimal working example (mostly based on the README.md):

import numpy as np
import awkward1 as ak # 0.2.27
array = ak.Array([
    [{"x": 1.1, "y": [1]}],
    [{"x": 2.2, "y": [11, 12]}],
    [{"x": 3.3, "y": [21, 22, 23]}],
    #[], # cannot slice this by index
    [{"x": 3.3, "y": [31, 32, 33]}],
    [{"x": 4.4, "y": [41, 42, 43, 44]}],
    [{"x": 5.5, "y": [51, 52, 53, 54, 55]}]
])
# slicing should work by python objects or numpy
# but singleton seems to produce more reliable results
# strangely singletons sometimes do not convert 1-D numpy
# idx = np.array([0, 0, 1, 1, 2, 2])#[:, np.newaxis]
startIndices = ak.singletons([[0], [0], [1], [1], [2], [2]])

# slice each `y` in `array` from start to end resp. [0], [0:1], [1:2], [1], [2:], [2:3]
# endIndices = ak.singletons([[0], [1], [2], [1], [None], [3]])

assert array.shape[0] == startIndices.shape[0]

# this works
array['y', ... , 1:]
# while this fails with ValueError: in ListArray64 attempting to get 1, index out of range
# but should return the same?
array['y', ... , 1]
# (as a consequence) this also fails
array['y', ... , startIndices]

Maybe I am missing something here. Eventually it would be nice to slice from startIndices to endIndices without creating boolean arrays of the entire length or writing a Numba for loop.

mask = np.array([[True], [True, True], [False, True, True], [False, True, False], [False, False, True, True], [False, False, True, False]])
array['y', mask]

Fails with ValueError: arrays used as an index must be a (native-endian) integer or boolean

jpivarski commented 4 years ago

Indexing can be legitimately confusing (also true of NumPy). I'll break this down to what I think you're asking.

First, the array of records can always be projected onto "y". I tried a number of combinations and didn't see any trouble with that. For simplicity of discussion here, instead of talking about

array = ak.Array([
    [{"x": 1.1, "y": [1]}],
    [{"x": 2.2, "y": [11, 12]}],
    [{"x": 3.3, "y": [21, 22, 23]}],
    #[], # cannot slice this by index   (if empty, you'll just have to pass an empty list in the slice)
    [{"x": 3.3, "y": [31, 32, 33]}],
    [{"x": 4.4, "y": [41, 42, 43, 44]}],
    [{"x": 5.5, "y": [51, 52, 53, 54, 55]}]
])

which has type

6 * var * {"x": float64, "y": var * int64}

we could talk about

array["y"]   # or array.y

which is

[[[1]], [[11, 12]], [[21, 22, 23]], [[31, 32, 33]], [[41, 42, 43, 44]], [[51, 52, 53, 54, 55]]]

with type

6 * var * var * int64

You can certainly do

>>> array["y", [[0], [0], [1], [1], [2], [2]]]
<Array [[[[1]]], [[[1, ... [[[21, 22, 23]]]] type='6 * 1 * var * var * int64'>

because each element of the slice has length 1, just like each element of array (and array.y), and the integer value in each is less than the length of the corresponding nested list:

>>> array.y[0, 0]
<Array [1] type='1 * int64'>                    # has an element 0
>>> array.y[1, 0]
<Array [11, 12] type='2 * int64'>               # has an element 0
>>> array.y[2, 0]
<Array [21, 22, 23] type='3 * int64'>           # has an element 1
>>> array.y[3, 0]
<Array [31, 32, 33] type='3 * int64'>           # has an element 1
>>> array.y[4, 0]
<Array [41, 42, 43, 44] type='4 * int64'>       # has an element 2
>>> array.y[5, 0]
<Array [51, 52, 53, 54, 55] type='5 * int64'>   # has an element 2

and so that's why it works. ak.singletons has nothing to do with it: it's used to convert None values into empty lists and everything else into length-1 lists, which you already have.

Recent versions of NumPy provide a hint about why the mask didn't work:

>>> mask = np.array([
...     [True],
...     [True, True],
...     [False, True, True],
...     [False, True, False],
...     [False, False, True, True],
...     [False, False, True, False]])

raises the warning

<stdin>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a
list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to
do this, you must specify 'dtype=object' when creating the ndarray
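The warning comes down to the rows having different lengths, so they cannot form a rectangular boolean array. A minimal plain-Python sketch of that check follows; `is_ragged` is a hypothetical helper written for illustration, not a NumPy or Awkward function.

```python
# Plain-Python sketch; `is_ragged` is a hypothetical helper,
# not part of NumPy or Awkward.
def is_ragged(rows):
    """Return True if the inner lists do not all share one length."""
    return len({len(row) for row in rows}) > 1


mask_rows = [
    [True],
    [True, True],
    [False, True, True],
    [False, True, False],
    [False, False, True, True],
    [False, False, True, False],
]
rectangular_rows = [[True, False], [False, True]]

print(is_ragged(mask_rows))         # True: lengths are 1, 2, 3, 3, 4, 4
print(is_ragged(rectangular_rows))  # False: every row has length 2
```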

This is a NumPy array of dtype=object, which soon won't be created automatically. Constructing the mask as an Awkward array is the first step:

>>> mask = ak.Array([
...     [True],
...     [True, True],
...     [False, True, True],
...     [False, True, False],
...     [False, False, True, True],
...     [False, False, True, False]])
>>> mask
<Array [[True], [True, ... False, True, False]] type='6 * var * bool'>

but it also needs the length-1 structure of startIndices to fit into the second axis:

>>> mask = ak.Array([
...     [[True]],
...     [[True, True]],
...     [[False, True, True]],
...     [[False, True, False]],
...     [[False, False, True, True]],
...     [[False, False, True, False]]])
>>> mask
<Array [[[True]], ... False, True, False]]] type='6 * var * var * bool'>
>>> array.y
<Array [[[1]], [[11, ... [51, 52, 53, 54, 55]]] type='6 * var * var * int64'>
>>> ak.num(mask, axis=2)
<Array [[1], [2], [3], [3], [4], [4]] type='6 * var * int64'>
>>> ak.num(array.y, axis=2)
<Array [[1], [2], [3], [3], [4], [5]] type='6 * var * int64'>

Okay; they line up everywhere except the last list, where the mask has 4 entries against 5 values. Now we're ready to go!

>>> array.y[mask]
<Array [[[1]], [[11, 12], ... 43, 44]], [[53]]] type='6 * var * var * int64'>
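To make the masking rule concrete, here is a plain-Python model of what the jagged boolean mask selects. It is a sketch of the semantics only, not Awkward's implementation; `counts` and `apply_mask` are hypothetical helper names. The model drops values that have no mask entry (the last mask has 4 entries for 5 values), which is an assumption chosen to match the `[[53]]` shown above.

```python
# Plain-Python model of the jagged boolean mask; not the Awkward implementation.
y = [[[1]], [[11, 12]], [[21, 22, 23]], [[31, 32, 33]],
     [[41, 42, 43, 44]], [[51, 52, 53, 54, 55]]]
mask = [[[True]], [[True, True]], [[False, True, True]], [[False, True, False]],
        [[False, False, True, True]], [[False, False, True, False]]]


def counts(nested):
    """Innermost list lengths, mirroring what ak.num(..., axis=2) reports."""
    return [[len(inner) for inner in outer] for outer in nested]


def apply_mask(data, keep):
    """Keep data[i][j][k] wherever keep[i][j][k] is True.

    zip stops at the shorter sequence, so values without a mask entry are dropped.
    """
    return [
        [[v for v, k in zip(inner, inner_keep) if k]
         for inner, inner_keep in zip(outer, outer_keep)]
        for outer, outer_keep in zip(data, keep)
    ]


selected = apply_mask(y, mask)
print(counts(y))     # [[1], [2], [3], [3], [4], [5]]
print(counts(mask))  # [[1], [2], [3], [3], [4], [4]]
print(selected)      # [[[1]], [[11, 12]], [[22, 23]], [[32]], [[43, 44]], [[53]]]
```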

As for a slice whose bounds differ for each list (e.g. slice list 1 with 0:0, list 2 with 1:2, list 3 with 0:2): that's an interesting idea, and one that only becomes useful with ragged arrays, not rectilinear ones.

Right now, that sort of thing can be done by opening up the ak.Array structure and manipulating its memory layout:

>>> original = array.y.layout
>>> original
<ListOffsetArray64>
    <offsets><Index64 i="[0 1 2 3 4 5 6]" offset="0" length="7" at="0x55f65db71150"/></offsets>
    <content><ListOffsetArray64>
        <offsets><Index64 i="[0 1 3 6 9 13 18]" offset="0" length="7" at="0x55f65db75170"/></offsets>
        <content><NumpyArray format="l" shape="18" data="1 11 12 21 22 ... 51 52 53 54 55" at="0x55f65d654e60"/></content>
    </ListOffsetArray64></content>
</ListOffsetArray64>
>>> starts = np.asarray(original.content.starts)
>>> stops  = np.asarray(original.content.stops)
>>> starts, stops
(array([ 0,  1,  3,  6,  9, 13], dtype=int64),
 array([ 1,  3,  6,  9, 13, 18], dtype=int64))

Slicing with a different start[i] and stop[i] at each i is a matter of adding and subtracting the right number from these starts and stops. Be careful if you modify these NumPy arrays in place: they are views of the Awkward layout and will change the Awkward array in-place (one of the few ways Awkward arrays are mutable).

>>> starts = starts + [0, 0, 1, 1, 2, 2]
>>> stops  = stops  - [0, 0, 1, 1, 2, 2]
>>> starts, stops
(array([ 0,  1,  4,  7, 11, 15], dtype=int64), array([ 1,  3,  5,  8, 11, 16], dtype=int64))
>>> modified = ak.layout.ListOffsetArray64(
...     original.offsets,
...     ak.layout.ListArray64(
...         ak.layout.Index64(starts),
...         ak.layout.Index64(stops),
...         original.content.content))
>>> modified
<ListOffsetArray64>
    <offsets><Index64 i="[0 1 2 3 4 5 6]" offset="0" length="7" at="0x55f65db71150"/></offsets>
    <content><ListArray64>
        <starts><Index64 i="[0 1 4 7 11 15]" offset="0" length="6" at="0x55f65db704d0"/></starts>
        <stops><Index64 i="[1 3 5 8 11 16]" offset="0" length="6" at="0x55f65db5b0b0"/></stops>
        <content><NumpyArray format="l" shape="18" data="1 11 12 21 22 ... 51 52 53 54 55" at="0x55f65d654e60"/></content>
    </ListArray64></content>
</ListOffsetArray64>
>>> ak.Array(modified)
<Array [[[1]], [[11, 12]], ... [[]], [[53]]] type='6 * var * var * int64'>
>>> ak.Array(modified).tolist()
[[[1]], [[11, 12]], [[22]], [[32]], [[]], [[53]]]

And that's probably how a variable starts:stops would be implemented. But if the indexing is tricky, this is tricky-squared. It's pretty easy to make an array that's internally inconsistent (check with ak.is_valid and ak.validity_error).
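The starts/stops arithmetic above can be followed without Awkward at all. The sketch below models a ListArray in plain Python, where list i is `content[starts[i]:stops[i]]`; `materialize` and `check_bounds` are hypothetical helpers, and `check_bounds` is only a rough stand-in for the kind of consistency that ak.is_valid checks.

```python
# Plain-Python model of a ListArray: list i is content[starts[i]:stops[i]].
content = [1, 11, 12, 21, 22, 23, 31, 32, 33, 41, 42, 43, 44, 51, 52, 53, 54, 55]
offsets = [0, 1, 3, 6, 9, 13, 18]

starts = offsets[:-1]  # where each list begins
stops = offsets[1:]    # where each list ends (exclusive)
trim = [0, 0, 1, 1, 2, 2]

# Build new lists instead of modifying starts/stops in place.
new_starts = [s + t for s, t in zip(starts, trim)]
new_stops = [e - t for e, t in zip(stops, trim)]


def materialize(content, starts, stops):
    """Turn (starts, stops) back into nested Python lists."""
    return [content[s:e] for s, e in zip(starts, stops)]


def check_bounds(content, starts, stops):
    """Hypothetical sanity check: 0 <= start <= stop <= len(content) for every list."""
    return all(0 <= s <= e <= len(content) for s, e in zip(starts, stops))


print(new_starts)  # [0, 1, 4, 7, 11, 15]
print(new_stops)   # [1, 3, 5, 8, 11, 16]
print(check_bounds(content, new_starts, new_stops))  # True
print(materialize(content, new_starts, new_stops))
# [[1], [11, 12], [22], [32], [], [53]]
```

Trimming a list by more than half its length would push a start past its stop; Python slicing would silently return an empty list, which is why a separate bounds check is worth running.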

drahnreb commented 4 years ago

Great, thank you very much for that swift clarification.

I was particularly looking for the second explanation (it is my overall motivation for using Awkward, as I deal with that sort of task a lot in the context of jagged arrays).

You can close this issue. Maybe the first part could be part of the quickstart. Let me know if I can help fill the doc stubs with content.

drahnreb commented 4 years ago

starts = np.asarray(original.content.starts)

Throws an error with the same array: AttributeError: 'awkward1._ext.NumpyArray' object has no attribute 'starts'

jpivarski commented 4 years ago

Thanks for the offer! The stubs are there because I have to finish other projects (Uproot 4), which also need documentation—Awkward is half-there in that it has all the reference docs, and the ones in the Python API include examples.

If you submit a documentation issue with the examples you'd like to see fill the stub, I'll enter them into the stub. I don't have it set up as a wiki (whenever I do make a wiki, no one edits it!), in part because evaluating the JupyterBook is part of the build (to ensure that tutorial examples are not broken), which gives me a chance to edit. But suggested text definitely bumps it up in priority: if you write it, I'll post it.

jpivarski commented 4 years ago

starts = np.asarray(original.content.starts)

Throws an error with the same array: AttributeError: 'awkward1._ext.NumpyArray' object has no attribute 'starts'

That's where you're getting into trickiness-squared. The different node types in a layout have different properties: NumpyArray represents rectilinear data, like NumPy, which has no need of starts and stops. ListArray is a fully general jaggedness implementation and ListOffsetArray is the common case in which starts = offsets[:-1] and stops = offsets[1:] for some monotonically increasing offsets array with length N+1 (N is the length of the logical array). These links provide more information about the layout classes, but keep in mind that everything under layout is semi-internal. (It doesn't start with an underscore because it's public API, but it's for framework developers, not data analysts.)
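One way to see the difference between the two list types: a (starts, stops) pair can be collapsed into a single offsets array only when each list ends exactly where the next one begins. The sketch below is a plain-Python illustration of that condition; `to_offsets` is a hypothetical helper, not part of the Awkward API.

```python
def to_offsets(starts, stops):
    """Return an offsets list if (starts, stops) is contiguous, else None.

    Hypothetical helper: contiguous means stops[i] == starts[i + 1] for all i,
    with every start <= its stop, which is the ListOffsetArray special case.
    """
    if len(starts) != len(stops):
        raise ValueError("starts and stops must have the same length")
    if not starts:
        return [0]
    if any(s > e for s, e in zip(starts, stops)):
        return None
    for i in range(len(starts) - 1):
        if stops[i] != starts[i + 1]:
            return None
    return list(starts) + [stops[-1]]


# The original layout is contiguous, so one offsets array describes it.
print(to_offsets([0, 1, 3, 6, 9, 13], [1, 3, 6, 9, 13, 18]))
# [0, 1, 3, 6, 9, 13, 18]

# The trimmed layout has gaps, so it needs separate starts and stops.
print(to_offsets([0, 1, 4, 7, 11, 15], [1, 3, 5, 8, 11, 16]))
# None
```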

drahnreb commented 4 years ago

Just realized that your suggested array slicing won't result in what I meant by index slicing.

You can certainly do

>>> array["y", [[0], [0], [1], [1], [2], [2]]]
<Array [[[[1]]], [[[1, ... [[[21, 22, 23]]]] type='6 * 1 * var * var * int64'>

This selects entire nested lists by a collection of indexes (like np.take). What I meant was to pick items out of each nested list by its respective index (similar to the second part of your clarification, just without a range).
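A plain-Python reading of that behaviour, under the assumption that the nested list is treated as a rectangular 6 x 1 integer index on the first axis. This is a sketch that reproduces the start and end of the quoted output, not a statement about Awkward's internals.

```python
# Plain-Python model: a rectangular integer index picks whole entries of y.
y = [[[1]], [[11, 12]], [[21, 22, 23]], [[31, 32, 33]],
     [[41, 42, 43, 44]], [[51, 52, 53, 54, 55]]]
index = [[0], [0], [1], [1], [2], [2]]

# Each index value i is replaced by the entire entry y[i].
taken = [[y[i] for i in row] for row in index]

print(taken[0])    # [[[1]]]
print(taken[-1])   # [[[21, 22, 23]]]
print(len(taken), len(taken[0]))  # 6 1
```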

For simplicity, let's reconsider. Instead of this (or the example cited above):

>>> a = ak.from_iter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]])
>>> a[[0, 3], [True, False, True]]
<Array [1.1, 8.8] type='2 * float64'>

where we rearrange and select within nested lists of the same size (it is rather uncommon to be able to assume a rectangular set of non-jagged arrays; this would fail for a[[0, 1], [True, False, True]])

I intended to do something like this (multidimensional index slicing):

>>> idx = [0,1,2]
>>> a[[0,1,3], idx]
<Array [[1.1], [], [8.8]] type='3 * var * float64'>
>>> # or generally over the entire array
>>> idxs = [0,1,0,2,1]
>>> # assert a.shape[0] == len(idxs)
>>> a[list(range(a.shape[0])), idxs]
<Array [[1.1], [], [4.4], [8.8], []] type='5 * var * float64'>

or perhaps like so (treat a slice array of type awkward1._ext.Index64 differently and try to slice the nested lists; return [] if ak.num(.) <= idx):

>>> a[ak.layout.Index64(idxs)]
<Array [[1.1], [], [4.4], [8.8], []] type='5 * var * float64'>

The only way I see to do this is your suggested second approach, where I shift each start by idxs and set stop = start + 1. Maybe there is a more elegant way?

jpivarski commented 4 years ago

You could use ak.pad_none to make each inner list have at least the right number of elements:

>>> ak.pad_none(a, 3)
<Array [[1.1, 2.2, 3.3], ... [9.9, None, None]] type='5 * var * ?float64'>

Then it would be legal to ask for [0, 1, 2] in the second dimension, because its maximum index, 2, exists:

>>> ak.pad_none(a, 3)[[0, 1, 3], [0, 1, 2]]
<Array [1.1, None, 8.8] type='3 * ?float64'>

The Numpyian thing to do when given advanced arrays in two dimensions is to "iterate over them as one" and return the elements that match—a single-depth list, as above. In your examples, it looks like you want nested lists, and you want the empty list in a to become an empty list in the output. ak.singletons, which you're already familiar with, is for that:

>>> ak.singletons(ak.pad_none(a, 3)[[0, 1, 3], [0, 1, 2]])
<Array [[1.1], [], [8.8]] type='3 * var * float64'>

In your example, you have [[1.1], [], [7.7]], but I assume that's a mistake because idx[2] is 2, which picks out the last element of [6.6, 7.7, 8.8].

Your second example would look like this then:

>>> ak.singletons(ak.pad_none(a, 3)[range(len(a)), [0, 1, 0, 2, 1]])
<Array [[1.1], [], [4.4], [8.8], []] type='5 * var * float64'>

though if it was big, you wouldn't want to do a Python range, you'd do np.arange:

>>> ak.singletons(ak.pad_none(a, 3)[np.arange(len(a)), [0, 1, 0, 2, 1]])
<Array [[1.1], [], [4.4], [8.8], []] type='5 * var * float64'>
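The pad, pick, wrap sequence can be traced step by step in plain Python. The helpers below (`pad_none_model`, `pick`, `singletons_model`) are hypothetical stand-ins that mimic the behaviour described above for this small example; they are not the Awkward functions themselves.

```python
a = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]]


def pad_none_model(rows, target):
    """Pad each row with None up to at least `target` items (longer rows are kept)."""
    return [row + [None] * (target - len(row)) for row in rows]


def pick(rows, row_index, col_index):
    """Pair the two index lists element by element, NumPy-style."""
    return [rows[r][c] for r, c in zip(row_index, col_index)]


def singletons_model(values):
    """None becomes an empty list; anything else becomes a length-1 list."""
    return [[] if v is None else [v] for v in values]


padded = pad_none_model(a, 3)

first = singletons_model(pick(padded, [0, 1, 3], [0, 1, 2]))
second = singletons_model(pick(padded, range(len(a)), [0, 1, 0, 2, 1]))

print(first)   # [[1.1], [], [8.8]]
print(second)  # [[1.1], [], [4.4], [8.8], []]
```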

I should warn you to stay away from a.shape. It exists as a side-effect of supporting Pandas views, but I'll be removing them, in large part because of the badly named methods and properties they force me to have (#350). A real "shape" property would somehow characterize the nested lists, though that can't be done as a tuple of numbers. (That's why Awkward has ak.type.) Pandas requires it to be (len(a),). I'm 99% sure I'll be deprecating Pandas views, so the shape property will be removed.