scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

Extend `__getitem__` to include jagged and masked arrays in slices. #67

Closed: jpivarski closed this issue 4 years ago

jpivarski commented 4 years ago

Relies upon #66.

Follow pyarrow.Array's behavior for slicing with masked arrays (IndexedOptionArray, BitMaskedArray, and eventually ByteMaskedArray).

Will need to extend Slice hierarchy and add jagged and masked cases to Content::getitem_*.

nsmith- commented 4 years ago

Here is an example of the pyarrow behavior:

In [1]: import pyarrow as pa

In [2]: pa.array(range(5))
Out[2]:
<pyarrow.lib.Int64Array object at 0x112289c90>
[
  0,
  1,
  2,
  3,
  4
]

In [3]: pa.array(range(5)).take(pa.array([1, None, 2]))
Out[3]:
<pyarrow.lib.Int64Array object at 0x1122dd130>
[
  1,
  null,
  2
]

jpivarski commented 4 years ago

pyarrow doesn't support it, but a logical extension should also do this:

>>> pa.array(range(5)).compress(pa.array([False, True, None, None, True]))
[
   1,
   null,
   null,
   4
]

Of course, "compress" is a terrible name for this operation, and pyarrow's compress function does the more logical thing for that name: lossless compression. However, when these selections are written through `__getitem__`, without special names like take and compress, the above is what a user would expect.
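
To make the two intended behaviors concrete, here is a minimal pure-Python model of them. It is only a sketch of the semantics shown in the examples above, not how pyarrow or Awkward Array implement them, and the function names `take` and `compress` are hypothetical helpers chosen to mirror the discussion.

```python
def take(values, indices):
    """Integer-array selection: a None index yields a None in the output."""
    return [None if i is None else values[i] for i in indices]


def compress(values, mask):
    """Boolean-array selection with missing values.

    True keeps the value at that position, False drops it, and None
    still occupies a position but emits None instead of the value.
    """
    out = []
    for value, keep in zip(values, mask):
        if keep is None:
            out.append(None)
        elif keep:
            out.append(value)
    return out


values = list(range(5))
print(take(values, [1, None, 2]))
print(compress(values, [False, True, None, None, True]))
```

The key design point is that a None in a boolean mask is not treated as False: it consumes a position and produces a None in the result, which is why the output above has four elements rather than two.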

jpivarski commented 4 years ago

Step 1 is done (in PR #111):

>>> ak.Array(range(5))[ak.Array([1, None, 2])]
<Array [1, None, 2] type='3 * ?int64'>

jpivarski commented 4 years ago

Step 2 is done (also in PR #111):

>>> ak.Array(range(5))[ak.Array([False, True, None, None, True])]
<Array [1, None, None, 4] type='4 * ?int64'>

jpivarski commented 4 years ago

And all the jagged slices:

>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]])
>>> ak.tolist(array[[[0, -1], [], [], [0, 0, 0], [-1, -2, -3]]])
[[1.1, 3.3], [], [], [6.6, 6.6, 6.6], [9.9, 8.8, 7.7]]
>>> ak.tolist(array[[[0, None, -1], [None], [], [0, None, 0], [-1, -2, -3]]])
[[1.1, None, 3.3], [None], [], [6.6, None, 6.6], [9.9, 8.8, 7.7]]
>>> ak.tolist(array[[[0, -1], None, [], [], None, [0, 0, 0], [-1, -2, -3]]])
[[1.1, 3.3], None, [], [], None, [6.6, 6.6, 6.6], [9.9, 8.8, 7.7]]
>>> ak.tolist(array[[[0, None, -1], None, [None], [], None, [0, 0, 0], [-1, -2, -3]]])
[[1.1, None, 3.3], None, [None], [], None, [6.6, 6.6, 6.6], [9.9, 8.8, 7.7]]
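
The four cases above follow one rule, which can be modeled in plain Python. This is a sketch inferred from the outputs shown, not the library's implementation, and `jagged_take` is a hypothetical name. It assumes that a None at the outer level emits None without consuming a sublist of the array, while a None at the inner level emits None in place of an element. It does no length validation.

```python
def jagged_take(array, index):
    """Model of jagged integer slicing with None allowed at either level."""
    out = []
    rows = iter(array)
    for sub in index:
        if sub is None:
            # Outer None: emit None and do not advance to the next sublist.
            out.append(None)
            continue
        row = next(rows)
        # Inner None: emit None; integers (including negatives) index the row.
        out.append([None if i is None else row[i] for i in sub])
    return out


array = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]]
index = [[0, None, -1], None, [None], [], None, [0, 0, 0], [-1, -2, -3]]
print(jagged_take(array, index))
```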

jpivarski commented 4 years ago

And jagged mask (almost forgot the most important case!):

>>> ak.tolist(array[[[False, False, True], [], [True, True], [False], [True, False, True]]])
[[3.3], [], [4.4, 5.5], [], [7.7, 9.9]]

jpivarski commented 4 years ago

This can also have None:

>>> ak.tolist(array[[[False, False, True], None, [], None, [True, True], [False], [True, False, True]]])
[[3.3], None, [], None, [4.4, 5.5], [], [7.7, 9.9]]

jpivarski commented 4 years ago

Getting None values in the inner layer (correctly across jagged boundaries) was more difficult, but it's done now:

>>> ak.tolist(array[[[False, True, None], [None], [None, True], [False], [True, False, True]]])
[[2.2, None], [None], [None, 5.5], [], [7.7, 9.9]]

You can even do them at both levels. :)

>>> ak.tolist(array[[[False, True, None], None, [None], None, [None, True], [False], [True, False, True]]])
[[2.2, None], None, [None], None, [None, 5.5], [], [7.7, 9.9]]
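
For reference, the jagged-mask results above are consistent with the following pure-Python model. It is a sketch of the observed semantics only, not the actual implementation, and `jagged_mask` is a hypothetical name. It assumes an outer None emits None without consuming a sublist, and an inner None occupies a position in the sublist but emits None. It does not validate that mask and sublist lengths agree.

```python
def jagged_mask(array, mask):
    """Model of jagged boolean slicing with None allowed at either level."""
    out = []
    rows = iter(array)
    for sub in mask:
        if sub is None:
            # Outer None: emit None and do not advance to the next sublist.
            out.append(None)
            continue
        row = next(rows)
        picked = []
        for pos, keep in enumerate(sub):
            if keep is None:
                # Inner None: takes up a position, emits None.
                picked.append(None)
            elif keep:
                picked.append(row[pos])
        out.append(picked)
    return out


array = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]]
mask = [[False, True, None], None, [None], None, [None, True], [False], [True, False, True]]
print(jagged_mask(array, mask))
```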

So this issue is closed. The tests in tests/test_PR111_jagged_and_masked_getitem.py are much more extensive.