scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
827 stars 85 forks source link

NumPy advanced indexing is very slow for multidimensional arrays #2080

Open agoose77 opened 1 year ago

agoose77 commented 1 year ago

Version of Awkward Array

main

Description and code to reproduce

The Problem

Some context:

Here I'm talking about NumPy advanced indexing. There are two indexing modes; boolean and integer.

Awkward supports advanced indexing with the advanced mechanism and _getitem_next. In the first call to _getitem_next, Awkward performs a _carry with a one dimensional array. This recurses, and the carryd layout is sliced with a second one-dimensional array, (and so on for > 2D masks, or > 2 advanced integer ndices).

This first slice is the slowest, because we are carrying the outer dimension of the array, which, for an array of shape (N, M), performs a copy of M elements for each index that we pull out.

For an array of 4096 * 512 elements, a coin-flip mask would produce a result with ~1048576 elements. This corresponds to a copy of size 1048576 * 512 elements for this first slice, which is 4,294,967,296 bytes for 64-bit dtypes!

Fix

Boolean Indexing

It does not make sense to support boolean indexing for truly ragged arrays — the positional correspondence is lost. As such, it is trivial to fix this case; simply recurse to the leaf 1D NumpyArray node, and apply the travelled boolean mask (index, = np.where(mask.ravel())), after first ensuring that the indexed array is a regular type.

Integer Indexing

We currently support advanced integer indexing for truly ragged arrays. I'd like to advocate for dropping this in the name of simplicity and performance, but the rest of us might have a sense of whether people are actually using this feature on ragged arrays. It would certainly a breaking change.

If we did remove ragged support, then the solution to integer indexing becomes very similar to that for boolean indexing. We simply use np.ravel_multi_index() and then index the leaf as above.

If we did not remove ragged support, then either we make the current advanced support a slow-path for ragged arrays, or, we push two lists to the leaf:

A kernel / NumPy function can be used to perform a ragged equivalent of ravel_multi_index here.

There's probably more to this, but these are my first thoughts.

jpivarski commented 1 year ago

I see no reason to remove array-of-integer indexing on truly ragged arrays, even if it is slow, because there is no other way to do it. A slow implementation of a capability is better than no implementation of a capability.

But if the array is not "truly" ragged—that is, if the lengths of lists at each level are uniform, regardless of whether they're RegularArray or ListArray/ListOffsetArray—then Awkward array-of-integer and array-of-boolean indexing are effectively a slow implementation of what NumPy computes quickly. It could be valuable to have a fast path for this case. If we detect that the array being sliced is NumPy-like (i.e. ak.to_numpy would not raise an exception), then we could turn it into an actual NumPy (or CuPy, etc.) array and call its __getitem__ to do the slice. We wouldn't even need to get into details like ravel_multi_index (though this might be what NumPy uses to implement the slice): we'd just hand it a true NumPy array and let it slice it.

That would leave open one remaining case: what to do about missing values. ak.to_numpy will (with allow_missing=True) push option-type at any level down to the leaf level so that the array can be a np.ma.MaskedArray. If we use ak.to_numpy (or ak.to_cupy, etc.) to determine if an array is regular, make it a true NumPy (or CuPy, etc.) array, let the array library slice it, and wrap the result in a shallow ak.Array, then it would have the wrong type and values:

>>> array = ak.Array([[1, 2], None, [3, 4]])
>>> array
<Array [[1, 2], None, [3, 4]] type='3 * option[var * int64]'>

>>> ak.from_numpy(ak.to_numpy(array))
<Array [[1, 2], [None, None], [3, 4]] type='3 * 2 * ?int64'>

Although that can be repaired after the fact. (Insert a ByteMaskedArray at the right level.)


But, to emphasize the first point, I would not be in favor of taking away a capability that isn't available anywhere else (NumPy-like slicing of non-NumPy-like ragged arrays). I would be in favor of someday adding a fast path for the overlap in functionality between Awkward Array and NumPy, in which NumPy does it much faster.

However, that would not make the codebase simpler: we'd still need the old code path for the general case (array being sliced is not truly regular), and then we'd have an additional code path for optimization. That's why we start with general, albeit slow, code: only special cases can be optimized, and adding special cases complicates the codebase—which is fine, as long as we're satisfied that the baseline is under control.

agoose77 commented 1 year ago

OK, let's not moot removing the feature on ragged arrays.

I think there's a happy medium here where we can implement a general solution that's faster than sequential carry. I think we should be able to defer the carry until the end by building the appropriate index structures, i.e. there's a solution that doesn't involve to_numpy(). There're a lot of cases to think about, and I'm not going to work on this immediately, so I'll just leave that as an assertion!

agoose77 commented 1 year ago

To further explain what's happening here to make this a bug:

Consider an array of shape (N, M, P, Q). An advanced index of length K will perform a carry of length K into (N, M, P, Q) to produce an array of shape (K, M, P, Q). If K is of the order of the array size, then we might approximate this to be (K * M * P * Q, M, P, Q) which is clearly not going to scale well!

@jpivarski and I discussed this, and it needs to be fixed, but given that no-one has reported it, it's not the most pressing thing to solve.