intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License
34 stars 6 forks source link

Allow AwkwardExtensionArray to be generally slicable #20

Closed jpivarski closed 1 year ago

jpivarski commented 2 years ago

The AwkwardExtensionArray.__getitem__ method is opinionated about the types that it gets; it would be more powerful if it passes any slicer through to the underlying Awkward Array and lets that code succeed or fail, depending on slicer type.

We want to be able to take a Pandas Series that contains an Awkward Array, series, and say

series.values["x"]

or

series.values[[2, 0, 0, 1]]

and have that pass the slice "x" or [2, 0, 0, 1] to the Awkward Array. (The

Since the __getitem__ checks types,

https://github.com/intake/awkward-pandas/blob/4c1020c4d88ebccc8c2bfaad69c3b01b0aba3bac/src/awkward_pandas/array.py#L60-L67

both of these fail. (The [2, 0, 0, 1] slice would succeed if we wrapped it as an np.ndarray.)

This came up in scikit-hep/uproot5#734, editing test_0033-more-interpretations-2.py#L82 to be result.values["x"].tolist(), where result is a Series.

The strategy of __getitem__ can be to pass whatever item it gets as a slice on self._data, then check self._data to see if it's an ak.Array, an ak.Record, or other, and wrapping ak.Array or ak.Record as a new AwkwardExtensionArray. The scalar output types ("other") can be returned as-is.

Cc: @kkothari2001

douglasdavis commented 2 years ago

Thanks for raising this @jpivarski-- I'll start working on expanding __getitem__ support.

martindurant commented 2 years ago

I'll just point out, though, that series[] will not be able to support this, since pandas handles the input before passing it down to the extension array. We already have series.values.ak[...], which is I think exactly what you are after.

jpivarski commented 2 years ago

series.values.ak[...] would be fine, if documented.

douglasdavis commented 2 years ago

We already have series.values.ak[...]

I think you mean series.ak[...] :)

martindurant commented 2 years ago

er, that might be it :)

douglasdavis commented 2 years ago

Yea after looking at this code again for the first time in a number of weeks, I now remember the extension array's __getitem__ implementation is there to support common Series workflows (i.e. pd.Series is going to call down to it as mentioned by Martin), while we do want to encourage series.ak[...] over series.values[...] for more awkward style workflows. I think this makes it clear that it's time to start a user guide

jpivarski commented 2 years ago

Not sure if this needs to be a separate PR, but the AwkwardExtensionArray.__repr__, inherited from the superclass, is making as many Awkward Arrays as there are rows (or visible rows?) which would be expensive if it really is all rows, but is hard to read due to the type information in either case. If it's just a cosmetics issue and not a performance issue, printing the rows with str instead of repr would fix it.

martindurant commented 2 years ago

https://github.com/intake/awkward-pandas/pull/16 was an attempt to deal with that, we are aware that we need something better.

douglasdavis commented 2 years ago

Quick start docs which will hopefully make clear the difference between acting on the Series and the underlying array are starting to form here: https://awkward-pandas.readthedocs.io/en/latest/quickstart.html