Slicing awkward column of DataFrame view not behaving as expected

ast0815 commented 1 year ago

Hello,

I hope the title is somewhat correct.

What I tried to do is select the first element of an awkward column in a Pandas DataFrame and create a new column with just those elements. Because there are entries with 0 elements, I filtered those out first:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
print(df)

Unfortunately, in this case the output is not as expected, and the last entries are all just the same number repeated over and over:

      NMuon    Muon_Px    Muon_Py    Muon_Pz
0         2 -52.899456 -11.654672  -8.160793
1         1  -0.816459 -24.404259  20.199968
2         2  48.987831 -21.723139  11.168285
3         2  22.088331 -85.835464  403.84845
4         2  45.171322  67.248787 -89.695732
...     ...        ...        ...        ...
2416      1  23.913206 -35.665077  54.719437
2417      1  23.913206 -35.665077  54.719437
2418      1  23.913206 -35.665077  54.719437
2419      1  23.913206 -35.665077  54.719437
2420      1  23.913206 -35.665077  54.719437

[2362 rows x 4 columns]

If I use reset_index in between, it works as expected:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df = df.reset_index()
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
print(df)

      index  NMuon    Muon_Px    Muon_Py     Muon_Pz
0         0      2 -52.899456 -11.654672   -8.160793
1         1      1  -0.816459 -24.404259   20.199968
2         2      2  48.987831 -21.723139   11.168285
3         3      2  22.088331 -85.835464   403.84845
4         4      2  45.171322  67.248787  -89.695732
...     ...    ...        ...        ...         ...
2357   2416      1 -39.285824 -14.607491    61.71579
2358   2417      1  35.067146 -14.150043  160.817917
2359   2418      1 -29.756786 -15.303859   -52.66375
2360   2419      1    1.14187   63.60957  162.176315
2361   2420      1  23.913206 -35.665077   54.719437

[2362 rows x 5 columns]
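
If the extra index column in the output above is unwanted, reset_index accepts drop=True, which discards the old labels instead of keeping them as a column. A minimal sketch, with an ordinary float column standing in for the awkward one (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"n": [2, 0, 1], "x": [1.5, 2.5, 3.5]})
kept = df[df.n > 0]                         # index labels are [0, 2]

with_col = kept.reset_index()               # old labels kept as a column named "index"
without_col = kept.reset_index(drop=True)   # old labels discarded entirely
```

Either form leaves a contiguous 0..N-1 index, which is what makes the later column assignments line up.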

It seems to me like the weird behaviour of the first instance is a bug. At least it is pretty unexpected. Or maybe I am just using the accessor wrong. I did not find an example of how to do what I want in the documentation.

I am using the latest version of awkward and awkward-pandas.

This is where I first discussed this issue in the uproot context: https://github.com/scikit-hep/uproot5/discussions/803

martindurant commented 1 year ago

I see the issue, but I'm not certain what to do about it yet. The index of the df after filtering looks like

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            2411, 2412, 2413, 2414, 2415, 2416, 2417, 2418, 2419, 2420],
           dtype='int64', length=2362)

i.e., the integer rows which matched the filter. When extracting .ak[:, 0], you get a new series with a fresh default index (effectively it has been index-reset).
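
The label mismatch described here can be reproduced with plain pandas, without awkward-pandas at all. The sketch below is only an illustration under that assumption: the frame, the column names, and the "fresh" series are invented stand-ins for the filtered DataFrame and for the result of .ak[:, 0]. With ordinary float columns the unmatched labels come out as NaN; the repeated last value in the report above presumably comes from how the extension array handles the same realignment, which this sketch does not reproduce.

```python
import pandas as pd

df = pd.DataFrame({"n": [2, 0, 1, 0, 3], "x": [10.0, 20.0, 30.0, 40.0, 50.0]})
kept = df[df.n > 0].copy()                # index labels are [0, 2, 4]

# Stand-in for the result of .ak[:, 0]: same length, but a fresh default index [0, 1, 2].
fresh = pd.Series([1.0, 2.0, 3.0])

# Assigning a Series aligns on labels: 0 -> 1.0, 2 -> 3.0, and label 4 has no match.
kept["by_label"] = fresh

# Assigning a plain NumPy array ignores labels and fills the rows in order.
kept["by_position"] = fresh.to_numpy()
```

Only the positional assignment puts the three values on the three surviving rows in order, which is why resetting the index first makes the original code behave.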

We can cope with simple cases of such indexing, like the example diff below. However, as @jpivarski will tell you, the set of things you might pass to awkward's getitem is complex (e.g., field strings, where the order of column selection and row selection can commute), and it's not entirely clear we can cover them all.

--- a/src/awkward_pandas/accessor.py
+++ b/src/awkward_pandas/accessor.py
@@ -46,11 +46,20 @@ class AwkwardAccessor:

     @property
     def array(self):
         return self.extarray._data

-    def __getitem__(self, *items):
-        ds = self.array.__getitem__(*items)
-        return pd.Series(AwkwardExtensionArray(ds))
+    def __getitem__(self, items):
+        """Extract components using awkward indexing"""
+        ds = self.array.__getitem__(items)
+        index = None
+        if items[0]:
+            if (
+                    not isinstance(items[0], str)
+                    and not (isinstance(items[0], list) and isinstance(items[0][0], str))
+            ):
+                index = self._obj.index[items[0]]
+        return pd.Series(AwkwardExtensionArray(ds), index=index)

jpivarski commented 1 year ago

Is this a follow-up from scikit-hep/awkward#803?

I'll try to think of a way to get an "index slicer" from an arbitrary slice of an Awkward Array, so that you have something to apply to the Pandas index. It would have to turn any slicer into a one-dimensional slicer somehow.

ast0815 commented 1 year ago

> Is this a follow-up from scikit-hep/awkward#803?

No, at least not from my side. I came here from https://github.com/scikit-hep/uproot5/discussions/803 because I thought it makes more sense here than as a discussion in uproot.

martindurant commented 1 year ago

Is it a coincidence that both are 803? :)

I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.

jpivarski commented 1 year ago

Oh, yes, Uproot is using awkward-pandas! I'm sorry that I didn't mention this—I didn't realize that I hadn't. This was a follow-up project from @kkothari2001. I considered it a prerequisite for implementing uproot.dask with library="pd" (which isn't done yet), because that would have been much harder if we had to support Uproot's complicated "exploding" of structures into DataFrames. Now Uproot just checks a TBranch to see if it's simple enough to be NumPy and wraps it as a standard Pandas column if it is, as an awkward-pandas column if it is not. When dask-awkward and awkward-pandas work well together and with dask-dataframe, then uproot.dask would have a natural implementation for library="pd".

agoose77 commented 1 year ago

> Is it a coincidence that both are 803? :)
>
> I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.

Yes, I think there's been some confusion in authoring links; @ast0815 filed this after our discussion in uproot. I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).

jpivarski commented 1 year ago

There is a well-defined set of slicer types that are accepted by Awkward Array. It hasn't been written down in documentation, but it all happens in two places:

https://github.com/scikit-hep/awkward/blob/89f7686aeb242b4729994028331c9e4f7f309ab5/src/awkward/contents/content.py#L495-L613

https://github.com/scikit-hep/awkward/blob/89f7686aeb242b4729994028331c9e4f7f309ab5/src/awkward/_slicing.py#L129-L201

(The second is for handling tuple items.)

This list is:

integer (picks an item)
slice (views a subrange; more complicated if step != 1)
np.newaxis/None (inserts a regular dimension of length 1)
... (inserts enough empty slices for the rest of the tuple to slice from the bottom up)
string (selects a record field)
non-tuple Sized Iterable of strings, including ak.Array if has string type (selects multiple record fields)
non-tuple Sized Iterable of booleans (NumPy-like slicing if all of this iterable's dimensions are regular)
non-tuple Sized Iterable of integers (NumPy-like slicing if all of this iterable's dimensions are regular)
tuple of the above (but not tuples of tuples!)
ak.Array of nested lists and option-types, terminating on booleans or integers (can't be in a tuple unless the tuple length is 1).

So, which ones do you have to think about in awkward-pandas?

integer: if you pick a single row from a Pandas DataFrame, you lose the index, so there should be nothing to do here.
slice: if the first tuple item (or whole slice, if not a tuple) is a Python slice object, then you want to use that same slice on the DataFrame index as on the Awkward Array. Only the first slice in a tuple should get applied to the DataFrame index, that is.
np.newaxis/None: does not change the length, constituents, or order of the Awkward Array, so if this is the first tuple item (or whole slice), then you don't need to do anything to the DataFrame index.
...: hard to say. It might resolve to an empty slice in the first dimension or it might not. I tried looking into the Awkward code to see if I could make a general statement (based on either purelist_depth or minmax_depth), but it's complicated. If ... is the first element of a tuple, I think awkward-pandas should raise NotImplementedError for now.
string: if this is the first tuple item, look at the next tuple item as the new "first". Field-selection is orthogonal to row-selection.
non-tuple Sized Iterable of strings: same thing.
non-tuple Sized Iterable of booleans or integers: the first array in the tuple should be applied to the DataFrame index. I was wondering what Pandas does with multidimensional NumPy slicing, but apparently it doesn't support it and raises an error in each case:

>>> array = np.array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[np.array([2, 0, 0, 1])]
array([2.2, 0. , 0. , 1.1])
>>> array[np.array([[2, 0], [0, 1]])]
array([[2.2, 0. ],
       [0. , 1.1]])

>>> df = pd.DataFrame({"x": [0.0, 1.1, 2.2, 3.3, 4.4]})
>>> df.loc[np.array([2, 0, 0, 1])]
     x
2  2.2
0  0.0
0  0.0
1  1.1
>>> df.loc[np.array([[2, 0], [0, 1]])]
...
ValueError: Cannot index with multidimensional key
>>> df.iloc[np.array([[2, 0], [0, 1]])]
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
>>> df[np.array([[2, 0], [0, 1]])]
...
KeyError: "None of [Index([(2, 0), (0, 1)], dtype='object')] are in the [columns]"

So while there's no output to produce, be sure to not apply a multidimensional NumPy array to the Awkward Array and then do nothing to the DataFrame index: this needs to raise an error.

ak.Array with only an option-type of booleans or integers, ?bool or ?int: if the only "awkward" thing about the boolean or integer array is that it has some missing values in it, the booleans and integers select as in NumPy but None maps to None. (This is also what pyarrow's take does.)

>>> array = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[ak.Array([False, True, True, None, True])]
<Array [1.1, 2.2, None, 4.4] type='4 * ?float64'>
>>> array[ak.Array([2, 0, 0, None, 1])]
<Array [2.2, 0, 0, None, 1.1] type='5 * ?float64'>

Since the None values are being passed through to the output, they should still be indexed in the output DataFrame/Series. That's straightforward for the array of booleans: just ak.fill_none them as True before slicing the index.
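
A rough sketch of that boolean case, assuming the slicer has already been converted to a Python list (with an actual Awkward Array, ak.fill_none as named above would do the filling step). The index labels are invented for illustration:

```python
import numpy as np
import pandas as pd

index = pd.Index([10, 11, 12, 13, 14])    # labels of the column being sliced
mask = [False, True, True, None, True]    # ?bool slicer; a None yields a None in the output

# Treat missing entries as True so the None rows in the output keep their label.
filled = np.array([True if m is None else m for m in mask], dtype=bool)

new_index = index[filled]                 # one label per output element, None rows included
```

The resulting index has four labels, matching the four elements (one of them None) that the sliced array would contain.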

For an array of option-type integers, it's not clear to me what should happen. The output needs to have None where there is None in the slicer, but there might not be any None values in the sliced array, and hence no index to say is associated with that None. Is there a way to make a Pandas Index with missing values? If so, all of the None values in the slicer could map to None values in the output Index.

Maybe this case can be NotImplementedError? I'm not so sure, because it's easy to end up with option-type arrays in Awkward. For instance, if you do argmin or argmax.

ak.Array of lists of X: this slice is only successful if the length of the slicer (number of lists) is equal to the length of the sliced array, so there is nothing to do to the DataFrame index. The output index can be identical to the sliced array's index.
ak.Array of option-type lists of X: although this maps missing values in the slicer to missing values in the output, the slicer has to have the same length as the sliced array, there's a one-to-one correspondence between them, and so the output index can be identical to the sliced array's index.

Those are all the cases! The only hard ones are ... (Ellipsis), which is rare and has to be deliberately given by a user, so it can be NotImplementedError, and an array of option-type integers, which is more likely to arise naturally in an analysis and should probably be addressed somehow.
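
The non-Awkward cases in the list above could be sketched roughly as follows. This is not the awkward-pandas implementation: index_slicer and _is_field_selector are hypothetical helper names, and Awkward Array slicers (option-type or nested lists) are deliberately left out. The helper returns something that can be applied to the first dimension of the DataFrame index, or raises where the list says to.

```python
import numpy as np

def _is_field_selector(item):
    """True for a field name or a non-empty list of field names."""
    if isinstance(item, str):
        return True
    return isinstance(item, list) and len(item) > 0 and all(isinstance(x, str) for x in item)

def index_slicer(items):
    """Hypothetical helper: reduce an Awkward-style slicer to a 1-D index slicer."""
    if not isinstance(items, tuple):
        items = (items,)
    # Field selection is orthogonal to row selection: skip leading field selectors.
    i = 0
    while i < len(items) and _is_field_selector(items[i]):
        i += 1
    if i == len(items):
        return slice(None)                # only fields were selected; rows are unchanged
    first = items[i]
    if first is Ellipsis:
        raise NotImplementedError("leading Ellipsis is not supported")
    if first is None:                     # np.newaxis is None: rows are unchanged
        return slice(None)
    if isinstance(first, slice):
        return first                      # same slice applies to the index
    if isinstance(first, (int, np.integer)) and not isinstance(first, bool):
        return int(first)                 # picks a single row
    arr = np.asarray(first)
    if arr.ndim != 1:
        raise ValueError("expected a one-dimensional array slicer")
    if arr.dtype.kind not in "biu":
        raise TypeError("array slicer must contain booleans or integers")
    return arr
```

For the slicer from the original report, index_slicer((slice(None), 0)) gives slice(None), so the filtered index would be carried over unchanged.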

jpivarski commented 1 year ago

> I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).

I meant scikit-hep/uproot5#803! That's what happened. It probably also explains why I didn't see an automatic cross-link there.

martindurant commented 1 year ago

Answering this bit:

> Is there a way to make a Pandas Index with missing values?

Yes you can, but you shouldn't! This is one of the many weird pandas cases. The most general form of the index is just like any other series; it doesn't need to be unique, ordered, or satisfy any other particular condition. Indexing using such a series will be slow, and the presence of None (or pd.NA or nan...) will probably break something.
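
A small illustration of how permissive the Index constructor is (the values are arbitrary):

```python
import pandas as pd

with_missing = pd.Index([0, None, 2])   # accepted; the None is held as a missing value
duplicated = pd.Index([1, 1, 2])        # repeated labels are accepted too

has_missing = with_missing.hasnans      # True: the index contains a missing label
unique = duplicated.is_unique           # False: label 1 appears twice
```

Neither construction raises, which is the "you can" part; the cost shows up later, when lookups against such an index become slow or ambiguous.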
