intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Slicing awkward column of DataFrame view not behaving as expected #27

Open ast0815 opened 1 year ago

ast0815 commented 1 year ago

Hello,

I hope the title is somewhat correct.

What I tried to do is select the first element of an awkward column in a Pandas DataFrame and create a new column with just those elements. Because there are entries with 0 elements, I filtered those out first:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
print(df)

Unfortunately, in this case the output is not as expected, and the last entries are all just the same number repeated over and over:

      NMuon    Muon_Px    Muon_Py    Muon_Pz
0         2 -52.899456 -11.654672  -8.160793
1         1  -0.816459 -24.404259  20.199968
2         2  48.987831 -21.723139  11.168285
3         2  22.088331 -85.835464  403.84845
4         2  45.171322  67.248787 -89.695732
...     ...        ...        ...        ...
2416      1  23.913206 -35.665077  54.719437
2417      1  23.913206 -35.665077  54.719437
2418      1  23.913206 -35.665077  54.719437
2419      1  23.913206 -35.665077  54.719437
2420      1  23.913206 -35.665077  54.719437

[2362 rows x 4 columns]

If I use reset_index in between, it works as expected:

df = structured_tree.arrays(["NMuon", "Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")
df = df[df.NMuon > 0]
df = df.reset_index()
df["Muon_Px"] = df.Muon_Px.ak[:,0]
df["Muon_Py"] = df.Muon_Py.ak[:,0]
df["Muon_Pz"] = df.Muon_Pz.ak[:,0]
print(df)
      index  NMuon    Muon_Px    Muon_Py     Muon_Pz
0         0      2 -52.899456 -11.654672   -8.160793
1         1      1  -0.816459 -24.404259   20.199968
2         2      2  48.987831 -21.723139   11.168285
3         3      2  22.088331 -85.835464   403.84845
4         4      2  45.171322  67.248787  -89.695732
...     ...    ...        ...        ...         ...
2357   2416      1 -39.285824 -14.607491    61.71579
2358   2417      1  35.067146 -14.150043  160.817917
2359   2418      1 -29.756786 -15.303859   -52.66375
2360   2419      1    1.14187   63.60957  162.176315
2361   2420      1  23.913206 -35.665077   54.719437

[2362 rows x 5 columns]

It seems to me like the weird behaviour of the first instance is a bug. At least it is pretty unexpected. Or maybe I am just using the accessor wrong. I did not find an example of how to do what I want in the documentation.

I am using the latest version of awkward and awkward-pandas.

This is where I first discussed this issue in the uproot context: https://github.com/scikit-hep/uproot5/discussions/803

martindurant commented 1 year ago

I see the issue, but I'm not certain what to do about it yet. The index of the df after filtering looks like

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            2411, 2412, 2413, 2414, 2415, 2416, 2417, 2418, 2419, 2420],
           dtype='int64', length=2362)

i.e., the integer row labels that matched the filter. When extracting .ak[:, 0], you get a new series with a fresh default index (effectively it has been index-reset).
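
The label alignment behind this can be reproduced with plain pandas. The sketch below uses made-up values and an ordinary float Series as a stand-in for the result of .ak[:, 0]. With plain float columns, unmatched labels become NaN rather than a repeated value, so it illustrates the misalignment only, not the exact output reported above.

```python
import pandas as pd

# Toy frame standing in for the filtered DataFrame (values are made up).
df = pd.DataFrame({"n": [2, 0, 1, 1]})
df = df[df.n > 0].copy()          # index labels are now [0, 2, 3]

# Stand-in for the result of .ak[:, 0]: same length, fresh index [0, 1, 2].
first = pd.Series([10.0, 20.0, 30.0])

# Column assignment aligns on labels, not on positions.
df["by_label"] = first            # label 0 -> 10.0, label 2 -> 30.0, label 3 -> NaN

# Re-attaching the filtered index restores the positional correspondence.
df["by_position"] = pd.Series(first.to_numpy(), index=df.index)
print(df)
```

This is also why reset_index makes the original example work: after the reset, labels and positions coincide.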

We can cope with simple cases of such indexing, as in the example diff below. However, as @jpivarski will tell you, the set of things you might pass to Awkward's getitem is complex (e.g., field strings, where the order of column selection and row selection can commute), and it's not entirely clear we can cover them all.

--- a/src/awkward_pandas/accessor.py
+++ b/src/awkward_pandas/accessor.py
@@ -46,11 +46,20 @@ class AwkwardAccessor:

     @property
     def array(self):
         return self.extarray._data

-    def __getitem__(self, *items):
-        ds = self.array.__getitem__(*items)
-        return pd.Series(AwkwardExtensionArray(ds))
+    def __getitem__(self, items):
+        """Extract components using awkward indexing"""
+        ds = self.array.__getitem__(items)
+        index = None
+        if items[0]:
+            if (
+                    not isinstance(items[0], str)
+                    and not (isinstance(items[0], list) and isinstance(items[0][0], str))
+            ):
+                index = self._obj.index[items[0]]
+        return pd.Series(AwkwardExtensionArray(ds), index=index)

jpivarski commented 1 year ago

Is this a follow-up from scikit-hep/awkward#803?

I'll try to think of a way to get an "index slicer" from an arbitrary slice of an Awkward Array, so that you have something to apply to the Pandas index. It would have to turn any slicer into a one-dimensional slicer somehow.

ast0815 commented 1 year ago

> Is this a follow-up from scikit-hep/awkward#803?

No, at least not from my side. I came here from https://github.com/scikit-hep/uproot5/discussions/803 because I thought it makes more sense here than as a discussion in uproot.

martindurant commented 1 year ago

Is it a coincidence that both are 803? :)

I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.

jpivarski commented 1 year ago

Oh, yes, Uproot is using awkward-pandas! I'm sorry that I didn't mention this—I didn't realize that I hadn't. This was a follow-up project from @kkothari2001. I considered it a prerequisite for implementing uproot.dask with library="pd" (which isn't done yet), because that would have been much harder if we had to support Uproot's complicated "exploding" of structures into DataFrames. Now Uproot just checks a TBranch to see if it's simple enough to be NumPy and wraps it as a standard Pandas column if it is, as an awkward-pandas column if it is not. When dask-awkward and awkward-pandas work well together and with dask-dataframe, then uproot.dask would have a natural implementation for library="pd".

agoose77 commented 1 year ago

> Is it a coincidence that both are 803? :)
>
> I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.

Yes, I think there's been some confusion in authoring links; @ast0815 filed this after our discussion in uproot. I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).

jpivarski commented 1 year ago

There is a well-defined set of slicer types that are accepted by Awkward Array. It hasn't been written down in documentation, but it all happens in two places:

https://github.com/scikit-hep/awkward/blob/89f7686aeb242b4729994028331c9e4f7f309ab5/src/awkward/contents/content.py#L495-L613

https://github.com/scikit-hep/awkward/blob/89f7686aeb242b4729994028331c9e4f7f309ab5/src/awkward/_slicing.py#L129-L201

(The second is for handling tuple items.)

This list is:

So, which ones do you have to think about in awkward-pandas?

>>> array = np.array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[np.array([2, 0, 0, 1])]
array([2.2, 0. , 0. , 1.1])
>>> array[np.array([[2, 0], [0, 1]])]
array([[2.2, 0. ],
       [0. , 1.1]])

>>> df = pd.DataFrame({"x": [0.0, 1.1, 2.2, 3.3, 4.4]})
>>> df.loc[np.array([2, 0, 0, 1])]
     x
2  2.2
0  0.0
0  0.0
1  1.1
>>> df.loc[np.array([[2, 0], [0, 1]])]
...
ValueError: Cannot index with multidimensional key
>>> df.iloc[np.array([[2, 0], [0, 1]])]
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
>>> df[np.array([[2, 0], [0, 1]])]
...
KeyError: "None of [Index([(2, 0), (0, 1)], dtype='object')] are in the [columns]"

So while there's no output to produce, be sure to not apply a multidimensional NumPy array to the Awkward Array and then do nothing to the DataFrame index: this needs to raise an error.
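
A minimal sketch of such a guard, assuming a hypothetical helper named slice_index (not part of awkward-pandas) that applies a row slicer to the Pandas index by position:

```python
import numpy as np
import pandas as pd


def slice_index(index, slicer):
    """Hypothetical helper: apply a one-dimensional row slicer to a Pandas index."""
    if isinstance(slicer, np.ndarray) and slicer.ndim > 1:
        # A multidimensional integer array has no one-dimensional index equivalent.
        raise ValueError("cannot apply a multidimensional array slicer to the index")
    return index[slicer]


index = pd.Index([10, 11, 12, 13, 14])
picked = slice_index(index, np.array([2, 0, 0, 1]))   # labels 12, 10, 10, 11

try:
    slice_index(index, np.array([[2, 0], [0, 1]]))
    raised = False
except ValueError:
    raised = True
```

Only the NumPy-array case is handled here; the other slicer types would need their own branches.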

>>> array = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[ak.Array([False, True, True, None, True])]
<Array [1.1, 2.2, None, 4.4] type='4 * ?float64'>
>>> array[ak.Array([2, 0, 0, None, 1])]
<Array [2.2, 0, 0, None, 1.1] type='5 * ?float64'>

Since the None values are being passed through to the output, they should still be indexed in the output DataFrame/Series. That's straightforward for the array of booleans: just ak.fill_none them as True before slicing the index.

For an array of option-type integers, it's not clear to me what should happen. The output needs to have None where there is None in the slicer, but there might not be any None values in the sliced array, and hence no index to say is associated with that None. Is there a way to make a Pandas Index with missing values? If so, all of the None values in the slicer could map to None values in the output Index.

Maybe this case can be NotImplementedError? I'm not so sure, because it's easy to end up with option-type arrays in Awkward. For instance, if you do argmin or argmax.

Those are all the cases! The only hard ones are ... (Ellipsis), which is rare and has to be deliberately given by a user, so it can be NotImplementedError, and an array of option-type integers, which is more likely to arise naturally in an analysis and should probably be addressed somehow.

jpivarski commented 1 year ago

> I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).

I meant scikit-hep/uproot5#803! That's what happened. It probably also explains why I didn't see an automatic cross-link there.

martindurant commented 1 year ago

Answering this bit:

> Is there a way to make a Pandas Index with missing values?

Yes you can, but you shouldn't! This is one of the many weird pandas cases. The most general form of the index is just like any other series; it doesn't need to be unique, ordered, or satisfy any other particular condition. Indexing using such a series will be slow, and the presence of None (or pd.NA or nan...) will probably break something.
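
A small sketch with made-up values, showing that pandas accepts both a missing label and duplicated labels in an index:

```python
import numpy as np
import pandas as pd

# An index containing a missing label is accepted.
with_nan = pd.Series([2.2, 0.0, 0.0, 1.1], index=pd.Index([2.0, 0.0, np.nan, 1.0]))
print(with_nan.index.hasnans)

# Duplicate labels are accepted too; a label lookup then returns several rows.
dup = pd.Series([2.2, 0.0, 0.0, 1.1], index=[2, 0, 0, 1])
rows = dup.loc[0]
print(dup.index.is_unique, len(rows))
```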