Open ast0815 opened 1 year ago
I see the issue, but I'm not certain what to do about it yet. The index of the df after filtering looks like
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
2411, 2412, 2413, 2414, 2415, 2416, 2417, 2418, 2419, 2420],
dtype='int64', length=2362)
i.e., the integer rows which matched the filter. When extracting .ak[:, 0], you get a new series with a fresh default index (effectively it has been index-reset).
We can cope with simple cases of such indexing, like the example diff below. However, as @jpivarski will tell you, the set of things you might pass to Awkward's getitem is complex (e.g., field strings, where the order of column selection and row selection can commute), and it's not entirely clear we can cover them all.
--- a/src/awkward_pandas/accessor.py
+++ b/src/awkward_pandas/accessor.py
@@ -46,11 +46,20 @@ class AwkwardAccessor:
     @property
     def array(self):
         return self.extarray._data
-    def __getitem__(self, *items):
-        ds = self.array.__getitem__(*items)
-        return pd.Series(AwkwardExtensionArray(ds))
+    def __getitem__(self, items):
+        """Extract components using awkward indexing"""
+        ds = self.array.__getitem__(items)
+        index = None
+        if items[0]:
+            if (
+                not isinstance(items[0], str)
+                and not (isinstance(items[0], list) and isinstance(items[0][0], str))
+            ):
+                index = self._obj.index[items[0]]
+        return pd.Series(AwkwardExtensionArray(ds), index=index)
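To see why the fresh default index matters, here is a minimal pandas-only sketch. It does not use awkward-pandas at all: the column of Python lists and the list comprehension are stand-ins (my assumption) for an awkward column and for `.ak[:, 0]`. It only illustrates how label alignment goes wrong when the result carries a default index, and how reusing the filtered index, as the diff does for simple cases, avoids that.

```python
import pandas as pd

# Stand-in for an awkward column: plain Python lists, one of them empty.
df = pd.DataFrame({"hits": [[1.5, 2.5], [], [3.5], [4.5, 5.5]]})
filtered = df[df["hits"].map(len) > 0]  # surviving index labels: 0, 2, 3

# Stand-in for `.ak[:, 0]`: the first element of every remaining row.
first_values = [h[0] for h in filtered["hits"]]

fresh = pd.Series(first_values)                       # default index 0, 1, 2
kept = pd.Series(first_values, index=filtered.index)  # labels 0, 2, 3

# Column assignment aligns on labels, not on position.
misaligned = filtered.assign(first=fresh)
aligned = filtered.assign(first=kept)

print(misaligned)
print(aligned)
```

With the default index, label 2 picks up the value that belongs to label 3, and label 3 finds no partner at all. The exact symptom in the original report may differ, but the mechanism is the same kind of label mismatch.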
Is this a follow-up from scikit-hep/awkward#803?
I'll try to think of a way to get an "index slicer" from an arbitrary slice of an Awkward Array, so that you have something to apply to the Pandas index. It would have to turn any slicer into a one-dimensional slicer somehow.
> Is this a follow-up from scikit-hep/awkward#803?

No, at least not from my side. I came here from https://github.com/scikit-hep/uproot5/discussions/803 because I thought it makes more sense here than as a discussion in uproot.
Is it a coincidence that both are 803? :)
I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.
Oh, yes, Uproot is using awkward-pandas! I'm sorry that I didn't mention this; I didn't realize that I hadn't. This was a follow-up project from @kkothari2001. I considered it a prerequisite for implementing uproot.dask with library="pd" (which isn't done yet), because that would have been much harder if we had to support Uproot's complicated "exploding" of structures into DataFrames. Now Uproot just checks a TBranch to see if it's simple enough to be NumPy and wraps it as a standard Pandas column if it is, as an awkward-pandas column if it is not. When dask-awkward and awkward-pandas work well together and with dask-dataframe, then uproot.dask would have a natural implementation for library="pd".
> Is it a coincidence that both are 803? :)
> I am pleased that this package is actually directly used from uproot, I didn't realise that was the case.

Yes, I think there's been some confusion in authoring links; @ast0815 filed this after our discussion in uproot. I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).
There is a well-defined set of slicer types that are accepted by Awkward Array. It hasn't been written down in documentation, but it all happens in two places:
(The second is for handling tuple items.)
This list is:
- slice (including step != 1)
- np.newaxis/None (inserts a regular dimension of length 1)
- ... (inserts enough empty slices for the rest of the tuple to slice from the bottom up)

So, which ones do you have to think about in awkward-pandas?
- If it is a slice object, then you want to use that same slice on the DataFrame index as on the Awkward Array. Only the first slice in a tuple should get applied to the DataFrame index, that is.
- np.newaxis/None: does not change the length, constituents, or order of the Awkward Array, so if this is the first tuple item (or whole slice), then you don't need to do anything to the DataFrame index.
- ...: hard to say. It might resolve to an empty slice in the first dimension or it might not. I tried looking into the Awkward code to see if I could make a general statement (based on either purelist_depth or minmax_depth), but it's complicated. If ... is the first element of a tuple, I think awkward-pandas should raise NotImplementedError for now.

>>> array = np.array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[np.array([2, 0, 0, 1])]
array([2.2, 0. , 0. , 1.1])
>>> array[np.array([[2, 0], [0, 1]])]
array([[2.2, 0. ],
       [0. , 1.1]])
>>> df = pd.DataFrame({"x": [0.0, 1.1, 2.2, 3.3, 4.4]})
>>> df.loc[np.array([2, 0, 0, 1])]
     x
2  2.2
0  0.0
0  0.0
1  1.1
>>> df.loc[np.array([[2, 0], [0, 1]])]
...
ValueError: Cannot index with multidimensional key
>>> df.iloc[np.array([[2, 0], [0, 1]])]
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
>>> df[np.array([[2, 0], [0, 1]])]
...
KeyError: "None of [Index([(2, 0), (0, 1)], dtype='object')] are in the [columns]"
So while there's no output to produce, be sure to not apply a multidimensional NumPy array to the Awkward Array and then do nothing to the DataFrame index: this needs to raise an error.
- ?bool or ?int: if the only "awkward" thing about the boolean or integer array is that it has some missing values in it, the booleans and integers select as in NumPy but None maps to None. (This is also what pyarrow's take does.)

>>> array = ak.Array([0.0, 1.1, 2.2, 3.3, 4.4])
>>> array[ak.Array([False, True, True, None, True])]
<Array [1.1, 2.2, None, 4.4] type='4 * ?float64'>
>>> array[ak.Array([2, 0, 0, None, 1])]
<Array [2.2, 0, 0, None, 1.1] type='5 * ?float64'>
Since the None values are being passed through to the output, they should still be indexed in the output DataFrame/Series. That's straightforward for the array of booleans: just ak.fill_none them as True before slicing the index.
For an array of option-type integers, it's not clear to me what should happen. The output needs to have None where there is None in the slicer, but there might not be any None values in the sliced array, and hence no index label to associate with that None. Is there a way to make a Pandas Index with missing values? If so, all of the None values in the slicer could map to None values in the output Index.
Maybe this case can be NotImplementedError? I'm not so sure, because it's easy to end up with option-type arrays in Awkward. For instance, you get them from argmin or argmax.
- […] array's index.
- […] array, there's a one-to-one correspondence between them, and so the output index can be identical to the sliced array's index.

Those are all the cases! The only hard ones are ... (Ellipsis), which is rare and has to be deliberately given by a user, so it can be NotImplementedError, and an array of option-type integers, which is more likely to arise naturally in an analysis and should probably be addressed somehow.
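The cases above could be turned into an "index slicer" along the following lines. This is only a sketch under assumptions: `index_after_slicing` is a hypothetical helper, not part of awkward-pandas or Awkward Array, and it uses plain Python lists containing None as a stand-in for Awkward ?bool arrays (real code would presumably call ak.fill_none instead). Field names, jagged slicers and option-type integers are deliberately left as NotImplementedError.

```python
import numpy as np
import pandas as pd


def index_after_slicing(index, items):
    """Hypothetical helper: apply the row part of a slicer to a pandas Index."""
    # Only the first item of a tuple acts on the rows.
    first = items[0] if isinstance(items, tuple) else items

    if first is Ellipsis:
        # May or may not resolve to an empty slice in the first dimension.
        raise NotImplementedError("a leading Ellipsis is not supported")
    if first is None:
        # np.newaxis/None keeps length, constituents and order unchanged.
        return index
    if isinstance(first, slice):
        return index[first]
    if isinstance(first, list) and any(x is None for x in first):
        if not all(x is None or isinstance(x, bool) for x in first):
            # Option-type integers: no index label exists for a None entry.
            raise NotImplementedError("option-type integers are not supported")
        # None passes through to the output, so its row must keep its label.
        first = [True if x is None else x for x in first]
    if isinstance(first, (list, np.ndarray)):
        arr = np.asarray(first)
        if arr.ndim != 1:
            # Never slice the array and silently leave the index alone.
            raise ValueError("cannot apply a multidimensional slicer to the index")
        if arr.dtype.kind in "biu":  # booleans, signed and unsigned integers
            return index[arr]
    raise NotImplementedError(f"unsupported slicer: {type(first).__name__}")


index = pd.Index([0, 2, 3, 7, 9])  # e.g. the labels left after filtering rows
print(index_after_slicing(index, (slice(1, 4), 0)))
print(index_after_slicing(index, [False, True, True, None, True]))
```

Raising for everything that is not explicitly handled is a design choice here: a wrong index is harder to notice than an error.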
> I think scikit-hep/awkward#803 is unrelated (we're into the 2000s now).

I meant scikit-hep/uproot5#803! That's what happened. It probably also explains why I didn't see an automatic cross-link there.
Answering this bit:
> Is there a way to make a Pandas Index with missing values?

Yes you can, but you shouldn't! One of the many weird pandas cases. The most general form of the index is just like any other series; it doesn't need to be unique, ordered or satisfy any other particular condition. Indexing using such a series will be slow, and the presence of None (or pd.NA or nan...) will probably break something.
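For illustration, a small sketch of what "you can" looks like in practice. The values are made up; the point is only that pandas accepts an index that contains a missing entry and repeated labels.

```python
import pandas as pd

# An index with a repeated label and a missing entry is accepted as-is.
idx = pd.Index([2, 0, 0, None])
s = pd.Series([2.2, 0.0, 0.0, None], index=idx)

print(idx)
print(idx.hasnans, idx.is_unique)
print(s)
```

On the pandas versions I know of, the integer labels also get coerced to floats so that the missing entry can be stored as NaN, which is one more way such an index can surprise you.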
Hello,
I hope the title is somewhat correct.
What I tried to do is select the first element of an awkward column in a Pandas DataFrame and create a new column with just those elements. Because there are entries with 0 elements, I filtered those out before:
Unfortunately, in this case the output is not as expected, and the last entries are all just the same number repeated over and over:
If I use reset_index in between, it works as expected:
It seems to me like the weird behaviour of the first instance is a bug. At least it is pretty unexpected. Or maybe I am just using the accessor wrong. I did not find an example of how to do what I want in the documentation.
I am using the latest version of awkward and awkward-pandas.
This is where I first discussed this issue in the uproot context: https://github.com/scikit-hep/uproot5/discussions/803
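The reset_index workaround described in this post can be sketched with plain pandas. As an assumption for the sake of a self-contained example, a column of Python lists stands in for the awkward column and a list comprehension stands in for `.ak[:, 0]`. Once the filtered frame has a default index again, a result with a fresh default index lines up with it.

```python
import pandas as pd

# Stand-in for an awkward column: plain Python lists, one of them empty.
df = pd.DataFrame({"hits": [[1.5, 2.5], [], [3.5], [4.5, 5.5]]})

# Filter out the empty entries, then reset the index so the labels are 0, 1, 2.
filtered = df[df["hits"].map(len) > 0].reset_index(drop=True)

# Stand-in for `.ak[:, 0]`: first elements, carrying a default index.
first = pd.Series([h[0] for h in filtered["hits"]])

result = filtered.assign(first=first)
print(result)
```

The price of this workaround is that the original row labels are gone, so the new column can no longer be matched back to the unfiltered DataFrame by label.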