pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.6k stars 17.9k forks source link

API: take interface for (Extension)Array-likes #20640

Closed jorisvandenbossche closed 6 years ago

jorisvandenbossche commented 6 years ago

Triggered by https://github.com/pandas-dev/pandas/pull/20582, I was looking at the take implementation in ExtensionArray and Categorical (which is already an ExtensionArray subclass) and in the rest of pandas:

To illustrate the difference with a small example:

In [9]: pd.Categorical(['a', 'b', 'c']).take([0, -1])
Out[9]: 
[a, NaN]
Categories (3, object): [a, b, c]

In [10]: pd.Series(['a', 'b', 'c']).take([0, -1])
Out[10]: 
0    a
1    c
dtype: object

This difference is a bit unfortunate IMO. If ExtensionArray.take is a public method (which it is right now), it would be nice if it has consistent behaviour with Series.take. If we agree on that, I was thinking about following options:

jreback commented 6 years ago

.take() was originally a numpy method so -1 indicates a position rather than a missing value. I don't think we can easily change this, so would make EA consistent here.

jorisvandenbossche commented 6 years ago

@jreback argued (for good reason) in https://github.com/pandas-dev/pandas/pull/20582 that having to deal with bounds checking / correct behaviour for empty arrays in each ExtensionArray is rather error prone and complex code in each array implementation.

So it would be good that we have in pandas a take implementation that does this for you (and can also be used in our own ExtensionArrays, to eg fix #20664). The current take_1d however does currently no bounds checks whatsoever.

take_1d is internally used in many cases where the indexer passed to take_1d is the result of eg get_indexer, so this ensures that the indexer is correct, and for those no out-of-bounds checks are needed.

If we want to expose our take implementation to extension array authors, I would personally create a new take functions which just calls the existing take_1d, but additionally does first bounds checking. That way, we keep on using the faster take_1d functions internally where we know bounds checks are not needed, and we can make the new take method public for external usage as well.

TomAugspurger commented 6 years ago

I would personally create a new take functions which just calls the existing take_1d, but additionally does first bounds checking.

+1.

TomAugspurger commented 6 years ago

Only other thing to add is that

If we need, then we can make ExtensionArray._take the official method that's part of the interface.

jorisvandenbossche commented 6 years ago

I don't want to break API on Categorical.take

Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take? We can deprecate instead of just breaking it. We in any case need to somehow fix Series.take, because now it dispatches to ExtensionArray.take, which gives wrong results for Series (because it has this difference in behaviour for -1).

TomAugspurger commented 6 years ago

Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take?

Probably not :)

We in any case need to somehow fix Series.take, because now it dispatches to ExtensionArray.take

Hmm, so you're saying that (on master) we've already changed Series[Categorical].take? I didn't realize that... OK, that makes me less certain on what do do here (making a _take vs. just changing ExtensionArray/Categorical.take. I'd support either I think.).

jorisvandenbossche commented 6 years ago

Hmm, so you're saying that (on master) we've already changed Series[Categorical].take

No, we didn't change this on master. It's an already existing bug (but only from 0.22, at least in 0.20 it was still working correctly). It's just that if we keep the difference in behaviour between Array.take and Series.take regarding -1, we should not just dispatch from Series.take to Array.take, but deal with this difference in behaviour. But for me that's a reason to actually align the default behaviour between Series and Array (Array.take still needs the filling behaviour with -1 which Series.take does not have, but it does not need to be the default).

jorisvandenbossche commented 6 years ago

@jreback @TomAugspurger more comments on this one?

What I am now thinking might be best:

jorisvandenbossche commented 6 years ago

The current example implementation (in the docstring) of:

           def take(self, indexer, allow_fill=True, fill_value=None):
               indexer = np.asarray(indexer)
               mask = indexer == -1

               # take on empty array not handled as desired by numpy
               # in case of -1 (all missing take)
               if not len(self) and mask.all():
                   return type(self)([np.nan] * len(indexer))

               result = self.data.take(indexer)
               result[mask] = np.nan  # NA for this type
               return type(self)(result)

would then become something like (not fully sure about the fill_value handling, need to check that):

           def take(self, indexer, allow_fill=True, fill_value=None):
               if fill_value is None:
                   fill_value = np.nan  # NA for this type
               result = pd.take(self.data, indexer, allow_fill=allow_fill, fill_value=fill_value)
               return type(self)(result)
jreback commented 6 years ago

I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.

Also a bit -1 (pun) on changing the semantics for this; honestly I don't think the negative indexing is useful and the filling missing values IS useful (but numpy has trouble with this as it will never change dtypes; we just make this work).

Instead I like your 2nd option, but expose API things that we want EA authors (and internal routines) to use that are not quite first-class top-level things (e.g. .factorize() and .unique() are useful generally in there own right).

maybe in: pandas.core.arrays.api ?

jorisvandenbossche commented 6 years ago

Instead I like your 2nd option

Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. Although they can be discussed / implemented separately, but it's not either one of the other. Because also for the second bullet point (exposed take function), we need to decide on the semantics (first bullet point).

Also a bit -1 (pun) on changing the semantics for this;

Above (https://github.com/pandas-dev/pandas/issues/20640#issuecomment-379814278) you said you were ok with making EA consistent with Series semantics. Which means changing the semantics for EA in master (but EA itself is not yet in released version, so that is no problem. Only Categorical.take is affected).

In any case, the "most" public take, Series.take, already uses numpy semantics. And in the case of Series.take, having -1 mean "missing" does not make really sense, as you then don't have a value for in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).

Currently we have different take semantics for different take methods. If we want to get them consistent, some will need to change.

I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.

The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api somewhere). The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).

TomAugspurger commented 6 years ago

+1 for aligning EA.take with Series & ndarray.take


From: Joris Van den Bossche notifications@github.com Sent: Monday, April 16, 2018 6:10:58 AM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] API: take interface for (Extension)Array-likes (#20640)

Instead I like your 2nd option

Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. Although they can be discussed / implemented separately, but it's not either one of the other. Because also for the second bullet point (exposed take function), we need to decide on the semantics (first bullet point).

Also a bit -1 (pun) on changing the semantics for this;

Above (#20640 (comment)https://github.com/pandas-dev/pandas/issues/20640#issuecomment-379814278) you said you were ok with making EA consistent with Series semantics. Which means changing the semantics for EA in master (but EA itself is not yet in released version, so that is no problem. Only Categorical.take is affected).

In any case, the "most" public take, Series.take, already uses numpy semantics. And in the case of Series.take, having -1 mean "missing" does not make really sense, as you then don't have a value for in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).

Currently we have different take semantics for different take methods. If we want to get them consistent, some will need to change.

I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.

The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api somewhere). The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/pandas-dev/pandas/issues/20640#issuecomment-381563903, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABQHIk2-RDwgfDDh6-6LRm8cBcF798K8ks5tpHxCgaJpZM4TMF9x.

jorisvandenbossche commented 6 years ago

@jreback @TomAugspurger we forgot to discuss this on the dev hangout with all the subclassing/composition discussion, but any input on this? I think we should try to decide on this API before 0.23.0.

TomAugspurger commented 6 years ago

Agreed for 0.23.0

I think we should follow ndarray.take's semantics here.

@jorisvandenbossche do you want me to work on this?

jreback commented 6 years ago

yeah ok with fixing take as suggested

jorisvandenbossche commented 6 years ago

@TomAugspurger if you have time, that would be welcome (I have only limited time today and tomorrow)

jorisvandenbossche commented 6 years ago

One additional aspect I was tinkering on (but didn't raise yet) is the allow_fill=True, fill_value=None API.

I don't really "like" it, so was trying to think if there would be a nicer way to do it (since this will be new API, we don't necessarily need to follow the internal take_1d spelling).

TomAugspurger commented 6 years ago

I suppose we could do

no_default = object()

def take(self, indexer, na_value/fill_value=no_default):
    pass

for na_value / fill_value (whatever we call) of no_default, we use Python / NumPy semantics of slicing from the right.

For other na_values, we set -1 to be NA. Should we raise for other negative indexers?

jorisvandenbossche commented 6 years ago

Yep, that's what I meant with a custom singleton. I think I would be in favor of that.

For other na_values, we set -1 to be NA. Should we raise for other negative indexers?

Yes, I think so. In case of filling, only -1 should be allowed as negative value.

TomAugspurger commented 6 years ago

One slight issue with fill_value. Let's consider DecimalArray. We'd like its fill value be Decimal('NaN'), instead of np.nan. How would we get that information to DecimalArray.take?

Currently Series[DecimalArray].reindex() passes through fill_value=np.nan, since that's the default in algorithms.take_nd. Are we OK with

I'll see if that breaks anything.

TomAugspurger commented 6 years ago

To clarify, that would allow EAs to define a default fill_value as a class attribute. We could make the default np.nan.

That default value would probably go on the type?

TomAugspurger commented 6 years ago

Hmm, this doesn't look too promising.

We may just have to tell EA-authors that when you're expected to do pandas-style take, you're getting np.nan for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan to Decimal('NaN').)

jorisvandenbossche commented 6 years ago

Currently Series[DecimalArray].reindex() passes through fill_value=np.nan, since that's the default in algorithms.take_nd

I don't know if it is relevant, but reindex does already do the "correct" thing although its default fill_value is not honored:

In [23]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [24]: arr = DecimalArray(make_data())

In [27]: s = pd.Series(arr[:3])

In [29]: s.reindex([2, 3])
Out[29]: 
2    0.31849019906500730670018128876108676195144653...
3                                                  NaN
dtype: decimal

In [30]: s.reindex([2, 3]).values
Out[30]: 
DecimalArray(array([Decimal('0.318490199065007306700181288761086761951446533203125'),
       Decimal('NaN')], dtype=object))

I suppose this is because currenlty DecimalArray.take does not honor the fill_value keyword.

Hmm, this doesn't look too promising.

We may just have to tell EA-authors that when you're expected to do pandas-style take, you're getting np.nan for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan to Decimal('NaN').)

That might indicate that the current API of allow_fill=True/False, fill_value=.. is actually not that bad? As then pandas can say: we want pandas-style take with filling without specifying a fill_value (which would then mean that they can use their default).

TomAugspurger commented 6 years ago

Here's the new implementation for DecimalArray.take

    def take(self, indexer, fill_value=_no_default):
        indexer = np.asarray(indexer)
        if fill_value is _no_default:
            # NumPy style
            result = self.data.take(indexer)
        else:
            if isinstance(fill_value, numbers.Real) and np.isnan(fill_value):
                # convert the default np.nan to NaN.
                fill_value = decimal.Decimal("NaN")
            mask = indexer == -1

            # take on empty array not handled as desired by numpy
            # in case of -1 (all missing take)
            if not len(self) and mask.all():
                return type(self)([fill_value] * len(indexer))

            result = self.data.take(indexer)
            result[mask] = fill_value

        return type(self)(result)

The bit I don't like is the convert np.nan to NaN. It means we can't do decimal_array.take(indexer, fill_value=np.nan), and actually get np.nan as the fill value (if that's a thing you want to do)...

But this does satisfy the requirements I think.

In [14]: s = pd.Series(DecimalArray([1, 2, 3]))

In [15]: s.take([0, 1, -1])
Out[15]:
0    1
1    2
2    3
dtype: decimal

In [16]: s.reindex([0, 1, -1])
Out[16]:
 0      1
 1      2
-1    NaN
dtype: decimal

I'll put up a WIP PR so it's easier to discuss.

jorisvandenbossche commented 6 years ago

IMO we should not put this in DecimalArray.take (which means: require extension authors to do something similar), but move that code (more or less) to a public take (but that can be longer term improvement of course. Most important on short term is public API on ExtensionArrays itself). Eg in the case of geopandas, I would like to use such a function on my integer array with fill_value of 0.

jorisvandenbossche commented 6 years ago

The bit I don't like is the convert np.nan to NaN

Is that because pandas will typically call EA.take with a fill_value=np.nan ? In which case almost every EA will need to just ignore that default? (or convert it like you do above)

TomAugspurger commented 6 years ago

IMO we should not put this in DecimalArray.take

Agreed. I'm trying to scope out how things should work. Then I'll work on a new take method in pandas that EA authors can use to make the implementation easier.

Is that because pandas will typically call EA.take with a fill_value=np.nan

Precisely. Specifically algos.take_nd is calling EA.take with whatever is passed to it (if anything). I briefly looked at changing the default for algos.take_nd, but things became messy.