Closed jorisvandenbossche closed 6 years ago
.take()
was originally a numpy method so -1
indicates a position rather than a missing value. I don't think we can easily change this, so would make EA consistent here.
@jreback argued (for good reason) in https://github.com/pandas-dev/pandas/pull/20582 that having to deal with bounds checking / correct behaviour for empty arrays in each ExtensionArray is rather error prone and complex code in each array implementation.
So it would be good that we have in pandas a take
implementation that does this for you (and can also be used in our own ExtensionArrays, to eg fix #20664).
The current take_1d
however does currently no bounds checks whatsoever.
take_1d
is internally used in many cases where the indexer
passed to take_1d
is the result of eg get_indexer
, so this ensures that the indexer is correct, and for those no out-of-bounds checks are needed.
If we want to expose our take
implementation to extension array authors, I would personally create a new take
functions which just calls the existing take_1d
, but additionally does first bounds checking.
That way, we keep on using the faster take_1d
functions internally where we know bounds checks are not needed, and we can make the new take
method public for external usage as well.
I would personally create a new take functions which just calls the existing take_1d, but additionally does first bounds checking.
+1.
Only other thing to add is that
If we need, then we can make ExtensionArray._take
the official method that's part of the interface.
I don't want to break API on Categorical.take
Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take
?
We can deprecate instead of just breaking it.
We in any case need to somehow fix Series.take
, because now it dispatches to ExtensionArray.take, which gives wrong results for Series (because it has this difference in behaviour for -1).
Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take?
Probably not :)
We in any case need to somehow fix Series.take, because now it dispatches to ExtensionArray.take
Hmm, so you're saying that (on master) we've already changed Series[Categorical].take
? I didn't realize that... OK, that makes me less certain on what do do here (making a _take
vs. just changing ExtensionArray/Categorical.take
. I'd support either I think.).
Hmm, so you're saying that (on master) we've already changed Series[Categorical].take
No, we didn't change this on master. It's an already existing bug (but only from 0.22, at least in 0.20 it was still working correctly). It's just that if we keep the difference in behaviour between Array.take and Series.take regarding -1, we should not just dispatch from Series.take to Array.take, but deal with this difference in behaviour. But for me that's a reason to actually align the default behaviour between Series and Array (Array.take still needs the filling behaviour with -1 which Series.take does not have, but it does not need to be the default).
@jreback @TomAugspurger more comments on this one?
What I am now thinking might be best:
take
implementation follow numpy semantics by default (meaning -1 gets converted to a positive indexer and does not indicate missingness). Consequences:
allow_fill
/fill_values
keywords to optionally use -1 as missing indicatortake
function
take_1d
, but does conversion of negative to positive indices by default, does bounds checking, ..factorize
and unique
) and can be used by extension array authors to apply it on their underlying values if possible.The current example implementation (in the docstring) of:
def take(self, indexer, allow_fill=True, fill_value=None):
indexer = np.asarray(indexer)
mask = indexer == -1
# take on empty array not handled as desired by numpy
# in case of -1 (all missing take)
if not len(self) and mask.all():
return type(self)([np.nan] * len(indexer))
result = self.data.take(indexer)
result[mask] = np.nan # NA for this type
return type(self)(result)
would then become something like (not fully sure about the fill_value
handling, need to check that):
def take(self, indexer, allow_fill=True, fill_value=None):
if fill_value is None:
fill_value = np.nan # NA for this type
result = pd.take(self.data, indexer, allow_fill=allow_fill, fill_value=fill_value)
return type(self)(result)
I don't think we need to elevate .take
to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc
) can do.
Also a bit -1 (pun) on changing the semantics for this; honestly I don't think the negative indexing is useful and the filling missing values IS useful (but numpy has trouble with this as it will never change dtypes; we just make this work).
Instead I like your 2nd option, but expose API things that we want EA authors (and internal routines) to use that are not quite first-class top-level things (e.g. .factorize()
and .unique()
are useful generally in there own right).
maybe in: pandas.core.arrays.api
?
Instead I like your 2nd option
Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. Although they can be discussed / implemented separately, but it's not either one of the other.
Because also for the second bullet point (exposed take
function), we need to decide on the semantics (first bullet point).
Also a bit -1 (pun) on changing the semantics for this;
Above (https://github.com/pandas-dev/pandas/issues/20640#issuecomment-379814278) you said you were ok with making EA consistent with Series semantics. Which means changing the semantics for EA in master (but EA itself is not yet in released version, so that is no problem. Only Categorical.take is affected).
In any case, the "most" public take
, Series.take
, already uses numpy semantics. And in the case of Series.take
, having -1 mean "missing" does not make really sense, as you then don't have a value for in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).
Currently we have different take
semantics for different take
methods. If we want to get them consistent, some will need to change.
I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.
The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take
function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api
somewhere).
The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).
+1 for aligning EA.take with Series & ndarray.take
From: Joris Van den Bossche notifications@github.com Sent: Monday, April 16, 2018 6:10:58 AM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] API: take interface for (Extension)Array-likes (#20640)
Instead I like your 2nd option
Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. Although they can be discussed / implemented separately, but it's not either one of the other. Because also for the second bullet point (exposed take function), we need to decide on the semantics (first bullet point).
Also a bit -1 (pun) on changing the semantics for this;
Above (#20640 (comment)https://github.com/pandas-dev/pandas/issues/20640#issuecomment-379814278) you said you were ok with making EA consistent with Series semantics. Which means changing the semantics for EA in master (but EA itself is not yet in released version, so that is no problem. Only Categorical.take is affected).
In any case, the "most" public take, Series.take, already uses numpy semantics. And in the case of Series.take, having -1 mean "missing" does not make really sense, as you then don't have a value for in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).
Currently we have different take semantics for different take methods. If we want to get them consistent, some will need to change.
I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.
The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api somewhere). The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/pandas-dev/pandas/issues/20640#issuecomment-381563903, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABQHIk2-RDwgfDDh6-6LRm8cBcF798K8ks5tpHxCgaJpZM4TMF9x.
@jreback @TomAugspurger we forgot to discuss this on the dev hangout with all the subclassing/composition discussion, but any input on this? I think we should try to decide on this API before 0.23.0.
Agreed for 0.23.0
I think we should follow ndarray.take's semantics here.
@jorisvandenbossche do you want me to work on this?
yeah ok with fixing take as suggested
@TomAugspurger if you have time, that would be welcome (I have only limited time today and tomorrow)
One additional aspect I was tinkering on (but didn't raise yet) is the allow_fill=True, fill_value=None
API.
I don't really "like" it, so was trying to think if there would be a nicer way to do it (since this will be new API, we don't necessarily need to follow the internal take_1d
spelling).
allow_fill
default will be False), which doesn't feel as the best API design.fill_value
of None is actually a valid argument (as fill value), so cannot be used to indicate that there should be no filling (unless we use a custom singleton instead of None for the default)fill_value
a bit confusing, and I seem to remember we had a similar discussion on ExtensionArray related PR about fill_value
vs na_value
reindex
also use fill_value
I suppose we could do
no_default = object()
def take(self, indexer, na_value/fill_value=no_default):
pass
for na_value / fill_value (whatever we call) of no_default, we use Python / NumPy semantics of slicing from the right.
For other na_values, we set -1 to be NA. Should we raise for other negative indexers?
Yep, that's what I meant with a custom singleton. I think I would be in favor of that.
For other na_values, we set -1 to be NA. Should we raise for other negative indexers?
Yes, I think so. In case of filling, only -1 should be allowed as negative value.
One slight issue with fill_value
. Let's consider DecimalArray
. We'd like its fill value be Decimal('NaN')
, instead of np.nan
. How would we get that information to DecimalArray.take
?
Currently Series[DecimalArray].reindex()
passes through fill_value=np.nan
, since that's the default in algorithms.take_nd
. Are we OK with
take_nd
's fill_value
.take_nd
's fill_value to ExtensionArray.take
if it isn't the default?I'll see if that breaks anything.
To clarify, that would allow EAs to define a default fill_value
as a class attribute. We could make the default np.nan
.
That default value would probably go on the type?
Hmm, this doesn't look too promising.
We may just have to tell EA-authors that when you're expected to do pandas-style take
, you're getting np.nan
for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan
to Decimal('NaN')
.)
Currently Series[DecimalArray].reindex() passes through fill_value=np.nan, since that's the default in algorithms.take_nd
I don't know if it is relevant, but reindex
does already do the "correct" thing although its default fill_value
is not honored:
In [23]: from pandas.tests.extension.decimal.array import DecimalArray, make_data
In [24]: arr = DecimalArray(make_data())
In [27]: s = pd.Series(arr[:3])
In [29]: s.reindex([2, 3])
Out[29]:
2 0.31849019906500730670018128876108676195144653...
3 NaN
dtype: decimal
In [30]: s.reindex([2, 3]).values
Out[30]:
DecimalArray(array([Decimal('0.318490199065007306700181288761086761951446533203125'),
Decimal('NaN')], dtype=object))
I suppose this is because currenlty DecimalArray.take
does not honor the fill_value
keyword.
Hmm, this doesn't look too promising.
We may just have to tell EA-authors that when you're expected to do pandas-style take, you're getting np.nan for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan to Decimal('NaN').)
That might indicate that the current API of allow_fill=True/False, fill_value=..
is actually not that bad? As then pandas can say: we want pandas-style take with filling without specifying a fill_value (which would then mean that they can use their default).
Here's the new implementation for DecimalArray.take
def take(self, indexer, fill_value=_no_default):
indexer = np.asarray(indexer)
if fill_value is _no_default:
# NumPy style
result = self.data.take(indexer)
else:
if isinstance(fill_value, numbers.Real) and np.isnan(fill_value):
# convert the default np.nan to NaN.
fill_value = decimal.Decimal("NaN")
mask = indexer == -1
# take on empty array not handled as desired by numpy
# in case of -1 (all missing take)
if not len(self) and mask.all():
return type(self)([fill_value] * len(indexer))
result = self.data.take(indexer)
result[mask] = fill_value
return type(self)(result)
The bit I don't like is the convert np.nan to NaN
. It means we can't do decimal_array.take(indexer, fill_value=np.nan)
, and actually get np.nan
as the fill value (if that's a thing you want to do)...
But this does satisfy the requirements I think.
In [14]: s = pd.Series(DecimalArray([1, 2, 3]))
In [15]: s.take([0, 1, -1])
Out[15]:
0 1
1 2
2 3
dtype: decimal
In [16]: s.reindex([0, 1, -1])
Out[16]:
0 1
1 2
-1 NaN
dtype: decimal
I'll put up a WIP PR so it's easier to discuss.
IMO we should not put this in DecimalArray.take
(which means: require extension authors to do something similar), but move that code (more or less) to a public take
(but that can be longer term improvement of course. Most important on short term is public API on ExtensionArrays itself).
Eg in the case of geopandas, I would like to use such a function on my integer array with fill_value of 0.
The bit I don't like is the convert np.nan to NaN
Is that because pandas will typically call EA.take
with a fill_value=np.nan
?
In which case almost every EA will need to just ignore that default? (or convert it like you do above)
IMO we should not put this in DecimalArray.take
Agreed. I'm trying to scope out how things should work. Then I'll work on a new take
method in pandas that EA authors can use to make the implementation easier.
Is that because pandas will typically call EA.take with a fill_value=np.nan
Precisely. Specifically algos.take_nd
is calling EA.take with whatever is passed to it (if anything). I briefly looked at changing the default for algos.take_nd
, but things became messy.
Triggered by https://github.com/pandas-dev/pandas/pull/20582, I was looking at the
take
implementation in ExtensionArray and Categorical (which is already an ExtensionArray subclass) and in the rest of pandas:ExtensionArray.take
currently uses the "internal pandas"-like behaviour for take:-1
is an indicator for missing value (the behaviour we need for reindexing etc)Series.take
actually uses the numpy behaviour, where negative values (including-1
) start counting from the end of the array-like.To illustrate the difference with a small example:
This difference is a bit unfortunate IMO. If
ExtensionArray.take
is a public method (which it is right now), it would be nice if it has consistent behaviour withSeries.take
. If we agree on that, I was thinking about following options:ExtensionArray.take
private for now (eg require a_take
method for the interface) and keep the "internal pandas"-like behaviourExtensionArray.take
default behaviour consistent withSeries.take
, but still have theallow_fill
/fill_value
arguments so that when they are specified it has the "internal pandas"-like behavour (so that internal code that expects this behaviour which already passes those keywords keeps working)