_values_for_argsort and _values_for_factorize
Do we need both?
I went back through https://github.com/pandas-dev/pandas/pull/20361, but it isn't very illuminating. My early version used _values_for_argsort, but https://github.com/pandas-dev/pandas/pull/20361/commits/30941cb2afe36f3cd78801a2e046a4114773e107 changed it to _values_for_factorize. I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.
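To make the contrast concrete, a minimal sketch using the nullable Int64 array (both hooks are private API, so the exact return values may differ across pandas versions):

import pandas as pd

arr = pd.array([2, None, 1], dtype="Int64")

# _values_for_factorize returns (values, na_value); the values must be hashable
values, na_value = arr._values_for_factorize()

# _values_for_argsort returns just a sortable ndarray, with no NA sentinel
sortable = arr._values_for_argsort()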
What characteristics should _ndarray_values have? Is it needed? (#32412)
This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.
What should _from_sequence accept?
Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values. We had a recent issue (perhaps around groupby?) that was caused by _from_sequence being too permissive.
What should fillna accept? (#22954, #32414)
I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonably be cast to float.
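To make the question concrete (the behaviour is deliberately left open here, since it is exactly what is being discussed):

import pandas as pd

arr = pd.array([1, None], dtype="Int64")
arr.fillna(1)    # clearly fine: 1 is a valid Int64 scalar
arr.fillna(1.5)  # the open question: raise? cast to float? cast to object?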
Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.
What characteristics should _ndarray_values have? Is it needed? (#32412)
This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.
By my count we have 6 places left where we use _ndarray_values. #32646 removes two, #32538 removes one, #32476 removes one, and #32641 deprecates one. I'll be making a PR for the last one shortly. Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.
I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.
That is referenced indirectly in the _values_for_factorize docstring; it should probably be made explicit.
The ExtensionIndex code is making an implicit assumption about _values_for_argsort that should probably be made explicit: the ordering should be preserved not just within left._values_for_argsort() but also with right._values_for_argsort() (when right.dtype matches left.dtype).
Unless we can find a compelling reason to keep both _values_for_argsort and _values_for_factorize, I'd advocate dropping one of them.
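That cross-array assumption could be expressed as a property-style check, roughly like this (a sketch using the private hook on pandas' Int64 array; equal lengths and no NAs assumed for simplicity):

import numpy as np
import pandas as pd

def check_cross_array_order(left, right):
    # ordering of the raw EAs should match ordering of the transformed values
    assert left.dtype == right.dtype and len(left) == len(right)
    expected = np.asarray(left < right, dtype=bool)
    result = left._values_for_argsort() < right._values_for_argsort()
    np.testing.assert_array_equal(result, expected)

check_cross_array_order(pd.array([1, 5, 3], dtype="Int64"),
                        pd.array([2, 4, 3], dtype="Int64"))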
What should _from_sequence accept?
Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values
I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.
As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.
Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.
+1 on "drop-in replacement for 1D ndarray"; my understanding is that the E in EA is about "extending np.ndarray"
A couple other potential guiding principles (not mutually exclusive):
@jbrockmendel thanks for gathering all those
I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.
Some additional, assorted thoughts about this aspect:
- _values_for_factorize works with an NA sentinel (and I don't think the NA sentinel needs to sort correctly), while _values_for_argsort does not. Do we assume that missing values in argsort are smallest? At least in sorting.py::nargsort, we also still compute a mask, so there we don't make any assumption about it.
- For _values_for_factorize, I think we should also consider making it an option to return (values, mask) in addition to (values, na_sentinel), in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need support from the factorize algos for masks, though).
- In GeoPandas we implement _values_for_factorize, because geometries can be factorized, but we do not implement _values_for_argsort, because we consider geometries as "not sortable" (the hashed values from _values_for_factorize of course have a certain order, but eg Series.sort_values currently raises an error). Having those two methods makes this distinction possible.
That said: we could probably indicate "not-orderable" in another way as well, if needed. And I am also not fully sure we did the best thing in GeoPandas, as being able to sort still has value for non-orderable values (similar to how unordered categories have no min/max, but can still be sorted). It's certainly good to think about whether we can combine the two.
Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values
I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.
Yes, I think that in many places where we now use _from_sequence, we should be more strict and only allow actual scalars (a related issue here is https://github.com/pandas-dev/pandas/issues/31108).
However, it's probably still useful to have the less strict version as well. Eg in pd.array(..) with a specified dtype, you can be more liberal in the conversion.
So one option might be to keep _from_sequence as is, to be used for example in pd.array, and introduce a new _from_scalars that has the more strict behaviour (only accepting actual scalars). And by default that could use _from_sequence for backwards compatibility.
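A minimal sketch of that first option, using the hypothetical _from_scalars name from this comment (the scalar check shown is illustrative, not a spec):

import pandas as pd
from pandas.api.extensions import ExtensionArray

class MyArray(ExtensionArray):
    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        # strict constructor: only the dtype's scalar type or NA is allowed
        if not all(isinstance(x, dtype.type) or x is pd.NA for x in scalars):
            raise TypeError("expected scalars of the array's type")
        # default falls back to the permissive path for backwards compatibility
        return cls._from_sequence(scalars, dtype=dtype)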
Another option might be to let this more general conversion be handled by the new astype machinery (https://github.com/pandas-dev/pandas/issues/22384), since you could see a general conversion from list-like as in pd.array as an astype from object to the specific dtype. And then we can keep _from_sequence for the strict conversion.
(but the first option is probably easier short term, since refactoring astype is a bigger topic)
Aside, we should probably have defined methods like _from_sequence/_from_scalars on the dtype instead of the array.
What should fillna accept? (#22954, #32414)
I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonably be cast to float.
Personally, I think we should adopt the rule of fillna being strictly about filling values, and thus never changing the dtype.
If that is the case, then I think it is up to the array to accept anything it wants as fill value, as long as that can be converted to something that fits in the array (or to be strict and only accept scalars, if the EA implementor prefers that).
It might be we need something similar to _can_hold_element for other places to know what values to accept.
By my count we have 6 places left where we use _ndarray_values ... Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.
I am not that optimistic we can actually fully get rid of it without some replacement (there are cases where we need an ndarray). Although in some cases _values_for_factorize could maybe be used instead, eg for the Index engine once we support general EAs in Index.
For the "bigger" topics discussed here (like the factorize/argsort values), it might be worth to start separate the discussion?
It looks like we have a special Categorical._values_for_rank used in factorize, do we need this more generally?
It looks like we have a special Categorical._values_for_rank used in factorize, do we need this more generally?
Or putting this the other way around: could rank rather use _values_for_argsort instead? (I suppose this _values_for_rank from Categorical predates the EA interface work)
As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.
This would break fletcher quite a bit, as nearly all scalar types we support are also types pandas supports natively, so pd.array would always return the pandas.array type, not the fletcher one.
Also there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic. For example in the datetime case, one would like to support datetime.datetime, np.datetime and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.
I think I would be able to supply a long list here, had I made further progress with fletcher. The important thing here is that it should be optional to override these in the ExtensionArray. For a lot of possible implementations of ExtensionArray, the defaults are probably better than what someone would write, but in the Arrow case it would quite often make sense to overwrite them, as there someone already went the extra mile and did a more efficient implementation that was tailored to Arrow's bitmask setup.
One more thing that troubles me currently is that the dt and str accessors only work on the pandas-internal types. It would also be nice to optionally implement some methods from them in the ExtensionArray.
I really like this. This was an amazing thing for being able to implement the interface.
(also: thanks for the mention; I sadly don't have the bandwidth to watch the pandas repo but am always interested in giving input on EAs)
cc @crepererum who works on https://github.com/JDASoftwareGroup/rle-array
Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic.
Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?
Also there was a mention here that _from_sequence should only work for ExtensionDtype.type; this is also problematic. For example in the datetime case, one would like to support datetime.datetime, np.datetime and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.
Can you clarify how that would be problematic?
And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array. Not in general construction functions like pd.array(.., dtype=). For that, something less strict is still needed (see my comment above about either introducing a separate _from_scalars, or letting this general conversion be handled by astype).
Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic.
Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?
Yes and no, we could though specify it currently as the union of dict, list, int, float, bytes, str, datetime, np.datetime, pd.Timestamp and some more. But I'm not sure whether this would bring any valuable information into the interface, as it is a very broad range of types.
fletcher is here the extreme case, where we have FletcherContinuousArray._from_sequence, which is not actually bound to a specific dtype yet. Depending on the scalars you pass in, you get a different ExtensionDtype instance.
Can you clarify how that would be problematic? And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array.
Currently for all results of an operation on a FletcherArray, we would also return a FletcherArray again. In some cases, though, it has a different type. For example we have __equals__(fletcher[int], fletcher[int]) -> fletcher[bool] or __div__(fletcher[int], fletcher[int]) -> fletcher[float], always to carry on the valid mask into the results.
But I'm not sure whether this would bring any valuable information into the interface as it a very broad range of types.
And I also don't think this should be the goal / point of ExtensionDtype.type
fletcher is here the extreme case, where we have FletcherContinuousArray._from_sequence, which is not actually bound to a specific dtype yet. Depending on the scalars you pass in, you get a different ExtensionDtype instance.
_from_sequence gets the dtype object passed, so even if the implementation is not bound to a dtype-specific class, it can use the dtype information to return the correct object.
Currently for all results of an operation on a FletcherArray, we would also return a FletcherArray again.
That's not where _from_sequence would be used. There are occasions where you do not return a FletcherArray, such as getitem with a scalar indexer or a reduction function, right?
It is only for those cases that _from_sequence would be used (again, the strict version would only be used for that; for general construction we still need something else).
To put it another way: generally speaking, _from_sequence (or a new, stricter _from_scalars) should be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent).
So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime, etc?
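That round trip is easy to verify on a built-in EA (a sketch with Int64 standing in for a third-party array; _from_sequence is private API):

import numpy as np
import pandas as pd

ea = pd.array([1, 2, pd.NA], dtype="Int64")
rebuilt = type(ea)._from_sequence(np.asarray(ea, dtype=object), dtype=ea.dtype)
assert rebuilt.equals(ea)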
So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime, etc?
Yes
So for the "_from_sequence strictness" discussion, assuming we agree that we need a strict version for certain use cases, I mentioned two options above, to be summarized as:
1) Keep _from_sequence as is, and add a new _from_scalars method that is more strict (and that in the base class can initially call _from_sequence for backwards compatibility). We can use _from_scalars in those cases where we need the strict version, and keep using _from_sequence elsewhere (eg in pd.array(.., dtype=))
2) Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
Are there preferences? (or other options?)
From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases the flexible version will initially still be used as fallback until the update is done).
The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually.
I think my preference is for _from_sequence to be the "parse this sequence into an array" version, and for _from_scalars to be the strict version.
Are there preferences? (or other options?)
I would rather avoid adding another constructor (I'm already not wild about _from_sequence_of_strings).
IIRC a big part of the reason why _from_sequence became less strict than originally intended is because DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular ndarray[i8], which by strict rules shouldn't be accepted. Most of the internal usages of those methods have now been removed, so it should be viable to tighten those three _from_sequence methods. If we do that, what will be left in the way of making _from_sequence strict(er)?
We still need a "non-strict" version for general construction (so for the same reason that our internal datetime-like ones were not strict). For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.
For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.
Maybe something like fromtype that astype can defer to? There should also be some relationship between fromtype and _from_factorized.
Well, see my option number 2 above (https://github.com/pandas-dev/pandas/issues/32586#issuecomment-603840140), which proposes to defer this to the astype machinery. So that will involve some way to have a casting method from a certain dtype to the EA dtype in question.
There should also be some relationship between fromtype and _from_factorized
I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users).
Regarding the "_values_for_factorize / _values_for_argsort / can they be used more internally" discussion:
I just realized that we actually should not use them internally, ideally, according to our current EA interface (so we should fix the few cases where we started using them in joining code).
The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:
Of course _values_for_factorize still has a default (astype(object)) implementation, but that can be very non-optimal for external EAs. So ideally, for anything factorize-related, we should actually always call the EA.factorize() method.
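For reference, the base class factorize is roughly this shape (a simplified sketch of the code in pandas/core/arrays/base.py; _factorize_array is internal API and the keyword names have shifted across versions):

from pandas.core.algorithms import _factorize_array

def factorize(self, na_sentinel=-1):
    # default implementation built on the two hooks
    arr, na_value = self._values_for_factorize()
    codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, na_value=na_value)
    uniques = self._from_factorized(uniques, self)
    return codes, uniques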
Eg in the merging code where it was recently introduced, we should check if we can use factorize directly (which I suppose is possible, since we need to factorize there for the merge algo). In hash_array, where we have used it for a longer time already, it might be less obvious to avoid (didn't check the code).
Fletcher is an example of an external EA that implements factorize and not _values_for_factorize.
The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:
This is a fair point, especially in light of the fletcher example.
But then why do we need _values_for_factorize and _from_factorized at all? We can require factorize and comment that the vff/ff pattern is a simple implementation.
so we should fix the few cases we started using them in joining code
If you're referring to ExtensionIndex._get_engine_target, that uses _values_for_argsort
But then why do we need _values_for_factorize and _from_factorized at all? We can require factorize and comment that the vff/ff pattern is a simple implementation.
Yes, that's also possible, and I think we discussed that at some point, but we opted for also having _values_for_factorize to avoid needing everyone to copy-paste (and slightly edit) the base class factorize implementation. It's only a few lines of code so not that much of a problem, but it is using internal API. So we would need to expose the functionality of _factorize_array somewhere (eg in pd.factorize()).
If you're referring to ExtensionIndex._get_engine_target, that uses _values_for_argsort
No, I am referring to where _values_for_factorize is used:
Now, I think for argsort / _values_for_argsort we meant to have the same dual option, so the discussion also applies to that, I think.
So we should think about whether we can always do with EA.factorize() / EA.argsort(), or whether we have use cases for actually getting an ndarray of hashable/sortable values (that are not the integers from factorize), to see if we should actually add something to the EA interface like _values_for_factorize/argsort, but required (or make one of those two required).
No, I am referring to where _values_for_factorize is used [in core.reshape.merge]
Got it, thanks. IIRC the first draft of the PR that did that used _values_for_argsort, so that remains an option. You're right though that using a public method would be preferable.
if we have use cases for actually getting an ndarray of hashable/sortable values (and that are not the integers from factorize),
I think _get_engine_target is pretty close to what you're describing. Or did you have something else in mind?
Using _values_for_argsort (_vfa) for _get_engine_target (_get) makes some implicit assumptions about _vfa that we should consider making explicit:
- (ea1 <= ea2) == (ea1._vfa() <= ea2._vfa()) when the dtypes match; note that e.g. bad_vfa = good_vfa() - good_vfa().min() would argsort correctly within a single array but violate this
- _vfa commutes with __getitem__, so ea._vfa()[key] == ea[key]._vfa() (a quick spot-check below)
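The second property is easy to spot-check on a built-in EA (again, private API):

import pandas as pd

ea = pd.array([3, 1, 2], dtype="Int64")
key = [2, 0]
assert (ea._values_for_argsort()[key] == ea[key]._values_for_argsort()).all()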
Sorry for joining this late, but here are some comments from the perspective of rle-array, which is a very simple example of how EAs can be used for transparent and efficient compression. Some of it is more a question or a wish, so please take it with a grain of salt. Also, thanks for the whole initiative :)
It would be nice if we could indeed be a little bit more concrete at many places and/or provide some more helpers in the layer around EAs. For example:
# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)
# type conversion should be handled by EA?
ea[np.array([True, False], dtype=object)]
# what about NAs?
ea[np.array([True, False, None], dtype=object)]
# and different types of ints
ea[np.array([0, 1], dtype=np.int64)]
ea[np.array([0, 1], dtype=np.uint64)]
ea[np.array([0, 1], dtype=np.int8)]
# should floats be casted if they are exact integers?
ea[np.array([0, 1], dtype=np.float64)]
# what about the massive amount of slices?
ea[:]
ea[2:15:2]
ea[-15:-1:-2]
ea[::-1]
Same goes for many of the other _-methods mentioned in the discussion above.
For example, rle-array returns run-length-encoded bools for comparisons. Does this really work everywhere?
# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)
mask = ea < 20
assert isinstance(mask, ExtensionArray)
df[mask] = ...
I sometimes have the impression that the builtin EAs are somewhat special (e.g. BooleanArray) and you cannot really extend all types used by pandas.
I second what @xhochy already said: It would be nice if the base EA class would provide many methods that are based on a very small interface that the subclass must implement, but that at the same time could be overridden. rle-array would definitely use this power to make many implementations way more efficient.
That said, it is important that the test suite keeps testing the actual downstream EA and not the example/default implementations in the base class.
It was the case that the EA data was silently cast to object-ndarrays by some DF methods. Not sure if this is still the case, but for the RLE use case this means that users suddenly might run out of memory. In general, it would be nice if EAs could control the casting a little more strictly. rle-array issues a performance warning every time this happens.
At least pre-1.0 pandas had bad groupby-aggregate performance when it came to exotic EAs, because it tried 2 different fast paths before falling back to the actual python "slow path" (which for RLE can be quite fast). It would be great if EAs could avoid this trial-and-error and directly choose the correct code path.
This is very special to RLE, but groupby operations in pandas use unstable sorting (even elements within the same group are re-shuffled), which works great for plain-old-numpy-arrays but is very painful for non-array-based EAs.
Would be great to have a test suite like this one and some help from the EA baseclass.
Accessors (.dt, .str)
Also, as @xhochy already wrote, it would be nice if this could be factored into some kind of baseclass or auto-implementation as well, so EA authors don't need to re-implement every single date and string speciality.
For me, it is not really clear at which granularity EAs should be implemented. rle-array uses a single class for all payload types, which as far as I understand the EA docs should be ok. fletcher does the same, but the builtin types use multiple array types, and it is not clear to me why this is done.
Pandas checks if views are correctly implemented (at least to a certain degree). But: views are really complex, especially when you have an entire tree of views (multiple nesting levels and branches). Do we really require that EAs implement the whole protocol?
@crepererum w/r/t all of the issues with ea[foo], we are implicitly assuming numpy-like behavior w/r/t the shape of what is returned, but authors have pretty wide latitude otherwise. Making this more explicit is probably worthwhile.
I sometimes have the impression that the builtin EAs are somewhat special (e.g. BooleanArray) and you cannot really extend all types used by pandas.
We definitely have older code that treats some internal EAs specially, but we're trying to get away from that pattern. If you have examples of where this is causing trouble/ambiguity, please don't hesitate to open an issue.
It would be great if EAs could avoid this trial-and-error and directly choose the correct code path.
[sigh] yah, this is just a hard problem.
Thanks a lot for the feedback @crepererum !
Types
It would be nice if we could indeed be a little bit more concrete at many places and/or provide some more helpers in the layer around EAs.
All of the examples you gave are related to __getitem__, and for this specific example, we actually just added a helper function (check_array_indexer) to preprocess the indexer:
Now, that is still targeted at handling our own internal integer and boolean dtypes, but in principle it should generalize for other EAs (eg for boolean dtypes, if your dtype returns True for a _is_boolean check, and the EA implements a to_numpy method, it should work for external EAs as well).
And of course, this check_array_indexer helper function will only be helpful for the case where you can accept numpy arrays as indexer (which might be problematic in case of an rle boolean array). The helper function is mainly targeted at the case where your EA is implemented under the hood using numpy arrays, so the function is preprocessing any getitem input to the correct form to index a numpy array (eg for my own use case in GeoPandas, this is very helpful).
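For illustration, a numpy-backed EA's getitem might use the helper like this (check_array_indexer is public under pandas.api.indexers; self._ndarray is a hypothetical backing array, and scalar keys are omitted for brevity):

from pandas.api.indexers import check_array_indexer

def __getitem__(self, key):
    # normalize lists, slices, and integer/boolean (extension) arrays into
    # something that can index the backing numpy array directly
    key = check_array_indexer(self, key)
    return type(self)(self._ndarray[key])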
I know you mentioned that those were only a few examples and you could give others, but it would be helpful to have concrete examples of things other than getitem.
And it would also be helpful to know if the above-mentioned check_array_indexer helper function is actually useful / is the kind of thing you were thinking about when asking for more helpers.
Object Casting
It was the case that the EA data was silently cast to object-ndarrays by some DF methods.
That will probably still be the case in some functionality, but in principle that should almost always be considered a bug / not-implemented feature / something missing in the EA interface, and reports are welcome.
Number of EA types
There is no hard requirement here, and you are free to implement either multiple classes or a single parametrized class. So it is indeed OK to do this as you did in rle-array. For the integer dtypes we indeed have multiple classes. Personally, I find that cleaner for this specific case, so that in pandas we can keep parametrized dtypes for things that don't change the actual physical storage (the number of bits per value), but rather how things are interpreted (the resolution or timezone of datetime64, the freq of PeriodDtype). But eg CategoricalDtype is already deviating from this since it can be backed by different integer dtypes for the codes, so there is no hard rule.
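For illustration, the single-parametrized-class style might look roughly like this (hypothetical names; the real rle-array code is more involved):

import numpy as np
from pandas.api.extensions import ExtensionDtype

class RLEDtype(ExtensionDtype):
    """One dtype class parametrized by the payload type."""

    def __init__(self, subtype):
        self._subtype = np.dtype(subtype)

    @property
    def type(self):
        return self._subtype.type

    @property
    def name(self):
        return f"rle[{self._subtype}]"

    @classmethod
    def construct_array_type(cls):
        return RLEArray  # hypothetical: one array class for all payload types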
NumPy Interop
Would be great to have a test suite like this one and some help from the EA baseclass.
We do have base class tests for reductions (https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py), which also cover those reduction methods for the Series case. But I assume you specifically mean the np. variants? That's indeed not tested right now in the base tests, but I think it would be welcome to add.
(although the extra numpy keywords make this a bit more complicated)
Views
Pandas checks if views are correctly implemented (at least to a certain degree). But: views are really complex,
Can you clarify which aspect of view is difficult to implement for your use case? (basically it should "just" return a new EA object, pointing to the same underlying data, i.e. the rle array in your case)
To what extent it is required (or how problematic it would be if this method just returned self), I would need to check again where we actually use this.
Thanks a lot for the feedback. I'll try to file bugs for any of the mentioned cases (e.g. unwanted object casts) when I encounter them.
And of course, this check_array_indexer helper function will only be helpful for the case where you can accept numpy arrays as indexer...
I would say it is always useful:
| EA type | Indexer type | What to do |
| --- | --- | --- |
| RLE | RLE | use special fast-path (could still internally use helper to normalize RLE payload) |
| RLE | other | use helper to normalize input, then maybe use RLE fast-path |
| other | RLE | use helper (target is uncompressed anyway) |
| other | other | use helper |
I know you mentioned that those were only a few examples and you could give others, but it would be helpful to have concrete examples of things other than getitem. And it would also be helpful to know if the above-mentioned check_array_indexer helper function is actually useful / is the kind of thing you were thinking about when asking for more helpers.
- _from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)?
- _from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use the widest type found in the data, should python date objects be cast to numpy, ...)
- __setitem__ and fillna: for the value (not the indexer), what is accepted? Scalars, ok. Lists, ok. Any iterable? For numpy objects, must dim+shape be checked? What about pandas objects?

We do have base class tests for reductions (https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py), which also cover those reduction methods for the Series case. But I assume you specifically mean the np. variants?
yes
Can you clarify which aspect of view is difficult to implement for your use case? (basically it should "just" return a new EA object, pointing to the same underlying data, i.e. the rle array in your case)
- Returning a view can be cheap for some EAs (where a __getitem__ call is fast), but can be more complicated for others. E.g. for RLE-EAs, the master copy ("underlying data") is also RL-encoded, and a slice/view might point to the middle of a run. Now modifying the data must be done carefully so that all other views are handled correctly. Or in other words: making the views themselves efficient while still ensuring that all write operations are propagated correctly.
- Say view = array_with_billion_of_elements[1:10] and the refcount of array_with_billion_of_elements drops to 0. Are the billion elements still there because view is a view and the user did not call copy, or should unreferenced elements be pruned?

One more that has popped up in a few places:
What should _concat_same_type do for EAs when it has matching types but non-matching dtypes? This comes up in DatetimeArray and Categorical.
I'm not sure; perhaps _concat_same_type shouldn't handle those cases. Ideally we have a method that's called only when we can confidently concat without losing any information.
https://github.com/pandas-dev/pandas/issues/22994#issuecomment-436279333 is a proposed protocol for getting the concat method from a set of dtypes.
_from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)? _from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use the widest type found in the data, should python date objects be cast to numpy, ...)
We were actually discussing that in this issue, see eg https://github.com/pandas-dev/pandas/issues/32586#issuecomment-603840140, about the "strictness" of _from_sequence.
But the with/without dtype distinction was not yet really touched upon, thanks for bringing that up. We should also clarify what is expected regarding that.
For continuing the "_from_sequence strictness" discussion, I moved part of the content above to a dedicated issue: https://github.com/pandas-dev/pandas/issues/33254
I'm not sure; perhaps _concat_same_type shouldn't handle those cases. Ideally we have a method that's called only when we can confidently concat without losing any information.
I'm inclined to push it all into the EA method. ATM we have special-case logic for this in DatetimeBlock and CategoricalBlock (and indirectly, in dtypes.concat). I don't see any other way to de-special-case pandas-internal EAs.
we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case.
One more suggestion for these: the minimal logic/methods such that we don't need ExtensionBlock subclasses for pandas-internal EAs (cc @jorisvandenbossche IIRC you advocated something like this at the last sprint)
To split off another topic from this issue, I opened a dedicated issue for the _values_for_argsort vs _values_for_factorize vs _ndarray_values discussion -> https://github.com/pandas-dev/pandas/issues/33276
re _concat_same_type, there are three broad cases that SomeEA._concat_same_type may see:
1) matching types and matching dtypes.
2) matching types but different dtypes (PA with different freqs, DTA with different tzs, Categorical with different categories, ...)
3) At least one instance of our class, but no assurance on the others (e.g. Categorical._concat_same_type aliases concat_categorical, which handles all kinds of things).
ATM DTA and PA raise on anything but case 1. Categorical allows any of these. IntervalArray requires matching closed but doesn't check for matching dtypes otherwise (I doubt this is intentional).
I expect long-term we'll want EA to have some ability to determine how it concats with other types, which will require some form of delegation like what Tom described. More immediately I think we should clarify whether _concat_same_type is expected to handle case 2 or just case 1.
For reference, regarding the concatenation (concat, append, etc) of ExtensionArrays, there is a new proposal -> https://github.com/pandas-dev/pandas/pull/33607/
This introduces a mechanism for the ExtensionDtype to determine a common dtype from a set of input dtypes. All arrays get cast to this common dtype, and then the all-same-dtype arrays get concatenated (with the existing EA._concat_same_type in the case of extension arrays, otherwise np.concatenate).
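From the EA author's side, that hook looks roughly like this (method name as in the proposal; the compatibility logic shown is illustrative):

from pandas.api.extensions import ExtensionDtype

class MyDtype(ExtensionDtype):
    def _get_common_dtype(self, dtypes):
        # illustrative: keep this dtype only if every input dtype is
        # compatible with it; returning None falls back to object dtype
        if all(isinstance(dtype, MyDtype) for dtype in dtypes):
            return self
        return None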
This came up in #36400: I propose adding default implementations of where and putmask to EA.
Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values?
Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values?
That would be convenient, but putmask allows for values that are listlike with len(values) != len(array).
But still the correct length for mask? Like:
In [30]: arr = np.array([1, 2, 3])
In [31]: arr[arr > 1] = [10, 10]
In [32]: arr
Out[32]: array([ 1, 10, 10])
But still the correct length for mask ? Like:
No. AFAICT it's unrestricted. numpy will either truncate or repeat.
OK, but do we have any use case internally for having the values truncated or repeated?
We have Index.putmask and Block.putmask both of which expect np.putmask-like mechanics
We have Index.putmask and Block.putmask both of which expect np.putmask-like mechanics
Can you give an example where this specific numpy behaviour of truncating/repeating array-like values is used?
Doing a search for putmask, I get the following usages.
BlockManager.putmask is used in:
Block.putmask is used in (apart from BlockManager.putmask):
- fillna (on eg a Series with an array-like)

Index.putmask is used in:
Index.putmask is itself also a public method, but thus a method we hardly use ourselves (and I also personally never used it in practice or seen someone use it in analysis code). It's also only available on Index (eg not on Series). I suppose it tries to mimic numpy, but also deviates in several ways (it's a method while a numpy array doesn't have such a method, it returns a copy instead of working inplace, it upcasts instead of raising). I think this could also be arguments to actually deprecate it as a public Index method (which is independent of the question to add it to EA of course).
So putmask actually has another aspect of its behaviour that was not yet clear to me: it also masks the other values. So that is what makes it different from array[mask] = values.
In [43]: arr = np.array([1, 2, 3, 4])
In [44]: np.putmask(arr, arr > 2, np.array([11, 12, 13, 14]))
In [45]: arr
Out[45]: array([ 1, 2, 13, 14])
which is basically equivalent (in case the other values have the same length as the array) to: array[mask] = values[mask].
And I suppose it is this behaviour of putmask that might be used in eg update or where or replace.
Now, even with that, does that make putmask basically:
import numpy as np
from pandas.api.extensions import ExtensionArray

def putmask(array: ExtensionArray, mask, values):
    # if values is an equal-length array-like, apply the mask to it as well
    if isinstance(values, (np.ndarray, ExtensionArray)) and len(values) == len(array):
        array[mask] = values[mask]
    else:
        array[mask] = values
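Used on a built-in EA, the sketch would behave like this (Int64 chosen purely for illustration):

import numpy as np
import pandas as pd

arr = pd.array([1, 2, 3, 4], dtype="Int64")
putmask(arr, np.array([False, False, True, True]),
        pd.array([11, 12, 13, 14], dtype="Int64"))
# arr is now [1, 2, 13, 14]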
If it's basically the above, then I am not sure it warrants a method in the EA interface? (would there be a reason that an EA author would like to override this?)
As mentioned on the call today, there are a handful of methods missing from EA if we want to stop special-casing pandas-internal EAs.
1) replace - CategoricalBlock dispatches to Categorical.replace. I'm pretty sure this is the cause of #33484.
2) _format_native_types - DTBlock/TDBlock dispatch to_native_types to the EA's _format_native_types
3) _can_hold_element - ATM EABlock._can_hold_element always returns True. For many of our EAs we now have a _validate_setitem_value that we can check for _can_hold_element
4) is_view
5) quantile
6) fillna - DTBlock casts to object on failure (as do non-EA blocks). We should have a single behavior.
I'd like to move forward on de-special-casing our internal EAs, particularly Categorical.replace. To do that, I'd refactor parts of Block.replace and use that to implement a default implementation for ExtensionArray.replace.
Any thoughts/objections?
+1 on adding methods where appropriate and replace makes a lot of sense
replace sounds indeed like a good candidate as an EA method. I think eg for ArrowStringArray, we will also want to use a custom implementation.
I'm actually finding that if we get a correct _can_hold_element, we may not need a replace. (though this may depend on getting a correct putmask; not yet 100% sure if _can_hold_element gives that for free)
This is as good a time as any to revisit the "experimental" EA interface.
My read of the Issues and recollection of threads suggests there are three main groups of topics:
Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) _values_for_argsort and _values_for_factorize
2) What characteristics should _ndarray_values have? Is it needed? (#32412) [edit: _ndarray_values has been removed]
3) What should _from_sequence accept?
4) Should __iter__ return native types? (#29738)

Ndarray Compat
^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA
- #31199 size
- #32342 "T" (just the most recent; this has come up a lot)
- #24583 ravel
6) For arithmetic we're going to need something like either tile or broadcast_to

Methods Needed/Wanted For Index/Series/DataFrame/Block
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7) Suggested Methods (partial list)
- #27264 duplicated
- #28955 apply
- #23179 map
- #22680 hasnas
I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)
cc @pandas-dev/pandas-core @xhochy