EA: revisit interface - Githubissues

jbrockmendel commented 4 years ago

This is as good a time as any to revisit the "experimental" EA interface.

My read of the Issues and recollection of threads suggests there are three main groups of topics:

Clarification of the Interface ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1) _values_for_argsort and values_for_factorize

Do we need both? The docs both say they should be order-preserving.
Is it safe to return a view? (Categorical.values_for_argsort makes a copy for no obvious reason)
What else can they be used for internally? e.g. in #32467 _values_for_argsort is used for ExtensionIndex join_non_unique and join_monotonic 2) ~~What characteristics should _ndarray_values have? Is it needed? (#32412)~~ _ndarray_values has been removed 3) What should _from_sequence accept?
Should it only be sequences that are unambiguously this dtype?
In particular, should DTA/TDA/PA not accept i8 values? 4) What should fillna accept? (#22954, #32414) 4.5) Require that __iter__ return native types? #29738

Ndarray Compat ^^^^^^^^^^^^^^^^^ 5) Headaches have been caused by trivial ndarray methods not being on EA

31199 size
32342 "T" (just the most recent; this has come up a lot)
24583 ravel

6) For arithmetic we're going to need something like either tile or broadcast_to

Methods Needed/Wanted For Index/Series/DataFrame/Block ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 7) Suggested Methods (partial list)

27264 duplicated
[x] #23437 _empty
28955 apply
23179 map
22680 hasnas
[x] #27081 equals
[x] #24144 _where
[x] _putmask would be helpful for ExtensionIndex

I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)

cc @pandas-dev/pandas-core @xhochy

TomAugspurger commented 4 years ago

_values_for_argsort and values_for_factorize

Do we need both?

I went back through https://github.com/pandas-dev/pandas/pull/20361, but it isn't very illuminating. My early version used _values_for_argsort, but https://github.com/pandas-dev/pandas/pull/20361/commits/30941cb2afe36f3cd78801a2e046a4114773e107 changed it to _values_for_factorize. I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

What characteristics should _ndarray_values have? Is it needed? (#32412)

This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.

What should _from_sequence accept?

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values. We had a recent issue (perhaps around groupby?) that was caused by _from_sequence being too permissive.

What should fillna accept? (#22954, #32414)

I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonable be cast to float.

Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray" and apply that to cases, rather than deciding case by case).

jbrockmendel commented 4 years ago

What characteristics should _ndarray_values have? Is it needed? (#32412)

This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.

By my count we have 6 places left where we use _ndarray_values. #32646 removes two, #32538 removes one, #32476 removes one, and #32641 deprecates one. The last one I'll be making a PR for shortly. Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.

I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

That is referenced indirectly in the _values_for_factorize docstring, should probably be made explicit.

The ExtensionIndex code is making an implicit assumption about values_for_argsort that should probably be made explicit: the ordering should be preserved not just within left._values_for_argsort() but also with right._values_for_argsort() (when right.dtype matches left.dtype).

Unless we can find a compelling reason to keep both _values_for_argsort and _values_for_factorize, I'd advocate dropping one of them.

What should _from_sequence accept?

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values

I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.

As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.

Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray" and apply that to cases, rather than deciding case by case).

+1 on "drop-in replacement for 1D ndarray"; my understanding is that the E in EA is about "extending np.ndarray"

A couple other potential guiding principles (not mutually exclusive):

The minimal set of methods so that Block/ExtensionIndex methods don't have to roundtrip EA->ndarray->EA
The minimal set of methods so that we're not special-casing Categorical all over the place.

jorisvandenbossche commented 4 years ago

@jbrockmendel thanks for gathering all those

I think the main difference (aside from the return type including the NA value) is that the values in _values_for_factorize need to be hashable.

Some additional, assorted thoughts about this aspect:

_values_for_factorize works with NA sentinel (and I don't think the NA sentinel needs to sort correctly), while _values_for_argsort does not. Do we assume that missing values in argsort are smallest? At least in sorting.py::nargsort, we also still compute a mask, so there we don't make any assumption about it.
For _values_for_factorize, I think we should also consider making it an option to return (values, mask) in addition to (values, na_sentinel) in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need to support of the factorize algos for masks though)
In GeoPandas, we implement _values_for_factorize, because geometries can be factorized, but we do not implement _values_for_argsort, because we consider geometries as "not sortable" (the hashed values from _values_for_factorize of course have a certain order, but eg Series.sort_values raises an error currently). Having those two methods makes this distinction possible. That said: we could probably indicate "not-orderable" in another way as well, if needed. And I am also not fully sure we did the best thing in GeoPandas, as being able to sort still has value for non-orderable values (similar to how an unordered categories has no min/max, but can still be sorted).

It's certainly good to think about if we can combine the two.

Joris has argued for a while now that it should only accept instances of ExtensionDtype.type or NA values

I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world.

Yes, I think that in many places where we now use _from_sequence, we should be more strict and only allow actual scalars (a related issue here is https://github.com/pandas-dev/pandas/issues/31108). However, it's probably still useful to have the less strict version as well. Eg in pd.array(..) with a specified dtype, you can be more liberal in the conversion.

So one option might be to keep _from_sequence as is to for example be used in pd.array, and introduce a new _from_scalars that has the more strict behaviour (only accepting actual scalars). And by default that could use _from_sequence for backwards compatibility.

Another option might be to let this more general conversion be handled by the new astype machinery (https://github.com/pandas-dev/pandas/issues/22384), since you could see a general conversion from list-like as in pd.array as an astype from object to the specific dtype. And then we can keep _from_sequence for the strict conversion. (but the first option is probably easier short term, since refactoring astype is a bigger topic)

Aside, we should probably have defined methods like _from_sequence/_from_scalars on the dtype instead of the array.

What should fillna accept? (#22954, #32414)

I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for .fillna() to result in an object dtype. The obvious exception is something like IntegerArray.fillna(1.5), which may reasonable be cast to float.

Personally, I think we should adopt the rule of fillna being strictly filling values, and thus never to change the dtype. If that is the case, then I think it is up to the array to accept anything it wants as fill value, as long as that can be converted to something that fits in the array (or being strict and only accept scalars, if the EA implementor prefers that).

It might be we need something similar like _can_hold_element for other places to know what values to accept.

By my count we have 6 places left where we use _ndarray_values ... Unless there is an objection, I'd like to get rid of _ndarray_values once we no longer use it.

I am not that optimistic we can actually fully get rid of it without some replacement (there are cases were we need an ndarray). Although in some cases _values_for_factorize could maybe be used instead, eg for the Index engine once we support general EAs in Index.

For the "bigger" topics discussed here (like the factorize/argsort values), it might be worth to start separate the discussion?

jbrockmendel commented 4 years ago

It looks like we have a special Categorical._values_for_rank used in factorize, do we need this more generally?

jorisvandenbossche commented 4 years ago

It looks like we have a special Categorical._values_for_rank used in factorize, do we need this more generally?

Or putting this the other way around: could rank rather use _values_for_argsort instead? (I suppose this _values_for_rank from Categorical predates the EA interface work)

xhochy commented 4 years ago

`_from_sequence(values)`

As a heuristic, maybe something like "cls._from_sequence(values) works if and only if pd.array(values) would return cls". The heuristic breaks down for non-internal EAs until infer_dtype can handle them.

This would break fletcher quite a bit as nearly all scalar types we support are also types pandas supports natively, so pd.array would always return the pandas.array type, not the fletcher one.

Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic. For example in the datetime case, one would like to support datetime.datetime, np.datetime and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.

Methods defined on pd.Series/DataFrame

I think I would be able to supply a long list here would I have made further progress with fletcher. The important thing here is, that it should be optional to override these in the ExtensionArray. For a lot of possible implementation of ExtensionArray, the defaults are probably better what someone would write but in the Arrow case, it would quite often make sense to overwrite them as there someone already went the extra mile and did a more efficient implementation that was tailored to Arrow's bitmask setup.

One more thing that troubles me currently is that the the dt and str accessors only work on the pandas-internal types. It would also be nice to optionally implement some methods from them in the ExtensionArray.

Test Setup

I really like this. This was an amazing thing for being able to implement the interface.

(also: thanks for the mention; I sadly don't have the bandwidth to watch the pandas repo but am always interested in giving input on EAs)

xhochy commented 4 years ago

cc @crepererum who works on https://github.com/JDASoftwareGroup/rle-array

TomAugspurger commented 4 years ago

Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic.

Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?

jorisvandenbossche commented 4 years ago

Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic. For example in the datetime case, one would like to support datetime.datetime, np.datetime and pd.Timestamp. It would be good to just leave the check for valid scalars in _from_sequence inside the ExtensionDtype.

Can you clarify how that would be problematic? And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array. Not in general construction functions like pd.array(.., dtype=). For that, something less strict is still needed (see my comment above about either introducing a separate _from_scalars, or either letting this general conversion be handled by astype).

xhochy commented 4 years ago

Also there was a mention here that _from_sequence should only work for ExtensionDtype.type, this is also problematic.

Is it possible for you to define .type as a metaclass that registers each scalar type, as in https://github.com/ContinuumIO/cyberpandas/blob/dbce13f94a75145a59d7a7981a8a07571a2e5eb6/cyberpandas/ip_array.py#L22-L29?

Yes and no, we could though specify it currently as the union of dict, list, int, float, bytes, str, datetime, np.datetime, pd.Timestamp and some more. But I'm not sure whether this would bring any valuable information into the interface as it a very broad range of types.

fletcher is here the extreme case where have FletcherContinuousArray._from_sequence which is not yet actually yet bound to a specific dtype. Depending on the scalars you pass in, you get a different ExtensionDtype instance.

Can you clarify how that would be problematic? And to be clear, if we do this change, we would only call _from_sequence in places where we know the data came from the array.

Currently for all results of an operation on a FletcherArray, we would also return a FletcherArray again. In some cases it though has a different type. For example we have __equals__(fletcher[int], fletcher[int]) -> fletcher[bool] or __div__(fletcher[int], fletcher[int]) -> fletcher[float], always to carry on the valid mask into the results.

jorisvandenbossche commented 4 years ago

But I'm not sure whether this would bring any valuable information into the interface as it a very broad range of types.

And I also don't think this should be the goal / point of ExtensionDtype.type

fletcher is here the extreme case where have FletcherContinuousArray._from_sequence which is not yet actually yet bound to a specific dtype. Depending on the scalars you pass in, you get a different ExtensionDtype instance.

_from_sequence gets the dtype object passed, so even if the implementation is not bound to a dtype-specific class, it can use the dtype information to return the correct object.

Currently for all results of an operation on a FletcherArray, we would also return a FletcherArray again.

That's not where _from_sequence would be used. There are occasions that you do not return a FletcherArray, such as getitem for a scalar indexer or a reduction function, right? It is only for those cases where _from_sequence would be used (again, the strict version would only be used for that, for general construction we still need something else).

jorisvandenbossche commented 4 years ago

To put it in another way: generally speaking, _from_sequence (or a new, stricter _from_scalars) should be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime, etc ?

xhochy commented 4 years ago

So that means that EA._from_sequence(np.asarray(EA, dtype=object), dtype=EA.dtype) (or EA._from_sequence(list(EA), dtype=EA.dtype)) should be able to reconstruct. For such a case, I don't think you need to be able to handle datetime.datetime, np.datetime, etc ?

Yes

jorisvandenbossche commented 4 years ago

So for the "_from_sequence strictness" discussion, assuming we agree that we need a strict version for certain use cases, I mentioned two options above, to be summarized as:

1) Keep _from_sequence as is, and add a new _from_scalars method that is more strict (that in the base class can call _from_sequence initially for backwards compatibility). We can use _from_scalars in those cases where we need the strict version, and keep using _from_sequence elsewhere (eg in pd.array(.., dtype=)) 2) Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases initially the flexible version will still be used as fallback until the update is done).

The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually.

TomAugspurger commented 4 years ago

I think my preference is for _from_sequence to be the "parse this sequence into an array" and for _from_scalars to be the strict version.

jbrockmendel commented 4 years ago

Are there preferences? (or other options?)

I would rather avoid having adding another constructor (im already not wild about _from_sequence_of_strings).

IIRC a big part of the reason why _from_sequence became less strict than originally intended is because DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular ndarray[i8] which by strict rules shouldnt be accepted. Most of the internal usages of those methods have now been removed, so it should be viable to tighten those three from_sequence methods. If we do that, what will be left in the way of making _from_sequence strict(er)?

jorisvandenbossche commented 4 years ago

We still need a "non-strict" version for general construction (so for the same reason that our internal datetime-like ones were not strict). For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

jbrockmendel commented 4 years ago

For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

Maybe something like fromtype that astype can defer to? There should also be some relationship between fromtype and _from_factorized

jorisvandenbossche commented 4 years ago

Well, see my option number 2 above (https://github.com/pandas-dev/pandas/issues/32586#issuecomment-603840140), which proposes to defer this to the astype machinery. So that will involve some ways to have a casting method from a certain dtype to the EA dtype in question.

There should also be some relationship between fromtype and _from_factorized

I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users).

jorisvandenbossche commented 4 years ago

Regarding the "_values_for_factorize / _values_for_argsort / can they be used for more internally" discussion. I just realized that we actually should not use them internally, ideally, according to our current EA interface (so we should fix the few cases we started using them in joining code). The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

https://github.com/pandas-dev/pandas/blob/c47e9ca8b042881d44c9e679a9bf42bacabbb732/pandas/core/arrays/base.py#L740-L747

Of course _values_for_factorize still has a default (astype(object)) implementation, but that can be very non-optimal for external EAs. So ideally, for anything factorize-related, we should actually always call the EA.factorize() method. Eg in the merging code where it was recently introduced, we should check if can use factorize directly (which I suppose is possible, since we need to factorize there for the merge algo). In hash_array, where we use it already longer time, it might be less obvious to avoid (didn't check the code).

Fletcher is an example of external EAs that implement factorize and not _values_for_factorize.

jbrockmendel commented 4 years ago

The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

This is a fair point, especially in light of the fletcher example.

But then why do we need _values_for_factorize and _from_factorize at all? We can require factorize and comment that vff/ff pattern is a simple implementation.

so we should fix the few cases we started using them in joining code

If you're referring to ExtensionIndex._get_engine_target, that use _values_for_argsort

jorisvandenbossche commented 4 years ago

But then why do we need _values_for_factorize and _from_factorize at all? We can require factorize and comment that vff/ff pattern is a simple implementation.

Yes, that's also possible, and I think we discussed that at some point, but we opted for also having the _values_for_factorize to avoid needing everyone to copy paste (and minor edit) the base class factorize implementation. It's only a few lines of code so not that much of a problem, but it is using internal API. So we would need to expose the functionality of _factorize_array somewhere (eg in pd.factorize()

If you're referring to ExtensionIndex._get_engine_target, that use _values_for_argsort

No, I am referring to where _values_for_factorize is used:

https://github.com/pandas-dev/pandas/blob/9130da9ad35ba61bc4fa3ed503a7dd8d8eb1673c/pandas/core/reshape/merge.py#L1905-L1907

Now, I think for argsort / _values_for_argsort we meant to have the same dual option, so the discussion also applies to that I think.

jorisvandenbossche commented 4 years ago

So we should think about whether we can always do with EA.factorize() / EA.argsort(), or if we have use cases for actually getting an ndarray of hashable/sortable values (and that are not the integers from factorize), to see if we should actually add something to the EA interface like _values_for_factorize/argsort, but required (or make one of those two required).

jbrockmendel commented 4 years ago

No, I am referring to where _values_for_factorize is used [in core.reshape.merge]

Got it, thanks. IIRC the first draft of the PR that did that used _values_for_argsort, so that remains an option. You're right though that using a public method would be preferable.

if we have use cases for actually getting an ndarray of hashable/sortable values (and that are not the integers from factorize),

I think _get_engine_target is pretty close to what you're describing. Or did you have something else in mind?

Using _values_for_argsort (_vfa) for _get_engine_target (_get) makes some implicit assumptions about _vfa that we should consider making explicit:

_vfa maintain order both within an array and across arrays of the same dtype. So (ea1 <= ea2) == (ea1._vfa() <= ea2._vfa())
- A _vfa that would not satisfy this might look something like bad_vfa = good_vfa() - good_vfa.min()
_vfa commutes with __getitem__, so ea._vfa()[key] == ea[key]._vfa()

crepererum commented 4 years ago

Sorry for joining that late, but here are some comments from the perspective of rle-array, which is a very simple example of how EAs could use for transparent and efficient compression. Some of it is more a question or a wish, so please take it with a grain of salt. Also, thanks for the whole initiative :)

Types

It would be nice if we could indeed be a little bit more concrete at many places and/or provide some more helpers in the layer around EAs. For example:

# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)

# type conversion should be handled by EA?
ea[np.array([True, False], dtype=object)]

# what about NAs?
ea[np.array([True, False, None], dtype=object)]

# and different types of ints
ea[np.array([0, 1], dtype=np.int64)]
ea[np.array([0, 1], dtype=np.uint64)]
ea[np.array([0, 1], dtype=np.int8)]

# should floats be casted if they are exact integers?
ea[np.array([0, 1], dtype=np.float64)]

# what about the massive amount of slices?
ea[:]
ea[2:15:2]
ea[-15:-1:-2]
ea[::-1]

Same goes for many of the other _-methods mentioned in the discussion above.

What does E in EA mean?

For example, rle-array returns run-length-encoded bools for comparisons. Does this really work everywhere?

# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)

mask = ea < 20
assert isinstance(mask, ExtensionArray)

df[mask] = ...

I sometimes have the impression that the builtin EAs are somewhat special (e.g. BooleanArray) and you cannot really extend all types used by pandas.

Sane defaults but powerful overwrites

I second what @xhochy already said: It would be nice if the base EA class would provide many methods that are based on a very small interface that the subclass must implement, but at the same time could be overwritten. rle-array definitely would use this power to make many implementations way more efficient.

That said, it is important that the test suite keeps testing the actual downstream EA and not the example/default implementations in the base class.

Object Casting

It was the case the the EA data was silently casted to object-ndarrays by some DF methods. Not sure if this still the case, but for the RLE use case this means that users suddenly might run out of memory. In general, it would be nice if EAs could control the casting a little stricter. rle-array issues a performance warning every time this happens.

EAs first vs second class

At least pre-1.0-pandas had a bad groupby-aggregate-performance when it came to exotic EAs because it tried 2 different fast paths before falling back to the actual python "slow-path" (which for RLE can be quite fast). It would be great if EAs could avoid this trail-and-error and directly choose the correct code path.

Memory Shuffles

This is very special to RLE, but groupby operations in pandas use unstable sorting (even elements within the same group are re-shuffled), which works great for plain-old-numpy-arrays but is very painful for non-array-based EAs.

NumPy Interopt

Would be great to have a test suite like this one and some help from the EA baseclass.

Special Methods (`.dt`, `.str`)

Also as @xhochy already wrote, it would be nice this could be factored into some kind of baseclass or auto-implementation as well so EA authors don't need to re-implement every single date and string speciality.

Number of EA types

For me, it is not really clear at which granularity EAs should be implemented. rle-array uses a single class for all payload types which as far as I understand the EA docs should be ok. fletcher does the same, but the builtin types use multiple array types and it is not clear to me why this is done.

Views

Pandas checks if views are correctly implemented (at least to a certain degree). But: views are really complex, especially when you have an entire tree of views (multiple nesting levels and branches). Do we really require that EAs implement the whole protocol?

jbrockmendel commented 4 years ago

@crepererum w/r/t all of the issues with ea[foo], we are implicitly assuming numpy-like behavior w/r/t the shape of what is returned, but authors have pretty wide latitude otherwise. Making this more explicit is probably worthwhile.

I sometimes have the impression that the builtin EAs are somewhat special (e.g. BooleanArray) and you cannot really extend all types used by pandas.

We definitely have older code that treats some internal EAs specially, but we're trying to get away from that pattern. If you have examples of where this is causing trouble/ambiguity, please don't hesitate to open an issue.

It would be great if EAs could avoid this trail-and-error and directly choose the correct code path.

[sigh] yah, this is just a hard problem.

jorisvandenbossche commented 4 years ago

Thanks a lot for the feedback @crepererum !

Types

It would be nice if we could indeed be a little bit more concrete at many places and/or provide some more helpers in the layer around

All of the example you gave are related to __getitem__, and for this specific example, we actually just added a helper function (check_array_indexer) to preprocess the indexer:

https://github.com/pandas-dev/pandas/blob/bdf969cd6487c9dd8d26c6810c7589bdc3cda1bc/pandas/core/indexers.py#L348-L361

Now, that is still targetted to handle our own internal integer and boolean dtypes, but in principle should generalize for other EAs (eg for boolean dtypes, if your dtype returns True for a _is_boolean check, and then the EA implements a to_numpy method, it should work for external EAs as well).

And of course, this check_array_indexer helper function will only be helpful for the case where you can accept numpy arrays as indexer (which might be problematic in case of an rle boolean array). The helper function is mainly targetted for the case where your EA is implemented under the hood using numpy arrays, so the function is preprocessing any getitem input to the correct form to index a numpy array (eg for my own usecase in GeoPandas, this is very helpful).

I know you mentioned that those were only few example and you could give other, but so it would be helpful to give concrete examples of other things than getitem. And it would also be helpful to know if the above mentioned check_array_indexer helper function is actually useful / is the kind of thing you were thinking about when asking for more helpers.

Object Casting

It was the case the the EA data was silently casted to object-ndarrays by some DF methods.

That will probably still be the case in some functionality, but in principle, that should almost always be considered as a bug / not implemented feature / something missing in the EA interface, and are welcome to be reported.

Number of EA types

There is no hard requirement here, and you are free to implement either multiple classes or a single parametrized class. So it is indeed OK to do this as you did in rle-array. For the integer dtypes we indeed have multiple classes. Personally, I find that cleaner for this specific case, so that in pandas we can keep parametrized dtypes for things that don't change the actual physical storage (the number of bits per value), but rather how things are interpreted (the resolution or timezone of datetime64, the freq of PeriodDtype). But eg CategoricalDtype is already deviating from this since it can be backed by different integer dtypes for the codes, so there is no hard rule.

NumPy Interopt

Would be great to have a test suite like this one and some help from the EA baseclass.

We do have base class tests for reductions (https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py), which also covers those reduction methods for the Series case. But I assume you specifically mean the np. variants? That's indeed not tested right now in the base test, but I think would be welcome to add. (althought the extra numpy keywords make this a bit more complicated)

Views

Pandas checks if views are correctly implemented (at least to a certain degree). But: views are really complex,

Can you clarify which aspect of view is difficult to implement for your use case (basically it should "just" return a new EA object, pointing to the same underlying data, .i.e. the rle array in your case) To what extent it is required (or how problematic it would be if this method just returned self), I would need to check again where we actually use this.

crepererum commented 4 years ago

Thanks a lot for the feedback. I'll try to file bugs for any of the mentioned cases (e.g. unwanted object casts) when I encounter them.

And of course, this check_array_indexer helper function will only be helpful for the case where you can accept numpy arrays as indexer...

I would say it is always useful:

EA type	Indexer type	what to do
RLE	RLE	use special fast-path (could still internally use helper to normalize RLE payload)
RLE	other	use helper to normalize input, then maybe use RLE fast-path
other	RLE	use helper (target is uncompressed anyway)
other	other	use helper

I know you mentioned that those were only few example and you could give other, but so it would be helpful to give concrete examples of other things than getitem. And it would also be helpful to know if the above mentioned check_array_indexer helper function is actually useful / is the kind of thing you were thinking about when asking for more helpers.

_from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)?
_from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use widest type found in data, should python date objects be casted to numpy, ...)
__setitem__ and fillna: for the value (not the indexer), what is accepted? scalars, ok. list, ok. any iterable? for numpy objects, must dim+shape be checked? what about pandas objects?

We do have base class tests for reductions (https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py), which also covers those reduction methods for the Series case. But I assume you specifically mean the np. variants?

yes

Can you clarify which aspect of view is difficult to implement for your use case (basically it should "just" return a new EA object, pointing to the same underlying data, .i.e. the rle array in your case)

projection of slices: this whole module sounds like something to me that should exist somewhere public in the python world
underlying data: this sounds extremely easy for numpy-driven arrays (also because the __getitem__ call is fast), but can be more complicated for others. e.g. for RLE-EAs, the master copy ("underlying data") is also RL-encoded and a slice/view might point to the middle of the run. now modifying the data must be done carefully so that all other views are handled correctly. Or in other words: making the views itself efficient while still ensuring that all write operations are propagated correctly.
pruning: what if you do view = array_with_billion_of_elements[1:10] and refcount of array_with_billion_of_elements drop to 0. are the billion of elements still there because view is a view and the user did not call copy or should unreferenced elements been pruned?

jbrockmendel commented 4 years ago

One more that has popped up in a few places:

What should _concat_same_type do for EAs when it has matching types but non-matching dtypes? This comes up in DatetimeArray and Categorical.

TomAugspurger commented 4 years ago

I'm not sure that concat_same_type perhaps shouldn't handle those cases. Ideally we have a method that's called only when we can confidently concat without losing any information.

https://github.com/pandas-dev/pandas/issues/22994#issuecomment-436279333 is a proposed protocol for getting the concat method from a set of dtypes.

jorisvandenbossche commented 4 years ago

_from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)?

_from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use widest type found in data, should python date objects be casted to numpy, ...)

We were actually discussing that in this issue, see eg https://github.com/pandas-dev/pandas/issues/32586#issuecomment-603840140, about the "strictness" of _from_sequence.

But the with/without dtype distinction was not yet really touched upon, thanks for bringing that up. We should also clarify what is expected regarding that.

For continuing the "_from_sequence strictness" discussion, I moved part of the content above to a dedicated issue: https://github.com/pandas-dev/pandas/issues/33254

jbrockmendel commented 4 years ago

I'm not sure that concat_same_type perhaps shouldn't handle those cases. Ideally we have a method that's called only when we can confidently concat without losing any information.

I'm inclined to pushing it all into the EA method. ATM we have special-case logic for this in DatetimeBlock and CategoricalBlock (and indirectly, in dtypes.concat). I don't see any other way to de-special-case pandas-internal EAs.

jbrockmendel commented 4 years ago

we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray" and apply that to cases, rather than deciding case by case).

One more suggestion for these: The minimal logic/methods such that we don't need ExtensionBlock subclasses for pandas-internal EAs (cc @jorisvandenbossche IIRC you advocated something like this at the last sprint)

jorisvandenbossche commented 4 years ago

To split off another topic from this issue, I opened a dedicated issue for the _values_for_argsort vs _values_for_factorize vs _ndarray_values discussion -> https://github.com/pandas-dev/pandas/issues/33276

jbrockmendel commented 4 years ago

re concat_same_type, there are three broad cases that SomeEA.concat_same_type may see:

1) matching types and matching dtypes. 2) matching types but different dtypes (PA with different freqs, DTA with different tzs, Categorical with different categories, ...) 3) At least one instance of our class, but no assurance on the others (e.g. Categorical.concat_same_type aliases concat_categorical, which handles all kinds of things.

ATM DTA and PA raise on anything but case 1. Categorical allows any of these. IntervalArray requires matching closed but doesn't check for matching dtypes otherwise (i doubt this is intentional).

I expect long-term we'll want EA to have some ability to determine how it concats with other types, which will require some form of delegation like what Tom described. More immediately I think we should clarify whether _concat_same_type is expected to handle case 2 or just case 1.

jorisvandenbossche commented 4 years ago

For reference, regarding the concatenation (concat, append, etc) of ExtensionArrays, there is a new proposal -> https://github.com/pandas-dev/pandas/pull/33607/ This introduces a mechanism for the ExtensionDtype to determine a common dtype from a set of input dtypes. And all arrays get casted to this common dtype, and then the all-same-dtype arrays get concatenated (with the existing EA._concat_same_type in case of extension arrays, otherwise np.concatenate).

jbrockmendel commented 4 years ago

This came up in #36400: i propose adding default implementations of where and putmask to EA.

jorisvandenbossche commented 4 years ago

Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values ?

jbrockmendel commented 4 years ago

Regarding putmask, what's the use case for it that cannot be covered by eg array[mask] = values ?

That would be convenient, but the putmask allows for values that are listlike with len(values) ! len(array).

jorisvandenbossche commented 4 years ago

But still the correct length for mask ? Like:

In [30]: arr = np.array([1, 2, 3])                                                                                                                                                                                 

In [31]: arr[arr > 1] = [10, 10]                                                                                                                                                                                   

In [32]: arr                                                                                                                                                                                                       
Out[32]: array([ 1, 10, 10])

jbrockmendel commented 4 years ago

But still the correct length for mask ? Like:

No. AFAICT its unrestricted. numpy will either truncate or repeat

jorisvandenbossche commented 4 years ago

OK, but do we have any use case for that internally to have the values truncated or repeated?

jbrockmendel commented 4 years ago

We have Index.putmask and Block.putmask both of which expect np.putmask-like mechanics

jorisvandenbossche commented 4 years ago

We have Index.putmask and Block.putmask both of which expect np.putmask-like mechanics

Can you give an example where this specific numpy behaviour of truncating/repeating array-like values is used?

Doing a search for putmask, I get the following usages.

BlockManager.putmask is used in:

Series.update: other value is already aligned
NDFrame._where: not fully sure if other value is already aligned, but would expect so?

Block.putmask is used in (apart from BlockManager.putmask):

Block.fillna: I think the other value here is always a scalar? (you can't do fillna on eg a Series with an array-like)
Block.replace/_replace_coerce: not fully sure about this one (but as a user I wouldn't expect my replacement value to be repeated)

Index.putmask is used in:

Index.fillna: here the value is a scalar
Index.droplevel: value is also a scalar (np.nan)

Index.putmask is itself also a public method, but thus a method we hardly use ourselves (and I also personally never used it in practice or seen someone use it in analysis code). It's also only available on Index (eg not on Series). I suppose it tries to mimic numpy, but also deviates in several ways (it's a method while a numpy array doesn't have such a method, it returns a copy instead of working inplace, it upcasts instead of raising). I think this could also be arguments to actually deprecate it as a public Index method (which is independent of the question to add it to EA of course).

jorisvandenbossche commented 4 years ago

So putmask actually has another aspect of its behaviour that was not yet clear to me: it also masks the other values. So that is what makes it different from array[mask] = values.

In [43]: arr = np.array([1, 2, 3, 4])

In [44]: np.putmask(arr, arr > 2, np.array([11, 12, 13, 14]))

In [45]: arr  
Out[45]: array([ 1,  2, 13, 14])

which is basically equivalent (in case the other values have the same length as the array) to: array[mask] = values[mask].

And I suppose it is this behaviour of putmask that might used in eg update or where or replace.

Now, even with that, does that make putmask basically:

def putmask(array: ExtensionArray, mask, values):
    if isarray(values) and len(values) == len(array):
        array[mask] = values[mask]
    else:
        array[mask] = values

If it's basically the above, then I am not sure it warrants a method in the EA interface? (would there be a reason that a EA author would like to override this?)

jbrockmendel commented 4 years ago

As mentioned on the call today, there are a handful of methods missing from EA if we want to stop special-casing pandas-internal EAs.

1) replace - CategoricalBlock dispatches to Categorical.replace. I'm pretty sure this is the cause of #33484. 2) _format_native_types - DTBlock/TDBlock dispatch to_native_types to the EA's _format_native_types 3) _can_hold_element - ATM EABlock._can_hold_element always returns True For many of our EAs we now have a _validate_setitem_value that we can check for can_hold_element 4) is_view 5) quantile 6) fillna DTBlock casts to object on failure (as do non-EA blocks). We should have a single behavior.

jbrockmendel commented 4 years ago

I'd like to move forward on de-special-casing our internal EAs, particularly Categorical.replace. To do that, I'd refactor parts of Block.replace and use that to implement a default implementation for ExtensionArray.replace.

Any thoughts/objections?

jreback commented 4 years ago

+1 on adding methods where appropriate and replace makes a lot of sense

jorisvandenbossche commented 4 years ago

replace sounds indeed like a good candidate as EA method. I think eg for ArrowStringArray, we will also want to use a custom implementation.

jbrockmendel commented 4 years ago

I'm actually finding that if we get a correct _can_hold_element we may not need a replace. (though this may depend on getting a correct putmask, not yet 100% sure if _can_hold_element gives that for free)

pandas-dev / pandas

EA: revisit interface #32586

31199 size

32342 "T" (just the most recent; this has come up a lot)

24583 ravel

27264 duplicated

28955 apply

23179 map

22680 hasnas

`_from_sequence(values)`

Methods defined on pd.Series/DataFrame

Test Setup

Types

What does E in EA mean?

Sane defaults but powerful overwrites

Object Casting

EAs first vs second class

Memory Shuffles

NumPy Interopt

Special Methods (`.dt`, `.str`)

Number of EA types

Views

Types

Object Casting

Number of EA types

NumPy Interopt

Views

pandas-dev / pandas

EA: revisit interface #32586

31199 size

32342 "T" (just the most recent; this has come up a lot)

24583 ravel

27264 duplicated

28955 apply

23179 map

22680 hasnas

_from_sequence(values)

Methods defined on pd.Series/DataFrame

Test Setup

Types

What does E in EA mean?

Sane defaults but powerful overwrites

Object Casting

EAs first vs second class

Memory Shuffles

NumPy Interopt

Special Methods (.dt, .str)

Number of EA types

Views

Types

Object Casting

Number of EA types

NumPy Interopt

Views

`_from_sequence(values)`

Special Methods (`.dt`, `.str`)