ENH: allow EA to register types for is_scalar

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

ENH: allow EA to register types for is_scalar #27462

Open jbrockmendel opened 5 years ago

jbrockmendel commented 5 years ago

https://github.com/pandas-dev/pandas/pull/27461#discussion_r305168936

I think we need a way for EAs to hook into this for an EA scalar, e.g. an IPAddress from cyberpandas could register a scalar, I think.

Before we move on this, I think we need to clarify in which situations we care about lib.is_scalar(x) vs the simpler np.ndim(x) == 0
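To make the difference between the two checks concrete, here is a minimal sketch. `IPAddressLike` is a hypothetical stand-in for an extension scalar, not the actual cyberpandas type; `is_scalar` is the public `pandas.api.types` wrapper around `lib.is_scalar`.

```python
import numpy as np
from pandas.api.types import is_scalar


class IPAddressLike:
    """Hypothetical stand-in for an extension scalar (not the cyberpandas type)."""

    def __init__(self, value: str) -> None:
        self.value = value


addr = IPAddressLike("192.168.0.1")

# Built-in scalar: both checks agree that it is a scalar.
builtin_checks = (is_scalar(1), np.ndim(1) == 0)

# Plain list: both checks agree that it is not a scalar.
list_checks = (is_scalar([1, 2]), np.ndim([1, 2]) == 0)

# Custom object: is_scalar does not recognize it, while np.ndim wraps it
# in a zero-dimensional object array and reports ndim 0.
custom_checks = (is_scalar(addr), np.ndim(addr) == 0)

print(builtin_checks, list_checks, custom_checks)
```

The two checks only disagree on the custom object, which is the case an EA author cares about.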

TomAugspurger commented 5 years ago

One example is for nested data. In this case we need something like scalar_for_dtype(value, dtype), since the ndim of a "scalar" for a nested data type would be > 0.
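A rough sketch of what such a helper might look like. The name `scalar_for_dtype` comes from the comment above, but the signature, the string dtype markers, and the rules below are assumptions made purely for illustration; nothing like this exists in pandas.

```python
import numpy as np

# Hypothetical dtype markers, used only for this illustration.
NESTED = "list"  # each element of the array is itself a list
FLAT = "int"     # each element is a plain integer


def scalar_for_dtype(value, dtype) -> bool:
    """Return True if `value` is a single element for the given toy dtype."""
    if dtype == NESTED:
        # For nested data a single element is a sequence, so its ndim is > 0.
        return isinstance(value, (list, tuple))
    # For flat data the usual zero-dimensional check is enough.
    return np.ndim(value) == 0


element = [1, 2, 3]
print(np.ndim(element))                   # ndim of the nested "scalar"
print(scalar_for_dtype(element, NESTED))  # treated as one element
print(scalar_for_dtype(element, FLAT))    # treated as a sequence
```

The same value is a scalar for one dtype and a sequence for another, which is why a dtype-free check cannot settle the question.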

jorisvandenbossche commented 5 years ago

Alternative for registering, could be a method on the dtype/array that can check if a value is a valid scalar?
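As a sketch of that alternative, the check could live on the dtype itself. `ToyIPDtype` and its `is_valid_scalar` method are hypothetical names chosen for this example; they are not part of the ExtensionDtype interface.

```python
import ipaddress


class ToyIPDtype:
    """Hypothetical dtype-like class; not a real pandas ExtensionDtype."""

    def is_valid_scalar(self, value) -> bool:
        # A value counts as a scalar for this dtype only if it is an
        # IP address object from the standard library.
        return isinstance(value, (ipaddress.IPv4Address, ipaddress.IPv6Address))


dtype = ToyIPDtype()
print(dtype.is_valid_scalar(ipaddress.ip_address("10.0.0.1")))
print(dtype.is_valid_scalar("10.0.0.1"))
```

With this shape no global registry is needed: the caller asks the dtype it already holds.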

sterlinm commented 2 years ago

Hi! I think I've run into this issue in my own attempt at building an ExtensionArray and I was curious if there'd been any changes on this or if it was something I could potentially contribute on.

I've been working on an extension array where the na_value I want to return for the ExtensionDtype is not recognized as a scalar by is_scalar. That seems to cause issues with some methods that aren't part of the ExtensionArray interface that I can't figure out how to fix (e.g. Series.where).

Is there another workaround for this that I haven't found yet? Thanks!

jbrockmendel commented 2 years ago

Is there another workaround for this that I haven't found yet?

The only thought that comes to mind is trying to replace is_scalar checks with not is_list_like checks. Last time I checked (worth double-checking, since this was a while ago), is_list_like was faster than is_scalar anyway, and it should be more robust to this problem.
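A small illustration of why the negated list-like check is more forgiving, using the public `pandas.api.types.is_list_like` and `is_scalar`. `ToyQuantity` is a hypothetical scalar class for this example only, not the pint type.

```python
from pandas.api.types import is_list_like, is_scalar


class ToyQuantity:
    """Hypothetical extension scalar: a magnitude with a unit."""

    def __init__(self, magnitude: float, unit: str) -> None:
        self.magnitude = magnitude
        self.unit = unit


q = ToyQuantity(1.5, "m")

# is_scalar only knows a fixed set of types, so the custom object is rejected.
recognized_by_is_scalar = is_scalar(q)

# The object is not iterable, so "not is_list_like" accepts it as a scalar.
accepted_as_not_list_like = not is_list_like(q)

# An actual list of such objects is still treated as list-like.
list_is_list_like = is_list_like([q, q])

print(recognized_by_is_scalar, accepted_as_not_list_like, list_is_list_like)
```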

sterlinm commented 2 years ago

Thanks very much! It looks like that change has already been made in a number of places in the most recent versions of Pandas (I was testing on 1.3).

Thanks for your help and sorry to bother you!

andrewgsavage commented 2 years ago

Now that is_list_like interprets scalars correctly, https://github.com/pandas-dev/pandas/pull/44626, this is now the main issue holding back pint-pandas.

A few different approaches have been suggested in this issue since it was created. What's the suggested way to fix this at the moment?

edit: I was able to get all tests in pint-pandas passing without this, so it may not be needed.

jbrockmendel commented 1 year ago

I looked at this in April and writing up my conclusions fell through the cracks.

Many of the places where we use is_scalar (also is_list_like) are either 1) a preliminary check of whether we can use the value as a scalar in __setitem__, or 2) a check of whether we should treat it as a single label or a sequence of labels for indexing.

In the latter case, is_scalar is behaving like a faster is_hashable (58ns vs 506ns on []).
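A quick way to see how closely the two checks line up on typical label inputs, using the public `pandas.api.types.is_hashable` and `is_scalar`. The sample labels are chosen for illustration, and no timings are measured here.

```python
from pandas.api.types import is_hashable, is_scalar

# Candidate indexing keys: single labels and sequences of labels.
candidates = ["a", 1, ("a", 1), [], ["a", "b"]]

# Pair each candidate with the (is_scalar, is_hashable) results.
results = [(is_scalar(x), is_hashable(x)) for x in candidates]

for candidate, (scalar, hashable) in zip(candidates, results):
    print(repr(candidate), scalar, hashable)
```

On this sample the two checks agree for strings, integers, and lists, and differ on the tuple, which is hashable but not a scalar.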

In the former, we should be able to use an EA-specific method to check if the item is a scalar that is valid for the specific array at hand. We already have something like this for most of our internal EAs: DTA, TDA, PeriodArray, Categorical, PandasArray, IntervalArray, and MaskedArray all have _validate_setitem_value, and ArrowExtensionArray has _maybe_convert_setitem_value.
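The idea of an array-specific check in `__setitem__` can be sketched with a toy container. `ToyPointArray` is hypothetical and is not an ExtensionArray; the method borrows the name `_validate_setitem_value` from the comment above, but its signature and behavior here are assumptions, not the pandas implementation.

```python
class ToyPointArray:
    """Hypothetical array whose elements are (x, y) tuples."""

    def __init__(self, points) -> None:
        self._data = [self._validate_setitem_value(p) for p in points]

    def _validate_setitem_value(self, value):
        # A 2-tuple of numbers is a valid scalar for this array, even though
        # a generic scalar check would reject a tuple.
        if (
            isinstance(value, tuple)
            and len(value) == 2
            and all(isinstance(v, (int, float)) for v in value)
        ):
            return (float(value[0]), float(value[1]))
        raise TypeError(f"{value!r} is not a valid scalar for ToyPointArray")

    def __setitem__(self, key: int, value) -> None:
        # The array decides what counts as a scalar, not a global check.
        self._data[key] = self._validate_setitem_value(value)

    def __getitem__(self, key: int):
        return self._data[key]


arr = ToyPointArray([(0, 0), (1, 1)])
arr[0] = (2, 3)  # accepted: valid scalar for this array

try:
    arr[1] = "bad"  # rejected: not a valid scalar for this array
    rejected = False
except TypeError:
    rejected = True

print(arr[0], rejected)
```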