pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

ExtensionArray support for comparisons to fundamental types and lists of those types #20659

Closed Dr-Irv closed 6 years ago

Dr-Irv commented 6 years ago

This is probably for @TomAugspurger to look at.

I am trying out an idea that uses the new ExtensionArray capability that is in master. What I would like to do is override the relational operators __le__, __ge__, etc. I'm following the cyberpandas example that @TomAugspurger published.

Using cyberpandas, I ran the following code

Code Sample, a copy-pastable example if possible

import cyberpandas as ip
import pandas as pd

v = ip.IPArray.from_pyints([1, 2, 3])
s = pd.Series(v)
try:
    res = (s <= pd.Series([4,1,3]))
except Exception as e:
    print("exception raised for comparison to Series ", e)
try:
    res = (s <= [4,1,3])
except Exception as e:
    print("exception raised for comparison to list ", e)

try:
    res = (s <= 2)
except Exception as e:
    print("exception raised for comparison to int ", e)

try:
    res = (s <= s[2])
    print("no exception raised for comparison to IPType")
except Exception as e:
    print("exception raised for comparison to IPType ", e)

The result is the following output:

exception raised for comparison to Series  invalid type comparison
exception raised for comparison to list  '<=' not supported between instances of 'IPv4Address' and 'int'
exception raised for comparison to int  '<=' not supported between instances of 'IPv4Address' and 'int'
no exception raised for comparison to IPType

Problem description

The first exception is expected, because cyberpandas.ip_array.__le__ raises NotImplemented, which is then caught in pandas.core.ops.na_op() at line 1166.

The second two exceptions are a bit not expected, because I would want the IPArray.__le__() method to be called so that the NotImplemented exception is raised, but in this case, there is code in pandas.core.ops.wrapper() around line 1248 that converts the Series into a numpy array of objects, and then pandas.core.ops._comp_method_OBJECT_ARRAY() is called, which is trying to do an efficient comparison, which never calls the IPArray.__le__ method.

So the problem here is that for my application, I would like the ExtensionArray.__le__() implementation to be called when other is either a constant or a list. I think the right fix is to place a test in wrapper that says if the self.dtype is an object, then call na_op() directly.

But maybe that's not the best way to do it??

Expected Output

exception raised for comparison to Series invalid type comparison exception raised for comparison to list invalid type comparison exception raised for comparison to int invalid type comparison no exception raised for comparison to IPType

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.23.0.dev0+683.g402ad45da pytest: 3.4.0 pip: 9.0.1 setuptools: 38.5.1 Cython: 0.25.1 numpy: 1.14.1 scipy: 1.0.0 pyarrow: 0.8.0 xarray: None IPython: 6.2.1 sphinx: 1.7.1 patsy: 0.5.0 dateutil: 2.6.1 pytz: 2018.3 blosc: 1.5.1 bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.4 feather: None matplotlib: 2.2.0 openpyxl: 2.5.0 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.2 lxml: 4.1.1 bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.5 pymysql: 0.8.0 psycopg2: None jinja2: 2.10 s3fs: 0.1.3 fastparquet: None pandas_gbq: None pandas_datareader: None
TomAugspurger commented 6 years ago

Haven't thought about ops (comparisons or arithmetic) at all really. I think it's best to always call the EA's op, regardless of the argument type.

This should be sorted out when Period/DatetimeIndex are ported over, since those do support ops.

Dr-Irv commented 6 years ago

@TomAugspurger So is the port of Period/Datetimeindex on your "to-do" list? Is it intended for 0.23?

TomAugspurger commented 6 years ago

Yes, I think so (there's a dev chat tomorrow at 9:00 Eastern where we'll finalize these things, if you're able to attend).

On Wed, Apr 11, 2018 at 1:39 PM, Dr. Irv notifications@github.com wrote:

@TomAugspurger https://github.com/TomAugspurger So is the port of Period/Datetimeindex on your "to-do" list? Is it intended for 0.23?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/20659#issuecomment-380554300, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHItV7rO1dVllN0aoF_O32WP8oz_E1ks5tnk3WgaJpZM4TQgJt .

Dr-Irv commented 6 years ago

@TomAugspurger I could try to attend. Where do I find the details?

TomAugspurger commented 6 years ago

https://mail.python.org/pipermail/pandas-dev/2018-April/000738.html


From: Dr. Irv notifications@github.com Sent: Wednesday, April 11, 2018 2:18:43 PM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] ExtensionArray support for comparisons to fundamental types and lists of those types (#20659)

@TomAugspurgerhttps://github.com/TomAugspurger I could try to attend. Where do I find the details?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/pandas-dev/pandas/issues/20659#issuecomment-380565915, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABQHIjUDDnbB-s_1PW3onm6UsDwv6ox5ks5tnlcTgaJpZM4TQgJt.

jorisvandenbossche commented 6 years ago

An initial attempt (which would be welcome to try out!), might be to add something in wrapper similar as

        elif is_categorical_dtype(self):
            # Dispatch to Categorical implementation; pd.CategoricalIndex
            # behavior is non-canonical GH#19513
            res_values = dispatch_to_index_op(op, self, other, pd.Categorical)
            return self._constructor(res_values, index=self.index,
                                     name=res_name)

but then for extension arrays:

        elif is_extension_array(self):
            # Dispatch to ExtensionArray ops
            res_values = dispatch_to_array_op(op, self, other)
            return self._constructor(res_values, index=self.index,
                                     name=res_name)

(but I am not very familiar with this part of the code)

Dr-Irv commented 6 years ago

@jorisvandenbossche that almost worked. What I needed was the is_extension_array_dtype method. I'll submit a pull request soon.

jorisvandenbossche commented 6 years ago

Good to hear! (yeah, my code snippet was only quickly written, didn't check how the exact is_ check was spelled)

Dr-Irv commented 6 years ago

Closing because of #21261