pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Pandas 0.25.0 breaks np.isin #31080

Open lamourj opened 4 years ago

lamourj commented 4 years ago

Observed with numpy 1.17.4 as well as (latest) 1.18.1:

>>> import numpy as np
>>> import pandas as pd
>>> # pandas 0.24.2:
>>> l = [pd.Timestamp]
>>> pd.Timestamp == pd.Timestamp
True
>>> np.isin(l, l)
array([ True])
>>> # pandas 0.25.0:
>>> l = [pd.Timestamp]
>>> pd.Timestamp == pd.Timestamp
True
>>> np.isin(l, l)
array([False])

Problem description

Since 0.25.0, pd.Timestamp == pd.Timestamp still evaluates to True, but np.isin no longer finds the pd.Timestamp class in a list that contains it. The same behaviour is observed with pandas 0.25.3.
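
In the meantime, one possible workaround is to skip == entirely and test membership by identity. The sketch below is only an illustration: isin_by_identity and Marker are hypothetical names, not part of NumPy or pandas, and comparing with `is` only makes sense when the elements are objects such as classes, where identity is the intended notion of equality.

```python
import numpy as np


def isin_by_identity(elements, test_elements):
    """Hypothetical helper: membership test that compares with `is` only."""
    candidates = list(test_elements)
    # `is` never calls __eq__, so the result does not depend on how the
    # objects (or their metaclass) implement ==.
    return np.array(
        [any(e is c for c in candidates) for e in elements], dtype=bool
    )


class Marker:
    """Stand-in class used only for this illustration."""


print(isin_by_identity([Marker, int], [Marker]).tolist())
```

With pandas installed, the same call should work for `[pd.Timestamp]`, since no equality operator is involved.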

jreback commented 4 years ago

I suppose; we have literally zero support for this right now.

A PR that patches this, with tests, would be welcome.

lamourj commented 4 years ago

Thanks for the answer. Any ideas on how you would approach this?

jreback commented 4 years ago

I have no idea what np.isin actually does with non numpy types; I'm not even sure why you would want to do this. Series.isin is well supported, tested and type aware.

We do have numpy ufunc compatibility, but I don't know how np.isin behaves.
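
For reference, the Series.isin route mentioned above looks roughly like this. It is a minimal sketch with made-up dates, and it operates on Timestamp values (instances), which is a different case from the list holding the pd.Timestamp class itself in the original report.

```python
import pandas as pd

# A datetime64 Series built from illustrative dates.
dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"]))

# Values to look for, given as Timestamp instances.
wanted = [pd.Timestamp("2020-01-01")]

# Series.isin returns a boolean Series aligned with `dates`.
result = dates.isin(wanted)
print(result.tolist())
```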

simonjayhawkins commented 4 years ago

From a quick investigation, it does look like a pandas issue:

>>> np.array([pd.Timestamp]) == pd.Timestamp
False
>>>
>>> np.array([object]) == object
array([ True])
>>>
>>> np.array([5]) == 5
array([ True])
>>>
>>> import datetime
>>>
>>> np.array([datetime.datetime]) == datetime.datetime
array([ True])
>>>

> I have no idea what np.isin actually does with non numpy types

For object arrays, it basically does the following:

>>> l = [pd.Timestamp]
>>>
>>> ar1 = np.asarray(l)
>>> ar1
array([<class 'pandas._libs.tslibs.timestamps.Timestamp'>], dtype=object)
>>>
>>> ar1 = np.asarray(ar1).ravel()
>>> ar1
array([<class 'pandas._libs.tslibs.timestamps.Timestamp'>], dtype=object)
>>>
>>> ar2 = np.asarray(l).ravel()
>>> ar2
array([<class 'pandas._libs.tslibs.timestamps.Timestamp'>], dtype=object)
>>>
>>> contains_object = ar1.dtype.hasobject or ar2.dtype.hasobject
>>> contains_object
True
>>>
>>> mask = np.zeros(len(ar1), dtype=bool)
>>> mask
array([False])
>>>
>>> for a in ar2:
...     mask |= ar1 == a
>>> mask
array([False])
>>>
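
Collected into one function, the steps above look roughly like this. It is a sketch of the object-dtype path as walked through in this comment, not a copy of NumPy's source; isin_object_sketch is a hypothetical name, and the example uses standard-library classes whose == behaves normally.

```python
import datetime

import numpy as np


def isin_object_sketch(elements, test_elements):
    """Hypothetical re-creation of the loop shown above."""
    ar1 = np.asarray(elements).ravel()
    ar2 = np.asarray(test_elements).ravel()
    mask = np.zeros(len(ar1), dtype=bool)
    for a in ar2:
        # Elementwise ==: this is the comparison that returns False
        # for pd.Timestamp in the session above.
        mask |= ar1 == a
    return mask


print(isin_object_sketch([datetime.datetime, object], [object]).tolist())
```

Because the result depends entirely on `ar1 == a`, any object whose elementwise comparison against an object array comes back False will be reported as missing, which matches what the session shows for pd.Timestamp.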