pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Index.get_indexer with mixed-reso datetime64s #50690

Open jbrockmendel opened 1 year ago

jbrockmendel commented 1 year ago
import numpy as np
import pandas as pd

ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

left = pd.Index([ms], dtype=object)
right = pd.Index([us], dtype=object)

assert left[0] == right[0]
assert (left == right).all()

>>> left[0] in right  # <- wrong
False
>>> right[0] in left  # <- wrong
False

>>> left.get_loc(right[0])  # <- raises, incorrectly
>>> right.get_loc(left[0])  # <- raises, incorrectly

>>> left.get_indexer(right)  # works correctly AFAICT because it doesn't use the hashtable

# But in a non-monotonic case...
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = pd.Index([ms, sec, day], dtype=object)

>>> left2[:1].get_indexer(right)
array([0])
>>> left2.get_indexer(right)  # <- wrong
array([-1])

IIUC the issue is in the hashing of the datetime64 objects, which do not satisfy the invariant x == y => hash(x) == hash(y) (xref https://github.com/numpy/numpy/issues/3836)

When implementing non-nanosecond support for Timestamp/Timedelta, we implemented __hash__ to retain this invariance (at the cost of performance).

Unless numpy changes its behavior, I think to fix this we need to patch how we treat datetime64 objects in our khash code, likely mirroring Timestamp.__hash__. cc @realead thoughts?
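The mismatch described above can be seen directly in a short sketch. Whether the raw np.datetime64 hashes actually differ depends on the numpy version (numpy#3836 may eventually be fixed), so only the Timestamp behavior, which pandas controls, is asserted here:

```python
import numpy as np
import pandas as pd

ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

# The two scalars compare equal...
assert ms == us
# ...but on numpy versions affected by numpy#3836, hash(ms) != hash(us),
# which is what breaks hashtable-based lookups like Index.get_loc.

# pd.Timestamp deliberately preserves the invariant: equal timestamps
# hash equal regardless of the unit they were constructed from.
ts_ms = pd.Timestamp(ms)
ts_us = pd.Timestamp(us)
assert ts_ms == ts_us
assert hash(ts_ms) == hash(ts_us)
```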

realead commented 1 year ago

The correct place to fix the issue is the __hash__ function of np.datetime64. Treating datetime64 specially would have a cost for all Python objects, even when no datetime64 objects are used. The costs for Float and Complex aren't very high because PyFloat_CheckExact and PyComplex_CheckExact are very fast. I'm not sure there is something similarly fast for np.datetime64.

jbrockmendel commented 1 year ago

The correct place to fix the issue is the hash function of np.datetime64

Agreed. Unfortunately https://github.com/numpy/numpy/issues/3836 doesn't seem to be a high priority for numpy.

Not sure there is something similarly fast for np.datetime64

PyObject_TypeCheck(obj, &PyDatetimeArrType_Type). Since we can make this check performant, the cost in dt64-free cases shouldn't be too bad right?
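For readers following along at the Python level, the C check discussed here corresponds to a plain isinstance check against the np.datetime64 scalar type (the cython helper in pandas is named is_datetime64_object); this is only an illustrative analogue, not the actual C code:

```python
import numpy as np

# Python-level analogue of PyObject_TypeCheck(obj, &PyDatetimeArrType_Type):
# np.datetime64 is the scalar type backing PyDatetimeArrType_Type, so the
# check reduces to an isinstance test.
def is_datetime64_object(obj) -> bool:
    return isinstance(obj, np.datetime64)

assert is_datetime64_object(np.datetime64(1, "ms"))
assert not is_datetime64_object(1.5)
```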

realead commented 1 year ago

@jbrockmendel you probably mean PyDatetimeScalarObject (https://github.com/numpy/numpy/blob/548bc6826b597ab79b9c1451b79ec8d23db9d444/numpy/core/include/numpy/arrayscalars.h#L119-L123)?

Really, it doesn't look too bad for performance (I wrongly assumed we would need to go via isinstance)...

jbrockmendel commented 1 year ago

you probably mean PyDatetimeScalarObject

I wish. PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) is how we check for np.datetime64 objects in our cython code (grep is_datetime64_object). Maybe we can do something in C that we can't in cython?

realead commented 1 year ago

@jbrockmendel Oh, you are correct.

I guess we could do something for the "sane" cases. np.datetime64 is quite a strange class; see for example:

import numpy as np
x=np.datetime64(2_000_000_000_000, "Y")
y=np.datetime64(7_773_671_778_871_345_152, "s")
x,y, x==y
# (numpy.datetime64('2000000001970'), numpy.datetime64('246337854208-06-08T02:59:12'), True)

As you can see, the objects represent quite different times but compare equal (probably due to an overflow in the comparison operator)... ensuring that the hash is the same in this case would be quite a challenge.

A simple idea would be to transform all times into attoseconds ('as'), which fall in the range [-2^63 * 31536000 * 10^18 ... (2^63 - 1) * 31536000 * 10^18], and then calculate the hash value.
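A minimal sketch of this idea in pure Python (the helper name and the restriction to a few units are mine; Python ints are arbitrary precision, so the conversion itself cannot overflow the way an int64 cast would):

```python
import numpy as np

# Attoseconds per unit for a handful of fixed-width units; a real
# implementation would need the full unit table, including 'D', 'M', 'Y'.
AS_PER_UNIT = {"s": 10**18, "ms": 10**15, "us": 10**12, "ns": 10**9}

def hash_dt64_via_attoseconds(dt: np.datetime64) -> int:
    # np.datetime_data returns e.g. ("ms", 1) for a datetime64[ms] dtype
    unit = np.datetime_data(dt.dtype)[0]
    # view("i8") exposes the raw integer count; widening to a Python int
    # makes the multiplication overflow-safe
    return hash(int(dt.view("i8")) * AS_PER_UNIT[unit])

ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")
assert ms == us
assert hash_dt64_via_attoseconds(ms) == hash_dt64_via_attoseconds(us)
```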

jbrockmendel commented 1 year ago

I'd want to do roughly the same thing we did in Timestamp.__hash__ when implementing non-nano:

def hash_dt64_obj(obj):
    if can_be_cast_losslessly_to_pydatetime(obj):
        pydt = as_pydatetime(obj)
        return hash(pydt)

    if obj_reso == attoseconds:
        return obj.view("i8")

    try:
        obj2 = astype_overflowsafe(obj, one_reso_down)
    except OverflowError:
        return obj.view("i8")
    else:
        return hash(obj2)

astype_overflowsafe is implemented in cython; we might need to move it to C to make it accessible to the hashing code.

jbrockmendel commented 1 year ago

@WillAyd I'm trying to implement this in khash_python.h and struggling to make #include <../../tslibs/src/datetime/np_datetime.h> work the way I expect (to get pandas_datetime_to_datetimestruct). At import time I get ImportError: dlopen(/Users/brock/Desktop/pd/bug-hash-dt64/pandas/_libs/hashtable.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_pandas_datetime_to_datetimestruct'. The main thing I've tried is adding tseries_depends to the _libs.hashtable and _libs.interval entries in setup.py.

Thoughts?

WillAyd commented 1 year ago

Given your issue is with dlopen, it's likely not a problem with any of your header includes (or, similarly, the depends argument) but rather with the arguments to the linker. I'm not entirely clear on how setuptools handles things, but you might have luck adding "pandas/_libs/tslibs/src/datetime/np_datetime.c" to the sources argument, as you see on some other extensions.
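In setup.py terms, the suggestion is to list the np_datetime.c translation unit among the extension's sources so the linker sees a definition for the missing symbol; a hypothetical sketch (the exact source list in pandas' real setup.py differs):

```python
from setuptools import Extension

# Compile np_datetime.c into the hashtable extension so that
# pandas_datetime_to_datetimestruct is defined at link time, rather than
# only declared via the header (which is what `depends` covers).
ext = Extension(
    "pandas._libs.hashtable",
    sources=[
        "pandas/_libs/hashtable.pyx",  # the cython module itself
        "pandas/_libs/tslibs/src/datetime/np_datetime.c",  # defines the symbol
    ],
)
assert any(s.endswith("np_datetime.c") for s in ext.sources)
```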

If that doesn't work, feel free to push up a PR and I can take a look. It might also help if you post the gcc command that setuptools is generating.

jbrockmendel commented 1 year ago

add the sources argument

bingo, thanks.