Open jbrockmendel opened 1 year ago
The correct place is to fix the issue is the __hash__
function of the np.datetime64. Treating datetime64 would have cost for all Python-objects, even if no datetime64-objects are used. Costs for Float and Complex aren't very high because PyFloat_CheckExact
and PyComplex_CheckExact
are very fast. Not sure there is something similarly fast for np.datetime64.
The correct place is to fix the issue is the hash function of the np.datetime64
Agreed. Unfortunately https://github.com/numpy/numpy/issues/3836 doesn't seem to be a high priority for numpy.
Not sure there is something similarly fast for np.datetime64
PyObject_TypeCheck(obj, &PyDatetimeArrType_Type)
. Since we can make this check performant, the cost in dt64-free cases shouldn't be too bad right?
@jbrockmendel you probably mean PyDatetimeScalarObject (https://github.com/numpy/numpy/blob/548bc6826b597ab79b9c1451b79ec8d23db9d444/numpy/core/include/numpy/arrayscalars.h#L119-L123)?
Really, it doesn't look too bad for performance (I wrongly assumed we would need to go via isinstance
)...
you probably mean PyDatetimeScalarObject
I wish. PyObject_TypeCheck(obj, &PyDatetimeArrType_Type)
is how we check for np.datetime64 objects in our cython code (grep is_datetime64_object). Maybe we can do something in C that we can't in cython?
@jbrockmendel Oh, you are correct.
I guess we could do something for "sane" cases. np.datatime64
is quite a strange class, see for example:
import numpy as np
x=np.datetime64(2_000_000_000_000, "Y")
y=np.datetime64(7_773_671_778_871_345_152, "s")
x,y, x==y
# (numpy.datetime64('2000000001970'), numpy.datetime64('246337854208-06-08T02:59:12'), True)
As you can see the objects represent quite different time, but are equal (probably due to an overflow in the compare-operator)... to ensure, that the hash in this case is the same would be quite a challenge.
A simply idea would to transform all times in attosecond ('as'), which will be in range [ -2^63*31536000*10^18 ... (2^63-1)*31536000*10^18]
) and then to calculate the hash-value.
I'd want to do roughly the same thing we did in Timestamp.__hash__
when implementing non-nano:
def hash_dt64_obj(obj):
if can_be_cast_losslessly_to_pydatetime(obj):
pydt = as_pydatetime(obj)
return hash(pydt)
if obj_reso == attoseconds:
return obj.view("i8")
try:
obj2 = astype_overflowsafe(obj, one_reso_down)
except OverflowError:
return obj.view("i8")
else:
return hash(obj2)
astype_overflowsafe
is implemented in cython, might need to move it to C to make it accessible for the hashing code
@WillAyd im trying to implement this in khash_python.h and struggling to make #include <../../tslibs/src/datetime/np_datetime.h>
work the way I expect (to get pandas_datetime_to_datetimestruct). At import time I get ImportError: dlopen(/Users/brock/Desktop/pd/bug-hash-dt64/pandas/_libs/hashtable.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_pandas_datetime_to_datetimestruct'
. Main thing I've tried is adding tseries_depends to the _libs.hashtable and _libs.interval entries in setup.py.
Thoughts?
Given your issue is with dlopen
its likely not an issue with any of your header includes (and similarly the depends argument) but rather with the arguments to the linker. I'm not entirely clear on how setuptools handles things but you might have luck trying to add the sources
argument of "pandas/_libs/tslibs/src/datetime/np_datetime.c"
that you see on some other extensions.
If that doesn't work feel free to push up a PR and can take a look. Also might help if you post the gcc command that setuptools is generating.
add the sources argument
bingo, thanks.
IIUC the issue is in the hashing of the datetime64 objects, which do not follow the invariance
x == y \Rightarrow hash(x) == hash(y)
(xref https://github.com/numpy/numpy/issues/3836)When implementing non-nanosecond support for Timestamp/Timedelta, we implemented
__hash__
to retain this invariance (at the cost of performance).Unless numpy changes its behavior, I think to fix this we need to patch how we treat datetime64 objects in our khash code, likely mirroring
Timestamp.__hash__
. cc @realead thoughts?