pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: get_loc / get_indexer with NaT and tz-aware DatetimeIndex #32572

Open kernc opened 4 years ago

kernc commented 4 years ago

Code Sample, a copy-pastable example if possible

>>> pd.date_range('2020', 'now').get_loc(pd.NaT, method='nearest')
0
    # Ok? NaT would be better to propagate.

>>> pd.date_range('2020', 'now', tz='US/Central').get_loc(pd.NaT, method='nearest')
-----------------------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pandas/core/indexes/datetimes.py", line 582, in get_loc
    return Index.get_loc(self, key, method, tolerance)
  File "pandas/core/indexes/base.py", line 2869, in get_loc
    indexer = self.get_indexer([key], method=method, tolerance=tolerance)
  File "pandas/core/indexes/base.py", line 2951, in get_indexer
    target, method=method, limit=limit, tolerance=tolerance
  File "pandas/core/indexes/base.py", line 2962, in get_indexer
    indexer = self._get_nearest_indexer(target, limit, tolerance)
  File "pandas/core/indexes/base.py", line 3046, in _get_nearest_indexer
    left_distances = np.abs(self[left_indexer] - target)
  File "pandas/core/indexes/base.py", line 2361, in __sub__
    return Index(np.array(self) - other)
  File "pandas/core/indexes/base.py", line 2367, in __rsub__
    return Index(other - Series(self))
  File "pandas/core/series.py", line 646, in __array_ufunc__
    self, ufunc, method, *inputs, **kwargs
  File "pandas/_libs/ops_dispatch.pyx", line 91, in pandas._libs.ops_dispatch.maybe_dispatch_ufunc_to_dunder_op
  File "pandas/core/ops/common.py", line 63, in new_method
    return method(self, other)
  File "pandas/core/ops/__init__.py", line 500, in wrapper
    result = arithmetic_op(lvalues, rvalues, op, str_rep)
  File "pandas/core/ops/array_ops.py", line 218, in arithmetic_op
    res_values = dispatch_to_extension_op(op, lvalues, rvalues)
  File "pandas/core/ops/dispatch.py", line 125, in dispatch_to_extension_op
    res_values = op(left, right)
  File "pandas/core/ops/roperator.py", line 13, in rsub
    return right - left
  File "pandas/core/arrays/datetimelike.py", line 1428, in __rsub__
    f"cannot subtract {type(self).__name__} from {type(other).__name__}"
TypeError: cannot subtract DatetimeArray from ndarray

Problem description

pd.NaT is NaT regardless of timezone.
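A minimal sketch of what that statement means in practice, assuming a reasonably recent pandas: a missing timestamp is represented by the same `pd.NaT` singleton whether or not a timezone is attached. The variable names here are illustrative only.

```python
import pandas as pd

# Building a "missing" timestamp with or without a timezone
# yields the same singleton object, pd.NaT.
naive_nat = pd.Timestamp("NaT")
aware_nat = pd.Timestamp("NaT", tz="US/Central")

print(naive_nat is pd.NaT)
print(aware_nat is pd.NaT)

# Pulling the missing element out of a tz-aware index also gives pd.NaT.
aware_index = pd.DatetimeIndex([pd.NaT], tz="US/Central")
print(aware_index[0] is pd.NaT)
```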

Expected Output


>>> pd.date_range('2020', 'now').get_loc(pd.NaT, method='nearest')
NaT

>>> pd.date_range('2020', 'now', tz='US/Central').get_loc(pd.NaT, method='nearest')
NaT

Output of pd.show_versions()

pandas 1.1.0.dev0+725.gae79bb23c

jorisvandenbossche commented 4 years ago

I suppose the title is wrong?

jorisvandenbossche commented 4 years ago

Ah, sorry, I see that it is the message in the error (but still, that's not the actual issue I think). Previously in 0.25.0, there was a different (but also not good) error: "TypeError: bad operand type for abs(): 'NaTType'"

jorisvandenbossche commented 4 years ago

It seems that somewhere in the code, the datetime index is converted to object dtype, which leads to having an object dtype array with timestamps (and this gives the error about not being able to subtract an ndarray).

This happens here:

https://github.com/pandas-dev/pandas/blob/76a1710c70e42ba03c65fbc1ffdfd718981848f3/pandas/core/indexes/base.py#L2947-L2952

and we end up there because the dtype of the index is not equal to the dtype of the target (datetime64[ns, tz] vs datetime64[ns]).
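The dtype mismatch described above can be sketched roughly as follows. This is only an illustration, not the code path inside `get_indexer`: a target built from a bare `pd.NaT` is tz-naive, so its dtype differs from that of a tz-aware index, whereas constructing the target with the index's own dtype keeps the two aligned. The names `naive_target` and `aware_target` are hypothetical.

```python
import pandas as pd

dti = pd.date_range("2020-01-01", periods=3, tz="US/Central")

# A target built from a bare NaT carries no timezone ...
naive_target = pd.DatetimeIndex([pd.NaT])
# ... while passing the index's dtype explicitly makes it tz-aware.
aware_target = pd.DatetimeIndex([pd.NaT], dtype=dti.dtype)

print(dti.dtype == naive_target.dtype)  # the mismatch: tz-aware vs tz-naive
print(dti.dtype == aware_target.dtype)  # dtypes agree

# With matching dtypes, an exact lookup reports NaT as "not found".
print(dti.get_indexer(aware_target))
```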

letapxad commented 4 years ago

take

jbrockmendel commented 1 year ago

NaT is not a sensible return type for get_loc/get_indexer. These methods return integers, masks, or slices that can be used in positional indexing.

dti = pd.date_range('2020', 'now', tz='US/Central')
target = pd.DatetimeIndex([pd.NaT], dtype=dti.dtype)

>>> dti.get_indexer(target)
array([-1])

>>> dti.get_indexer(target, method="nearest")
array([1301])

The 1301 seems weird to me
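For reference, a small sketch of the return types mentioned above (integers, masks, or slices), using plain label indexes rather than datetimes. It only illustrates the general contract of `get_loc` and `get_indexer` and does not attempt to explain the 1301 result.

```python
import pandas as pd

# Unique labels: get_loc returns an integer position.
unique = pd.Index(list("abc"))
loc_int = unique.get_loc("b")

# Duplicates in a monotonic index: get_loc returns a slice.
sorted_dupes = pd.Index(list("abbc"))
loc_slice = sorted_dupes.get_loc("b")

# Duplicates in a non-monotonic index: get_loc returns a boolean mask.
scattered_dupes = pd.Index(list("abcb"))
loc_mask = scattered_dupes.get_loc("b")

# get_indexer returns integer positions and marks missing labels with -1.
positions = unique.get_indexer(["b", "z"])

print(loc_int, loc_slice, loc_mask, positions)
```

Every one of these results can be passed to positional indexing, which is why a NaT return value would not fit the same contract.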