[BUG] MultiIndex loc expects an iterable when passed Timestamp #8585

Open pbruneau opened 3 years ago

pbruneau commented 3 years ago

Describe the bug: cuDF DataFrames indexed by a Timestamp range can be accessed using .loc[] without any problem. However, if the cuDF DataFrame is indexed by a MultiIndex with timestamps as the first level, .loc[] fails, whereas the same access causes no issue with pandas.

Steps/Code to reproduce bug: The following gist holds a self-contained example. The last line of the code fails with the error: TypeError: 'Timestamp' object is not iterable

Expected behavior: I would expect the pandas and cuDF snippets to behave similarly.

beckernick commented 3 years ago

Thanks for including a simple reproducer gist. I've included it below for ease of access.

from datetime import datetime  # needed for datetime.strptime below
import pandas as pd
import cudf
import numpy as np
start = pd.Timestamp(datetime.strptime('2021-03-12 00:00+0000',  '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 03:00+0000',  '%Y-%m-%d %H:%M%z'))
timestamps = pd.date_range(start, end, freq='1H')
labels = ['A', 'B', 'C']
index = pd.MultiIndex.from_product([timestamps, labels], names=["timestamp", "label"])
value = np.random.normal(size=12)
df = pd.DataFrame(value, index=index, columns=['value'])
df_gpu = cudf.from_pandas(df)
stamp = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000',  '%Y-%m-%d %H:%M%z'))
print(df.loc[stamp]) # SUCCEEDS
print(df_gpu.loc[stamp]) # FAILS
A      1.184793
B     -0.253166
C     -0.790236
TypeError                                 Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
    234             try:
--> 235                 return self._getitem_tuple_arg(arg)
    236             except (TypeError, KeyError, IndexError, ValueError):

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/contextlib.py in inner(*args, **kwds)
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in _getitem_tuple_arg(self, arg)
    360                 else:
--> 361                     return columns_df.index._get_row_major(columns_df, arg)
    362         else:

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _get_row_major(self, df, row_tuple)
    926                 row_tuple = slice(row_tuple.start, self[-1], row_tuple.step)
--> 927         self._validate_indexer(row_tuple)
    928         valid_indices = self._get_valid_indices_by_tuple(

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _validate_indexer(self, indexer)
    958         else:
--> 959             for i in indexer:
    960                 self._validate_indexer(i)

TypeError: 'Timestamp' object is not iterable

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-87-fe779946243b> in <module>
     16 print(df.loc[stamp]) # SUCCEEDS
---> 17 print(df_gpu.loc[stamp]) # FAILS

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
    235                 return self._getitem_tuple_arg(arg)
    236             except (TypeError, KeyError, IndexError, ValueError):
--> 237                 return self._getitem_tuple_arg((arg, slice(None)))
    238         else:
    239             if not isinstance(arg, tuple):

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in _getitem_tuple_arg(self, arg)
    357             else:
    358                 if isinstance(arg, tuple):
--> 359                     return columns_df.index._get_row_major(columns_df, arg[0])
    360                 else:
    361                     return columns_df.index._get_row_major(columns_df, arg)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _get_row_major(self, df, row_tuple)
    925             if row_tuple.stop is None:
    926                 row_tuple = slice(row_tuple.start, self[-1], row_tuple.step)
--> 927         self._validate_indexer(row_tuple)
    928         valid_indices = self._get_valid_indices_by_tuple(
    929             df.index, row_tuple, len(df.index)

/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _validate_indexer(self, indexer)
    957             self._validate_indexer(indexer.stop)
    958         else:
--> 959             for i in indexer:
    960                 self._validate_indexer(i)

TypeError: 'Timestamp' object is not iterable

It looks like we go down a codepath that expects an iterable, which explains why wrapping with a tuple works (and may resolve your problem in the short term):

print(df_gpu.loc[(stamp,)]) # SUCCEEDS
A      1.184793
B     -0.253166
C     -0.790236
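
To illustrate why the tuple makes a difference, here is a minimal sketch of a recursive check with the same shape as the _validate_indexer frames in the traceback. It is not the cuDF source: the function name validate_indexer, the nlevels parameter, and the tuple branch are assumptions made for illustration, and a standard-library datetime stands in for pd.Timestamp (which subclasses it and is likewise not iterable).

```python
from datetime import datetime
import numbers


def validate_indexer(indexer, nlevels=2):
    """Hypothetical, simplified stand-in for a recursive indexer check."""
    if indexer is None or isinstance(indexer, (numbers.Number, str)):
        return  # plain scalars need no further checking
    if isinstance(indexer, tuple):
        # Assumed branch: a tuple is only checked against the number of levels.
        if len(indexer) > nlevels:
            raise IndexError("Indexer size exceeds number of levels")
    elif isinstance(indexer, slice):
        validate_indexer(indexer.start, nlevels)
        validate_indexer(indexer.stop, nlevels)
    else:
        # Anything else is assumed to be iterable; a bare timestamp is not.
        for item in indexer:
            validate_indexer(item, nlevels)


stamp = datetime(2021, 3, 12, 2, 0)  # stands in for pd.Timestamp

validate_indexer((stamp,))  # accepted: handled by the tuple branch

try:
    validate_indexer(stamp)  # falls through to the "for item in ..." loop
except TypeError as exc:
    error_message = str(exc)  # message ends with "object is not iterable"
```

In this sketch a bare timestamp is neither a number, a tuple, nor a slice, so it reaches the final loop and raises the same kind of TypeError as in the traceback, while the one-element tuple never gets that far.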

pbruneau commented 3 years ago

Hi @beckernick, thanks for the answer!

The "tuple trick" above seems to do the job for accessing a single value. However, I run into trouble again if I want to fetch values for a timestamp range.

Elaborating from my previous gist example, if I type:

start = pd.Timestamp(datetime.strptime('2021-03-12 01:00+0000',  '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000',  '%Y-%m-%d %H:%M%z'))

I get the expected result:

timestamp                 label          
2021-03-12 01:00:00+00:00 A     -0.466112
                          B     -0.781473
                          C     -1.010174
2021-03-12 02:00:00+00:00 A      0.160179
                          B      1.007183
                          C     -1.053772

With cuDF, the following gets the usual TypeError: 'Timestamp' object is not iterable:


Alternatively, trying:


gets a SyntaxError: invalid syntax. Using a regular Timestamp range with:

start = pd.Timestamp(datetime.strptime('2021-03-12 01:00+0000',  '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000',  '%Y-%m-%d %H:%M%z'))
timestamps = pd.date_range(start, end, freq='1H')

I get ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Any idea for circumventing this issue?

pbruneau commented 3 years ago


I'm following up on the bug reported above. As noted in my last answer, using a tuple to access a Timestamp in the first level of a MultiIndex circumvents the issue pointed out initially, but the proposed workaround fails if one wants to access a Timestamp range.

I realize that the title no longer accurately reflects the remaining bug. Should I create a new issue that singles out the Timestamp range bug, or rename this one?

beckernick commented 2 years ago

We've been refactoring our MultiIndex implementation to help make it more efficient and maintainable. Is the pandas snippet in the comment above a minimal example of the desired behavior @pbruneau ?

pbruneau commented 2 years ago

Hi @beckernick,

Here is an updated minimal gist which lists in detail what works, what does not work, and the workarounds (as of version 21.08.02 installed on my side).

In a nutshell (please refer to the gist for details): with a MultiIndex and timestamps as the primary key, pandas allows this kind of operation:


where stamp and timestamps are a valid timestamp and a valid timestamp range, respectively. I would like to do the same with cuDF, but as of v21.08.02 it is not possible.

pbruneau commented 2 years ago

The problems reported above and highlighted in this minimal gist still occur with v21.12, with exactly the same error messages.

wence- commented 1 year ago

An update here (I am working through many indexing corner cases). Apologies for the very slow responses.

In 23.06 (the current development version) there is an error constructing time-zone aware timestamps (previously they were accepted but handled incorrectly, now they are not accepted, soon they will be accepted and handled correctly). However, if I remove the timezone portion of the timestamps, then only your last example now fails (I reproduce here for posterity):

import pandas as pd
from datetime import datetime
import cudf
import numpy as np

start = pd.Timestamp(datetime.strptime('2021-03-12 00:00',  '%Y-%m-%d %H:%M'))
end = pd.Timestamp(datetime.strptime('2021-03-12 03:00',  '%Y-%m-%d %H:%M'))
timestamps = pd.date_range(start, end, freq='1H')
labels = ['A', 'B', 'C']
index = pd.MultiIndex.from_product([timestamps, labels], names=["timestamp", "label"])
value = np.random.normal(size=12)
df = pd.DataFrame(value, index=index, columns=['value'])

df_gpu = cudf.from_pandas(df)

start = pd.Timestamp(datetime.strptime('2021-03-12 01:00',  '%Y-%m-%d %H:%M'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00',  '%Y-%m-%d %H:%M'))
timestamps = pd.date_range(start, end, freq='1H')



# indexing with a slice range also fails in this case.
df_gpu.loc[start:end] # Fails
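
For anyone whose timestamps carry an offset, as in the original reproducer ('+0000' parsed with %z), removing the timezone portion as described above could look roughly like this. This is a sketch using only the standard library; the helper name to_naive_utc is made up for illustration, and converting to UTC before dropping the offset is an assumption about what is wanted.

```python
from datetime import datetime, timezone


def to_naive_utc(dt):
    """Convert an aware datetime to UTC, then drop its tzinfo (hypothetical helper)."""
    if dt.tzinfo is None:
        return dt  # already naive, nothing to strip
    return dt.astimezone(timezone.utc).replace(tzinfo=None)


fmt = "%Y-%m-%d %H:%M%z"
aware = datetime.strptime("2021-03-12 02:00+0000", fmt)  # timezone-aware
naive = to_naive_utc(aware)  # same wall-clock time, no tzinfo
```

On the pandas side, Timestamp.tz_localize(None) and DatetimeIndex.tz_localize(None) drop the timezone in a similar way, keeping the local wall-clock time.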

pbruneau commented 1 year ago

Hi @wence-,

I'm installing via Docker, so I can't check this myself (23.06 does not seem to be available that way).

If I get it right, this means that:


works fine? I would already have a workaround, then!

wence- commented 1 year ago

If your dataframe has a multiindex, that example does not yet work. If you just have a normal index, it does work.
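
Given that range slicing works on a normal index, one possible stopgap is to move the non-timestamp level out of the index, slice on the plain timestamp index, and then restore the level. The sketch below demonstrates the idea in pandas only; whether each step behaves the same way on a cuDF DataFrame in a given release is an assumption that has not been verified here. The variable names flat, subset, and result are illustrative.

```python
import pandas as pd

# Same layout as the reproducer, with deterministic values for clarity.
timestamps = pd.date_range("2021-03-12 00:00", periods=4, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_product(
    [timestamps, ["A", "B", "C"]], names=["timestamp", "label"]
)
df = pd.DataFrame({"value": range(12)}, index=index)

start = pd.Timestamp("2021-03-12 01:00")
end = pd.Timestamp("2021-03-12 02:00")

# Move "label" out of the index so only a plain DatetimeIndex remains.
flat = df.reset_index("label")
# Label-based slicing on the plain index; both endpoints are inclusive.
subset = flat.loc[start:end]
# Restore the original two-level index.
result = subset.set_index("label", append=True)
```

The result holds the six rows for 01:00 and 02:00, indexed by (timestamp, label) as before.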

pbruneau commented 1 year ago

OK! Good luck with the development then (even if luck has nothing to do with it :)