Open pbruneau opened 3 years ago
Thanks for including a simple reproducer gist. I've included it below for ease of access.
from datetime import datetime
import pandas as pd
import cudf
import numpy as np
start = pd.Timestamp(datetime.strptime('2021-03-12 00:00+0000', '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 03:00+0000', '%Y-%m-%d %H:%M%z'))
timestamps = pd.date_range(start, end, freq='1H')
labels = ['A', 'B', 'C']
index = pd.MultiIndex.from_product([timestamps, labels], names=["timestamp", "label"])
value = np.random.normal(size=12)
df = pd.DataFrame(value, index=index, columns=['value'])
df_gpu = cudf.from_pandas(df)
stamp = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000', '%Y-%m-%d %H:%M%z'))
print(df.loc[stamp]) # SUCCEEDS
print(df_gpu.loc[stamp]) # FAILS
value
label
A 1.184793
B -0.253166
C -0.790236
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
234 try:
--> 235 return self._getitem_tuple_arg(arg)
236 except (TypeError, KeyError, IndexError, ValueError):
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/contextlib.py in inner(*args, **kwds)
74 with self._recreate_cm():
---> 75 return func(*args, **kwds)
76 return inner
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in _getitem_tuple_arg(self, arg)
360 else:
--> 361 return columns_df.index._get_row_major(columns_df, arg)
362 else:
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _get_row_major(self, df, row_tuple)
926 row_tuple = slice(row_tuple.start, self[-1], row_tuple.step)
--> 927 self._validate_indexer(row_tuple)
928 valid_indices = self._get_valid_indices_by_tuple(
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _validate_indexer(self, indexer)
958 else:
--> 959 for i in indexer:
960 self._validate_indexer(i)
TypeError: 'Timestamp' object is not iterable
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-87-fe779946243b> in <module>
15
16 print(df.loc[stamp]) # SUCCEEDS
---> 17 print(df_gpu.loc[stamp]) # FAILS
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in __getitem__(self, arg)
235 return self._getitem_tuple_arg(arg)
236 except (TypeError, KeyError, IndexError, ValueError):
--> 237 return self._getitem_tuple_arg((arg, slice(None)))
238 else:
239 if not isinstance(arg, tuple):
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/contextlib.py in inner(*args, **kwds)
73 def inner(*args, **kwds):
74 with self._recreate_cm():
---> 75 return func(*args, **kwds)
76 return inner
77
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/indexing.py in _getitem_tuple_arg(self, arg)
357 else:
358 if isinstance(arg, tuple):
--> 359 return columns_df.index._get_row_major(columns_df, arg[0])
360 else:
361 return columns_df.index._get_row_major(columns_df, arg)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _get_row_major(self, df, row_tuple)
925 if row_tuple.stop is None:
926 row_tuple = slice(row_tuple.start, self[-1], row_tuple.step)
--> 927 self._validate_indexer(row_tuple)
928 valid_indices = self._get_valid_indices_by_tuple(
929 df.index, row_tuple, len(df.index)
/raid/nicholasb/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/multiindex.py in _validate_indexer(self, indexer)
957 self._validate_indexer(indexer.stop)
958 else:
--> 959 for i in indexer:
960 self._validate_indexer(i)
961
TypeError: 'Timestamp' object is not iterable
It looks like we go down a codepath that expects an iterable, which explains why wrapping with a tuple works (and may resolve your problem in the short term):
print(df_gpu.loc[(stamp,)]) # SUCCEEDS
value
label
A 1.184793
B -0.253166
C -0.790236
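To make the "codepath that expects an iterable" concrete, here is a toy validator modeled loosely on the lines visible in the traceback. It is not the cuDF source. The function name `validate_indexer`, the `nlevels` parameter, and in particular the tuple branch are assumptions made for illustration. It uses a plain `datetime` in place of `pd.Timestamp`, which is likewise not iterable.

```python
from datetime import datetime


def validate_indexer(indexer, nlevels):
    """Toy stand-in for the validation step in the traceback (hypothetical, not cuDF code)."""
    if isinstance(indexer, tuple):
        # Assumption: a tuple is only checked for length and never iterated.
        if len(indexer) > nlevels:
            raise IndexError("indexer has more entries than index levels")
        return
    if isinstance(indexer, slice):
        # As in the traceback, each slice bound is validated recursively.
        validate_indexer(indexer.start, nlevels)
        validate_indexer(indexer.stop, nlevels)
        return
    # Fallback, as in `for i in indexer`: a scalar timestamp is not iterable.
    for item in indexer:
        validate_indexer(item, nlevels)


def outcome(indexer):
    """Return "ok" or the TypeError message raised during validation."""
    try:
        validate_indexer(indexer, nlevels=2)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


stamp = datetime(2021, 3, 12, 2, 0)
results = {
    "scalar": outcome(stamp),
    "tuple": outcome((stamp,)),
    "slice": outcome(slice(stamp, stamp)),
}
print(results)
```

Under these assumptions a bare scalar reaches the iterating fallback and fails, a one-element tuple passes, and a slice fails because its bounds are scalars again. That pattern matches the slice failure reported later in this thread.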
Hi @beckernick, thanks for the answer!
The "tuple trick" above does the job for accessing a single value. However, I run into trouble again if I want to fetch values for a timestamp range.
Elaborating from my previous gist example, if I type:
start = pd.Timestamp(datetime.strptime('2021-03-12 01:00+0000', '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000', '%Y-%m-%d %H:%M%z'))
print(df.loc[start:end])
I get the expected result:
value
timestamp label
2021-03-12 01:00:00+00:00 A -0.466112
B -0.781473
C -1.010174
2021-03-12 02:00:00+00:00 A 0.160179
B 1.007183
C -1.053772
With cuDF, the following raises the usual TypeError: 'Timestamp' object is not iterable:
print(df_gpu.loc[start:end])
Alternatively, trying:
print(df_gpu.loc[(start:end,)])
raises SyntaxError: invalid syntax. Using a regular Timestamp range with:
start = pd.Timestamp(datetime.strptime('2021-03-12 01:00+0000', '%Y-%m-%d %H:%M%z'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00+0000', '%Y-%m-%d %H:%M%z'))
timestamps = pd.date_range(start, end, freq='1H')
print(df_gpu.loc[(timestamps,)])
I get ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Any ideas for working around this issue?
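One possible way to select a timestamp range without passing a slice to `.loc` is a boolean mask built from the first index level. The sketch below uses pandas only, so it runs without a GPU. Whether the same calls (`get_level_values`, boolean-mask indexing) behave identically on a cuDF MultiIndex in the versions discussed here is an assumption that has not been verified. It uses timezone-naive timestamps and deterministic values rather than the random data from the gist.

```python
import numpy as np
import pandas as pd

# Same index layout as the gist: hourly timestamps crossed with three labels.
timestamps = pd.date_range("2021-03-12 00:00", periods=4, freq=pd.Timedelta(hours=1))
labels = ["A", "B", "C"]
index = pd.MultiIndex.from_product([timestamps, labels], names=["timestamp", "label"])
df = pd.DataFrame({"value": np.arange(12.0)}, index=index)

start = pd.Timestamp("2021-03-12 01:00")
end = pd.Timestamp("2021-03-12 02:00")

# Build a mask over the first level instead of slicing with .loc[start:end].
level = df.index.get_level_values("timestamp")
mask = (level >= start) & (level <= end)
selected = df[mask]
print(selected)
```

In pandas the masked selection returns the same six rows as `df.loc[start:end]`, with both bounds included.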
Hi,
I'm following up on the bug reported above. As noted in my last comment, using a tuple to access a Timestamp in the first level of a MultiIndex works around the issue reported initially, but that workaround fails if one wants to access a Timestamp range.
I realize that the title no longer accurately reflects the remaining bug. Should I create a new issue that singles out the Timestamp range bug, or rename this one?
We've been refactoring our MultiIndex implementation to help make it more efficient and maintainable. Is the pandas snippet in the comment above a minimal example of the desired behavior @pbruneau ?
Hi @beckernick,
Here is an updated minimal gist which lists in detail what works, what does not work, and the workarounds (as of version 21.08.02, installed on my side).
In a nutshell (please refer to the gist for details): with a MultiIndex and timestamps as the primary key, pandas allows this kind of operation:
df.loc[stamp]
df.loc[timestamps]
with stamp and timestamps being a valid timestamp and a timestamp range, respectively. I would like to do the same with cuDF, but as of v21.08.02 it is not possible.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
The problems reported above and highlighted in this minimal gist still occur with v21.12, with exactly the same error messages.
An update here (I am working through many indexing corner cases), apologies for the very slow responses.
In 23.06 (the current development version) there is an error constructing time-zone aware timestamps (previously they were accepted but handled incorrectly, now they are not accepted, soon they will be accepted and handled correctly). However, if I remove the timezone portion of the timestamps, then only your last example now fails (I reproduce here for posterity):
import pandas as pd
from datetime import datetime
import cudf
import numpy as np
start = pd.Timestamp(datetime.strptime('2021-03-12 00:00', '%Y-%m-%d %H:%M'))
end = pd.Timestamp(datetime.strptime('2021-03-12 03:00', '%Y-%m-%d %H:%M'))
timestamps = pd.date_range(start, end, freq='1H')
labels = ['A', 'B', 'C']
index = pd.MultiIndex.from_product([timestamps, labels], names=["timestamp", "label"])
value = np.random.normal(size=12)
df = pd.DataFrame(value, index=index, columns=['value'])
df_gpu = cudf.from_pandas(df)
start = pd.Timestamp(datetime.strptime('2021-03-12 01:00', '%Y-%m-%d %H:%M'))
end = pd.Timestamp(datetime.strptime('2021-03-12 02:00', '%Y-%m-%d %H:%M'))
timestamps = pd.date_range(start, end, freq='1H')
# SUCCEEDS
print(df.loc[timestamps])
# FAILS
print(df_gpu.loc[timestamps])
# indexing with a slice range also fails in this case.
df_gpu.loc[start:end] # Fails
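For readers who start from timezone-aware data, removing the timezone portion as described above can be sketched in pandas as follows. This only illustrates the pandas side with `tz_localize(None)`. How a given cuDF release treats the resulting naive timestamps is not shown here.

```python
import pandas as pd

# Timezone-aware hourly range, as in the original gist (UTC).
aware = pd.date_range(
    "2021-03-12 00:00", periods=4, freq=pd.Timedelta(hours=1), tz="UTC"
)

# tz_localize(None) drops the timezone and keeps the wall-clock values.
naive = aware.tz_localize(None)

# The same call works on a single timezone-aware Timestamp.
stamp = pd.Timestamp("2021-03-12 02:00+0000").tz_localize(None)

print(naive)
print(stamp)
```

Because the source timestamps are already in UTC, the wall-clock values are unchanged. For other timezones, `tz_convert(None)` would first convert to UTC before dropping the timezone.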
Hi @wence-,
I'm installing via Docker, so I can't check this myself (23.06 does not seem to be available that way).
If I get it right, this means that:
print(df_gpu.loc[timestamps])
works fine? I would already have a workaround, then!
If your dataframe has a multiindex, that example does not yet work. If you just have a normal index, it does work.
OK! Good luck with the development then (even if luck has nothing to do with it :)
Describe the bug
cuDF DataFrames indexed by a Timestamp range can be accessed using .loc[] without any problem. However, if the cuDF DataFrame is indexed with a MultiIndex with timestamps as the first key, .loc[] fails, while doing so causes no issue with pandas.
Steps/Code to reproduce bug
The following gist holds a self-contained example. The last line of the code fails with error:
TypeError: 'Timestamp' object is not iterable
Expected behavior
I would expect the pandas and cuDF snippets to behave similarly.