Open aerdem4 opened 3 years ago
Describe the bug The workaround mentioned by @aerdem4 isn't working anymore.
This workaround isn't working
indices = cudf.MultiIndex.from_product([data_df.to_pandas().index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket'])
data_df = cudf.DataFrame().set_index(indices).join(data_df, how="left").fillna(method='ffill').reset_index(drop=True)
Kaggle notebook mentioned above.
Environment overview (please complete the following information) Kaggle GPU Docker, RAPIDS 21.06
Thanks for reporting -- could you please point us to the parquet
file being referenced in the code snippet above?
Hi @shwina ,
You just need to click on this link. This is a kaggle notebook I created solely for this issue. You need to create a kaggle account and you are good to go. To enable GPU on kaggle notebook, just press the button Accelerator on the right below Environment preferences.
And to use the notebook you need to press the edit and copy button on the right of the screen.
Hope the explanation is helpful.
Here is the link of the whole parquet dataset used in the competition.
Thanks for the additional context and links. Is there perhaps a direct link to the data, outside of Kaggle?
I can put the data into my google drive and share it with you if you want and besides you could download them from the kaggle website. Just create an account, sign in for the competition and you are good to go.
Because you can't work with the data unless you accept the terms of the competiton
Hi @medAli-ai - thanks again. Are you able to reproduce the issue on a smaller dataset, or better yet, independent of any particular dataset? That would greatly help us triage and resolve the problem.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This is not a MultiIndex issue, it's a data type issue.
Here is a minimal example that reproduces the problem without requiring the Kaggle dataset:
dtype = "int16"
# dtype = "int64" # We get the expected output if we uncomment this line
data_df = cudf.DataFrame({
"time_id": cudf.Series([5, 5, 5, 5, 5], dtype=dtype),
"seconds_in_bucket": cudf.Series([0, 1, 5, 6, 7], dtype=dtype)
})
offsets = data_df.groupby(
['time_id'], as_index=False
).agg(
{'seconds_in_bucket':'min'}
).reset_index(drop=True)
offsets.columns = ['time_id', 'offset']
data_df = cudf.merge(data_df, offsets, on = ['time_id'], how = 'left')
data_df = data_df.set_index(['time_id', 'seconds_in_bucket'])
columns = [col for col in data_df.columns.values]
midx = cudf.MultiIndex.from_product(
[data_df.to_pandas().index.levels[0], np.arange(0,5)],
names=['time_id', 'seconds_in_bucket']
)
data_df = data_df.reindex(midx, columns=columns).fillna(method='ffill').reset_index()
print(data_df['offset'])
Note that in the second line I don't observe the same issue with NaNs if I change the type. So what is happening?
Here is a stripped down example that demonstrates the problem
import cudf as pd
# import pandas as pd # This works
starting_dtype = "int16"
# starting_dtype = "int64" # This works
idx = pd.Series([0, 1, 2, 3, 4], dtype=starting_dtype)
data_df = pd.DataFrame({
"a": idx,
"b": [0, 1, 5, 6, 7]
}).set_index(["a"])
data_df = data_df.reindex(idx.astype("int64")).fillna(method='ffill')
assert data_df["b"].isna().sum() == 0
The issue is that that in the dataset in question, the index columns are int16 columns, whereas the MultiIndex used for reindexing contains a int64 columns by virtue of the np.arange
call. The smaller example above demonstrates that directly. In order to match pandas here we should be upcasting to handle a merge correctly. Under the hood reindexing involves a merge, so I think that this issue ultimately boils down to something like https://github.com/rapidsai/cudf/issues/14594.
@wence- @mroeschke do you have any thoughts on if/how we should be addressing the above?
Merging already upcasts to a common dtype, so the problem is in reindexing, which has an explicit carve-out for whether the types in the left and right indices match:
# indexed_frame.py:_reindex
idx_dtype_match = (df.index.nlevels == index.nlevels) and all(
_is_same_dtype(left_dtype, right_dtype)
for left_dtype, right_dtype in zip(
(col.dtype for col in df.index._data.columns),
(col.dtype for col in index._data.columns),
)
)
I think a good place to start would be to consolidate all of the dtype-matching/promotion utilities in one place. One of the whack-a-mole issues here is that right now we haven't codified anywhere the dtype promotion and matching rules (either in code or docs), so all of these different places have sui generis implementations that don't necessarily agree with each other.
Good catch, thanks for pointing that out. We could definitely get a lot of mileage out of standardizing more of these kinds of internal helper logic functions.
Describe the bug cuDF DataFrame is not reindexed as intented. Using a multiindex, it ends up in NaNs.
Steps/Code to reproduce bug
https://www.kaggle.com/medali1992/optiver-train-dataset?scriptVersionId=68637709
This workaround works:
Expected behavior I expect reindex function to get the correct values instead of NaNs as in pandas.
Environment overview (please complete the following information) Kaggle GPU Docker, RAPIDS 21.06