rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] dataframe reindex NaN #8801

Open aerdem4 opened 3 years ago

aerdem4 commented 3 years ago

Describe the bug cuDF DataFrame is not reindexed as intended. When using a MultiIndex, the result ends up full of NaNs.

Steps/Code to reproduce bug

data_df = cudf.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/c439ef22282f412ba39e9137a3fdabac.parquet')
offsets = data_df.groupby(['time_id'], as_index=False).agg({'seconds_in_bucket':'min'}).reset_index(drop=True)
offsets.columns = ['time_id', 'offset']
data_df = cudf.merge(data_df, offsets, on = ['time_id'], how = 'left')
data_df.seconds_in_bucket = data_df.seconds_in_bucket - data_df.offset
# MultiIndex.from_product uses pandas in the background
# That's why we need to transform the data into pd dataframe
data_df = data_df.set_index(['time_id', 'seconds_in_bucket'])
columns = [col for col in data_df.columns.values]
data_df = data_df.reindex(cudf.MultiIndex.from_product([data_df.to_pandas().index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket']), columns=columns).fillna(method='ffill')
data_df = cudf.DataFrame(data_df.reset_index())

https://www.kaggle.com/medali1992/optiver-train-dataset?scriptVersionId=68637709

This workaround works:

indices = cudf.MultiIndex.from_product([data_df.to_pandas().index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket'])
data_df = cudf.DataFrame().set_index(indices).join(data_df, how="left").fillna(method='ffill').reset_index(drop=True)

Expected behavior I expect the reindex function to produce the correct values, as in pandas, instead of NaNs.
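For reference, the expected pandas behavior can be sketched with a small, self-contained example (hypothetical data standing in for the Optiver parquet file): reindexing onto a denser MultiIndex introduces NaNs only at the newly inserted positions, and forward-fill then removes them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real dataset: sparse seconds_in_bucket.
df = pd.DataFrame({
    "time_id": [5, 5, 5],
    "seconds_in_bucket": [0, 2, 4],
    "price": [1.0, 2.0, 3.0],
}).set_index(["time_id", "seconds_in_bucket"])

# Reindex onto the dense grid of seconds, then forward-fill the gaps.
midx = pd.MultiIndex.from_product(
    [[5], np.arange(5)], names=["time_id", "seconds_in_bucket"]
)
out = df.reindex(midx).ffill()
# out["price"] is [1.0, 1.0, 2.0, 2.0, 3.0] with no NaNs.
```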

Environment overview (please complete the following information) Kaggle GPU Docker, RAPIDS 21.06

medAli-ai commented 3 years ago

Describe the bug The workaround mentioned by @aerdem4 no longer works:

indices = cudf.MultiIndex.from_product([data_df.to_pandas().index.levels[0], np.arange(0,600)], names = ['time_id', 'seconds_in_bucket'])
data_df = cudf.DataFrame().set_index(indices).join(data_df, how="left").fillna(method='ffill').reset_index(drop=True)

See the Kaggle notebook mentioned above.

Environment overview (please complete the following information) Kaggle GPU Docker, RAPIDS 21.06

shwina commented 3 years ago

Thanks for reporting -- could you please point us to the parquet file being referenced in the code snippet above?

medAli-ai commented 3 years ago

Hi @shwina ,

You just need to click on this link; it is a Kaggle notebook I created solely for this issue. You need to create a Kaggle account and you are good to go. To enable the GPU in a Kaggle notebook, press the Accelerator button on the right, below Environment preferences.

To use the notebook, press the Edit and Copy button on the right of the screen.

Hope the explanation is helpful.

medAli-ai commented 3 years ago

Here is the link to the whole parquet dataset used in the competition.

beckernick commented 3 years ago

Thanks for the additional context and links. Is there perhaps a direct link to the data, outside of Kaggle?

medAli-ai commented 3 years ago

I can put the data into my Google Drive and share it with you if you want. Alternatively, you can download it from the Kaggle website: just create an account, sign up for the competition, and you are good to go.

medAli-ai commented 3 years ago

This is because you can't work with the data unless you accept the terms of the competition.

shwina commented 3 years ago

Hi @medAli-ai - thanks again. Are you able to reproduce the issue on a smaller dataset, or better yet, independent of any particular dataset? That would greatly help us triage and resolve the problem.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 6 months ago

This is not a MultiIndex issue, it's a data type issue.

Here is a minimal example that reproduces the problem without requiring the Kaggle dataset:

dtype = "int16"
# dtype = "int64" # We get the expected output if we uncomment this line

data_df = cudf.DataFrame({
    "time_id": cudf.Series([5, 5, 5, 5, 5], dtype=dtype),
    "seconds_in_bucket": cudf.Series([0, 1, 5, 6, 7], dtype=dtype)
})

offsets = data_df.groupby(
    ['time_id'], as_index=False
).agg(
    {'seconds_in_bucket':'min'}
).reset_index(drop=True)
offsets.columns = ['time_id', 'offset']
data_df = cudf.merge(data_df, offsets, on = ['time_id'], how = 'left')
data_df = data_df.set_index(['time_id', 'seconds_in_bucket'])
columns = [col for col in data_df.columns.values]
midx = cudf.MultiIndex.from_product(
    [data_df.to_pandas().index.levels[0], np.arange(0,5)],
    names=['time_id', 'seconds_in_bucket']
)
data_df = data_df.reindex(midx, columns=columns).fillna(method='ffill').reset_index()
print(data_df['offset'])

Note that, as the comment on the second line indicates, I don't observe the same NaN issue if I change the dtype to int64. So what is happening?

Here is a stripped-down example that demonstrates the problem:

import cudf as pd
# import pandas as pd  # This works

starting_dtype = "int16"
# starting_dtype = "int64"  # This works
idx = pd.Series([0, 1, 2, 3, 4], dtype=starting_dtype)
data_df = pd.DataFrame({
    "a": idx,
    "b": [0, 1, 5, 6, 7]
}).set_index(["a"])
data_df = data_df.reindex(idx.astype("int64")).fillna(method='ffill')
assert data_df["b"].isna().sum() == 0

The issue is that in the dataset in question, the index columns are int16 columns, whereas the MultiIndex used for reindexing contains int64 columns by virtue of the np.arange call. The smaller example above demonstrates that directly. To match pandas here, we should be upcasting so that the merge is handled correctly. Under the hood, reindexing involves a merge, so I think this issue ultimately boils down to something like https://github.com/rapidsai/cudf/issues/14594.
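For reference, re-running the stripped-down example with the commented-out pandas import passes the assertion, because pandas matches index values across integer widths. This is a sketch of the behavior cuDF should match, not of cuDF's current behavior:

```python
import pandas as pd

# Same stripped-down example, run against pandas: the int16 index and
# the int64 reindex target still match by value, so no NaNs appear.
idx = pd.Series([0, 1, 2, 3, 4], dtype="int16")
data_df = pd.DataFrame({"a": idx, "b": [0, 1, 5, 6, 7]}).set_index(["a"])
data_df = data_df.reindex(idx.astype("int64")).ffill()
assert data_df["b"].isna().sum() == 0  # passes under pandas
```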

vyasr commented 6 months ago

@wence- @mroeschke do you have any thoughts on whether, and how, we should address the above?

wence- commented 6 months ago

Merging already upcasts to a common dtype, so the problem is in reindexing, which has an explicit carve-out for whether the types in the left and right indices match:

# indexed_frame.py:_reindex
            idx_dtype_match = (df.index.nlevels == index.nlevels) and all(
                _is_same_dtype(left_dtype, right_dtype)
                for left_dtype, right_dtype in zip(
                    (col.dtype for col in df.index._data.columns),
                    (col.dtype for col in index._data.columns),
                )
            )

I think a good place to start would be to consolidate all of the dtype-matching/promotion utilities in one place. One of the whack-a-mole issues here is that we haven't codified the dtype promotion and matching rules anywhere (either in code or docs), so all of these different places have sui generis implementations that don't necessarily agree with each other.
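As a sketch of what a consolidated rule could look like (a hypothetical helper, not an existing cuDF API), the index dtypes could be promoted via NumPy's promotion table before the underlying merge, so that int16 vs int64 index columns resolve to int64 rather than failing the exact-match check:

```python
import numpy as np

# Hypothetical helper: resolve a common dtype for a pair of index
# columns using NumPy's promotion rules.
def common_index_dtype(left_dtype, right_dtype):
    return np.promote_types(left_dtype, right_dtype)

# int16 vs int64 promotes to int64, matching the case in this issue.
assert common_index_dtype("int16", "int64") == np.dtype("int64")
```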

vyasr commented 6 months ago

Good catch, thanks for pointing that out. We could definitely get a lot of mileage out of standardizing more of this kind of internal helper logic.