rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Column types after cudf.crosstab() do not match Pandas result #11896

Open miguelusque opened 2 years ago

miguelusque commented 2 years ago

Describe the bug

Hi,

While porting some code from Pandas, I noticed that the column types returned by cudf.crosstab() do not match the Pandas result.

Please, see a reproducer below:

import cudf
import pandas as pd

print(cudf.__version__, pd.__version__, '\n')

features = {'x': ['x1', 'x1', 'x2', 'x2', 'x2', 'x1', 'x2'],
            'y': ['y1', 'y2', 'y1', 'y2', 'y3', 'y1', 'y3']}

pdf = pd.DataFrame(features)
pdf = pd.crosstab(index=pdf.x, columns=pdf.y)

gdf = cudf.DataFrame(features)
gdf = cudf.crosstab(index=gdf.x, columns=gdf.y)

print(gdf.to_pandas().equals(pdf), '\n')
print(pdf.columns, '\n', gdf.columns)

Output:

22.10.00a0+g17868b7 1.5.0 

False 

Index(['y1', 'y2', 'y3'], dtype='object', name='y') 
 MultiIndex([('y1',),
            ('y2',),
            ('y3',)],
           names=['y'])

Expected behavior

I would like the cuDF and Pandas results to match.

GregoryKimball commented 2 years ago

Thanks @miguelusque for raising this issue. ~I know that we've had trouble supporting the pandas MultiIndex behavior. I believe there was a proposal to drop MultiIndex support - how big of an impact would that be for the users you've worked with?~ To my surprise, it is cudf that is generating the MultiIndex - we should just return a simple Index instead!
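For illustration, the difference between the two column objects can be reproduced with pandas alone. This is a minimal sketch, not the cuDF fix itself: the labels are copied from the reproducer output above, and the variable names `multi`, `plain`, and `flattened` are made up for the example.

```python
import pandas as pd

# Column labels as cuDF returned them in the report: a one-level MultiIndex.
multi = pd.MultiIndex.from_tuples([("y1",), ("y2",), ("y3",)], names=["y"])

# Column labels as pandas.crosstab returned them: a plain Index.
plain = pd.Index(["y1", "y2", "y3"], name="y")

# Each MultiIndex entry is a 1-tuple, while each Index entry is a scalar.
print(multi.nlevels, list(multi), list(plain))

# Taking the only level yields a plain Index that keeps the level name.
flattened = multi.get_level_values(0)
print(type(flattened).__name__, list(flattened), flattened.name)
```

Assuming the columns of the crosstab result behave like the MultiIndex shown in the report, the same `get_level_values(0)` call on `gdf.columns` could serve as a stopgap until a simple Index is returned.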

miguelusque commented 2 years ago

Hi @GregoryKimball, thank you!

Please find below the original code that I was porting from Pandas to cuDF. Unfortunately, the .add_prefix() and .add_suffix() methods do not work with a MultiIndex.

Original code:

df = df.to_pandas()
# Which departments has each user ordered products from?
df_ = pd.crosstab(df.user_id, df.department_id).add_prefix('user_department_').add_suffix('_freq')
feature_list.append(cudf.from_pandas(df_))

Workaround:

# Which departments has each user ordered products from?
df_ = cudf.crosstab(df.user_id, df.department_id)
df_.columns = ['user_department_' + str(c[0]) + '_freq' for c in df_.columns]
feature_list.append(df_)
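
One caveat with the workaround: `c[0]` assumes every column label is a 1-tuple. If cudf.crosstab() later returns a simple Index, `c[0]` would take the first character of a string label or fail on an integer label. A minimal sketch of a variant that tolerates both shapes follows; `label_to_name` is a hypothetical helper written for this example, not part of cuDF or Pandas.

```python
def label_to_name(label, prefix="user_department_", suffix="_freq"):
    """Build a flat column name from a label that is a 1-tuple or a scalar."""
    # Entries of a single-level MultiIndex arrive as 1-tuples, e.g. (4,).
    if isinstance(label, tuple) and len(label) == 1:
        label = label[0]
    return f"{prefix}{label}{suffix}"


# Labels as a one-level MultiIndex yields them, and as a plain Index would.
multi_labels = [(1,), (2,), (3,)]
plain_labels = [1, 2, 3]

# Both shapes produce the same flat names.
print([label_to_name(c) for c in multi_labels])
print([label_to_name(c) for c in plain_labels])
```

With this helper, the assignment in the workaround would become `df_.columns = [label_to_name(c) for c in df_.columns]`.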

Hope it helps!