[BUG] cuDF and Pandas return different results for ...

Matt711 commented 2 months ago

Describe the bug

This issue documents differences found between cudf and pandas.

| cudf | Pandas | Difference | Status |
| --- | --- | --- | --- |
| `cudf.Index([True],dtype=object)` | `pd.Index([True],dtype=object)` | Attribute `inferred_type` is different. [cudf]: `string` [pandas]: `bool` | Won't Fix (see this comment) |
| `cudf.date_range('2011-01-01', '2011-01-02', freq='h')` | `pd.date_range('2011-01-01', '2011-01-02', freq='h')` | Index lengths are different. [cudf]: `24` [pandas]: `25` | closed by #16516 |
| `cdf[["a","a"]].shape # cdf = cudf.from_pandas(df)` | `df[["a","a"]].shape # df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})` | Shapes are different. [cudf]: `(3,1)` [pandas]: `(3,2)` | closed by #16514 because duplicate column labels are generally not supported. |
| `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype` | `df.aggregate({'a': 'unique', 'b': 'unique'}).dtype` | dtypes are different. [cudf]: `ListDtype(int64)` [pandas]: `dtype('O')` | Won't fix because the cudf result is better than the pandas one. See this comment. |
| `cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | `pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | Index values are different. [cudf]: Starts with `'2016-01-01 01:01:00'` [pandas]: Starts with `'2016-01-03 01:01:00'` | |
| `cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | `pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | Types are different. [cudf]: `dict` [pandas]: `pd.Interval` | Won't fix because a `dict` is the "scalar" type of a cudf struct/interval type |
| `cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")])` | `pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")])` | `names` attribute is empty in cudf | closed by #16515 |
| `cudf.Series(range(2)).sample(n=2, replace=True).index` | `pd.Series(range(2)).sample(n=2, replace=True).index` | Index values are different. [cudf]: `Index([0, 1], dtype='int64')` [pandas]: `Index([0, 0], dtype='int64')` | |
| `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | `pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | dtypes are different. [cudf]: `dtype('int64')` [pandas]: `Int64Dtype()` | Should be fixed by closing #14149 |
| `cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index` | `pd.Series(range(2), index=["a", "b"]).rename(str.upper).index` | Index values are different. [cudf]: `Index(['a', 'b'], dtype='object')` [pandas]: `Index(['A', 'B'], dtype='object')` | #16525 now causes cudf to raise `NotImplementedError`. |
| `cudf.DataFrame({"A":[1,2]}).median()` | `pd.DataFrame({"A":[1,2]}).median()` | Types are different. [cudf]: `np.float64` [pandas]: `pd.Series` | closed by #16527 |
| `cudf.DataFrame({"A":[1]})**cudf.Series([0])` | `pd.DataFrame({"A":[1]})**pd.Series([0])` | At positional index 0, first diff: nan != 1.0 [cudf]: `NA` [pandas]: `1.0` | Tracked in #7478 |
| `cudf.interval_range(start=0, end=1).repeat(3)` | `pd.interval_range(start=0, end=1).repeat(3)` | All Index values are different. [cudf]: `IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]')` [pandas]: `IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]')` | |
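
The pandas column of several rows above can be checked without a GPU. The following is a minimal sketch that assumes only that pandas is installed; the cudf side needs a CUDA-capable machine and is not run here. The hourly frequency is passed as a `pd.Timedelta` instead of the `'h'` alias purely so the snippet also runs on older pandas versions; that substitution is an assumption of this sketch, not part of the original repro.

```python
import pandas as pd

# Starting frame used by the duplicate-label row.
df = pd.DataFrame({"a": [0, 1, 2], "b": [1, 2, 3]})

# Selecting the same label twice keeps both copies in pandas.
dup_shape = df[["a", "a"]].shape

# date_range includes both endpoints by default, so one day of hourly
# steps yields 25 timestamps.
hourly = pd.date_range("2011-01-01", "2011-01-02", freq=pd.Timedelta(hours=1))

# DataFrame.median reduces each column and returns a Series.
median = pd.DataFrame({"A": [1, 2]}).median()

# Series.rename with a callable applies it to every index label.
renamed = pd.Series(range(2), index=["a", "b"]).rename(str.upper).index

print(dup_shape)           # (3, 2)
print(len(hourly))         # 25
print(type(median).__name__)  # Series
print(list(renamed))       # ['A', 'B']
```

Running the same expressions with `cudf` in place of `pd` on a GPU machine is what produced the cudf column in the table.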

Steps/Code to reproduce bug

I'll add a repro for each one I find.

Expected behavior

It should probably match pandas.

mroeschke commented 2 months ago

> `cudf.Index([True],dtype=object)`

Generally, `dtype=object` findings will always show a discrepancy, since in pandas `dtype=object` means "as pyobject" while in cudf it means "as string".

This is tough because if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, it can never be accelerated by cudf. cudf.pandas might need to introspect the first element and decide whether to fall back based on whether that element is a string.
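
The first-element introspection suggested above could look roughly like the sketch below. The function name `looks_like_strings` is hypothetical, and this is not how cudf.pandas decides today; it only illustrates the idea of peeking at one element before choosing between the accelerated path and a fallback.

```python
from collections.abc import Iterable
from typing import Any


def looks_like_strings(values: Iterable[Any]) -> bool:
    """Hypothetical check: inspect only the first element."""
    for first in values:
        return isinstance(first, str)
    # Assumption of this sketch: empty input is treated as string-compatible.
    return True


print(looks_like_strings(["a", "b"]))  # True  -> could stay on the GPU path
print(looks_like_strings([True]))      # False -> would fall back to pandas
```

The check is cheap because it never scans the whole input, but mixed content such as `["a", 1]` would pass it, so a real implementation would still need a way to recover when later elements are not strings.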

mroeschke commented 2 months ago

More comments:

> `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype`

It would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type).

> `cudf.Series(range(2)).sample(n=2, replace=True).index`

Do your observations persist if you use a fixed random seed?

> `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype`

This is tracked in https://github.com/rapidsai/cudf/issues/14149

mroeschke commented 2 months ago

> `cudf.DataFrame({"A":[1]})**cudf.Series([0])`

This is tracked in https://github.com/rapidsai/cudf/issues/7478, but IMO cudf is doing the right thing here

Matt711 commented 2 months ago

> More comments:
>
> `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype`
>
> Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type)

Sounds good. The DataFrames were:

df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)

> `cudf.Series(range(2)).sample(n=2, replace=True).index`
>
> Do your observations persist if you use a fixed random seed?

They do. I'm setting the seed like this:

import random
random.seed(2)
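
One caveat worth checking here: as far as I know, pandas draws from NumPy's global random state when `random_state` is not given, so seeding Python's `random` module is not expected to pin the sample. A minimal pandas-only sketch of two ways to make the draw reproducible follows; cudf's `sample` also accepts a `random_state` argument, but that side is not run here.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(2))

# Option 1: pass the seed directly to sample().
first = s.sample(n=2, replace=True, random_state=2).index.tolist()
second = s.sample(n=2, replace=True, random_state=2).index.tolist()

# Option 2: seed NumPy's global generator before each call.
np.random.seed(2)
third = s.sample(n=2, replace=True).index.tolist()
np.random.seed(2)
fourth = s.sample(n=2, replace=True).index.tolist()

print(first == second)  # True
print(third == fourth)  # True
```

Even with matching seeds, pandas and cudf use different random number generators, so identical index values across the two libraries should not be expected.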

> `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype`
>
> This is tracked in #14149

Thanks!

rapidsai / cudf

[BUG] cuDF and Pandas return different results for ... #16507