rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cuDF and Pandas return different results for ... #16507

Open Matt711 opened 2 months ago

Matt711 commented 2 months ago

**Describe the bug**

This issue documents differences found between cudf and pandas.

| cudf | Pandas | Difference | Fixed |
| --- | --- | --- | --- |
| `cudf.Index([True],dtype=object)` | `pd.Index([True],dtype=object)` | Attribute `inferred_type` is different. [cudf]: `string` [pandas]: `bool` | Won't fix (see this comment) |
| `cudf.date_range('2011-01-01', '2011-01-02', freq='h')` | `pd.date_range('2011-01-01', '2011-01-02', freq='h')` | Index lengths are different. [cudf]: `24` [pandas]: `25` | Closed by #16516 |
| `cdf[["a","a"]].shape # cdf = cudf.from_pandas(df)` | `df[["a","a"]].shape # df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})` | Shapes are different. [cudf]: `(3,1)` [pandas]: `(3,2)` | Closed by #16514 because duplicate column labels are generally not supported. |
| `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype` | `df.aggregate({'a': 'unique', 'b': 'unique'}).dtype` | dtypes are different. [cudf]: `ListDtype(int64)` [pandas]: `dtype('O')` | Won't fix because the cudf result is better than the pandas one. See this comment. |
| `cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | `pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | Index values are different. [cudf]: starts with `'2016-01-01 01:01:00'` [pandas]: starts with `'2016-01-03 01:01:00'` | |
| `cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | `pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | Types are different. [cudf]: `dict` [pandas]: `pd.Interval` | Won't fix because a dict is the "scalar" type of a cudf struct/interval type. |
| `cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")])` | `pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")])` | The `names` attribute is empty in cudf. | Closed by #16515 |
| `cudf.Series(range(2)).sample(n=2, replace=True).index` | `pd.Series(range(2)).sample(n=2, replace=True).index` | Index values are different. [cudf]: `Index([0, 1], dtype='int64')` [pandas]: `Index([0, 0], dtype='int64')` | |
| `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | `pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | dtypes are different. [cudf]: `dtype('int64')` [pandas]: `Int64Dtype()` | This should be fixed by closing #14149 |
| `cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index` | `pd.Series(range(2), index=["a", "b"]).rename(str.upper).index` | Index values are different. [cudf]: `Index(['a', 'b'], dtype='object')` [pandas]: `Index(['A', 'B'], dtype='object')` | #16525 now causes cudf to raise `NotImplementedError`. |
| `cudf.DataFrame({"A":[1,2]}).median()` | `pd.DataFrame({"A":[1,2]}).median()` | Types are different. [cudf]: `np.float64` [pandas]: `pd.Series` | Closed by #16527 |
| `cudf.DataFrame({"A":[1]})**cudf.Series([0])` | `pd.DataFrame({"A":[1]})**pd.Series([0])` | At positional index 0, first diff: nan != 1.0. [cudf]: `NA` [pandas]: `1.0` | Tracked in #7478 |
| `cudf.interval_range(start=0, end=1).repeat(3)` | `pd.interval_range(start=0, end=1).repeat(3)` | All Index values are different. [cudf]: `IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]')` [pandas]: `IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]')` | |

**Steps/Code to reproduce bug**

I'll add a repro for each one I find.

**Expected behavior**

It should probably match pandas.

mroeschke commented 2 months ago

cudf.Index([True],dtype=object)

Generally, for dtype=object findings there will always be a discrepancy, since in pandas dtype=object means "as pyobject" while in cudf it means "as string".

This is tough because if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, it can never be accelerated by cudf. There might need to be introspection of the first element to see whether cudf.pandas falls back or not based on whether that element is a string.
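The first-element introspection described above could look roughly like the sketch below. `can_accelerate_object_data` is a hypothetical helper written for illustration; it is not part of cudf or cudf.pandas, and skipping leading nulls is an added assumption that the comment does not state.

```python
from typing import Any, Iterable


def can_accelerate_object_data(values: Iterable[Any]) -> bool:
    """Hypothetical helper (not a cudf or cudf.pandas API).

    Inspects only the first non-null element and reports whether it is a
    string, i.e. whether dtype=object data could map to a cudf string column.
    """
    for value in values:
        if value is None:
            continue  # assumption: nulls carry no type information, skip them
        return isinstance(value, str)
    # Empty or all-null input: nothing to inspect, so stay conservative.
    return False


print(can_accelerate_object_data(["a", "b"]))  # True
print(can_accelerate_object_data([True]))      # False
```

Checking a single element keeps the test cheap, but a mixed input such as `["a", 1]` would still be classified as strings, so a real implementation would probably need a way to fall back later.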

mroeschke commented 2 months ago

More comments:

cdf.agg({'a': 'unique', 'b': 'unique'}).dtype

Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type).

cudf.Series(range(2)).sample(n=2, replace=True).index

Do your observations persist if you use a fixed random seed?

cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype

This is tracked in https://github.com/rapidsai/cudf/issues/14149

mroeschke commented 2 months ago

cudf.DataFrame({"A":[1]})**cudf.Series([0])

This is tracked in https://github.com/rapidsai/cudf/issues/7478, but IMO cudf is doing the right thing here
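For context on the `1.0` reported for pandas: assuming pandas defers to IEEE 754 floating-point power once the operands are aligned and the missing entry has become NaN (an interpretation, not something stated in this thread), the result follows from how `pow` treats a base of 1. Plain Python floats show the same rule:

```python
import math

nan = float("nan")

# IEEE 754 pow: a base of 1 yields 1 for any exponent, including NaN.
one_to_nan = math.pow(1.0, nan)

# Likewise, an exponent of 0 yields 1 for any base, including NaN.
nan_to_zero = math.pow(nan, 0.0)

# Most other arithmetic with NaN simply propagates NaN.
nan_plus_one = nan + 1.0

print(one_to_nan, nan_to_zero, math.isnan(nan_plus_one))  # 1.0 1.0 True
```

cudf reports `NA` in the table above, which is consistent with treating the missing entry as a null that propagates rather than as a floating-point NaN.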

Matt711 commented 2 months ago

> More comments:
>
> cdf.agg({'a': 'unique', 'b': 'unique'}).dtype
>
> Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type).

Sounds good. The DataFrames were:

df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)

> cudf.Series(range(2)).sample(n=2, replace=True).index
>
> Do your observations persist if you use a fixed random seed?

They do. I'm setting the seed like this:

import random
random.seed(2)
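One caveat on this seeding: pandas' `sample` draws from NumPy's random state when no `random_state` is given, so seeding Python's `random` module should not pin the result. A minimal sketch using the `random_state` argument follows. It covers pandas only. cudf's `sample` also accepts `random_state`, but whether the same seed reproduces the pandas draw is an open assumption, since the two libraries may use different generators.

```python
import pandas as pd

ser = pd.Series(range(2))

# Passing random_state pins the draw, independent of random.seed().
first = ser.sample(n=2, replace=True, random_state=2).index
second = ser.sample(n=2, replace=True, random_state=2).index

print(first.equals(second))  # True: same seed, same sample
```

With the seed pinned this way on both sides, any remaining difference between the cudf and pandas indexes would come from the libraries themselves, not from run-to-run randomness.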

> cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype
>
> This is tracked in #14149

Thanks!