Open Matt711 opened 2 months ago
cudf.Index([True],dtype=object)
Generally, for dtype=object findings there will always be a discrepancy, since in pandas dtype=object means "as pyobject" while in cudf it means "as string".
This is tough: if the passed objects are strings, we want to accelerate the operation with cudf; otherwise, cudf can never accelerate it. cudf.pandas might need to introspect the first element and decide whether to fall back based on whether that element is a string.
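The introspection idea could look roughly like the sketch below. It is pure Python and does not call cudf or cudf.pandas; the function name `should_use_gpu_strings` and the first-element rule are assumptions made for illustration, not how cudf.pandas actually decides on fallback.

```python
from typing import Any, Sequence


def should_use_gpu_strings(values: Sequence[Any]) -> bool:
    """Hypothetical check: treat dtype=object input as strings only if
    the first element is a str. Empty input returns False (fall back)."""
    if len(values) == 0:
        return False
    return isinstance(values[0], str)


# Strings could stay on the accelerated path; anything else would fall back.
print(should_use_gpu_strings(["a", "b"]))  # True
print(should_use_gpu_strings([True]))      # False
```

Checking only the first element is cheap but can misclassify mixed input such as `["a", 1]`; a full scan would be safer at a higher cost.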
More comments:
cdf.agg({'a': 'unique', 'b': 'unique'}).dtype
Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type).
cudf.Series(range(2)).sample(n=2, replace=True).index
Do your observations persist if you use a fixed random seed?
cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype
This is tracked in https://github.com/rapidsai/cudf/issues/14149
cudf.DataFrame({"A":[1]})**cudf.Series([0])
This is tracked in https://github.com/rapidsai/cudf/issues/7478, but IMO cudf is doing the right thing here
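For the `**` case, one possible explanation of the pandas value is floating-point `pow` behaviour: pandas aligns the DataFrame columns with the Series index, the non-matching positions become NaN, and IEEE 754 style `pow` returns 1.0 for both `1 ** NaN` and `NaN ** 0`. The sketch below shows only that arithmetic with plain Python floats. That this is what produces the `1.0` reported here is an assumption, and the code does not use pandas or cudf.

```python
import math

nan = math.nan

# Python floats follow the C99 / IEEE 754 pow special cases:
one_to_nan = 1.0 ** nan   # base 1 gives 1.0 even when the exponent is NaN
nan_to_zero = nan ** 0.0  # exponent 0 gives 1.0 even when the base is NaN
other = 2.0 ** nan        # any other base propagates NaN

print(one_to_nan, nan_to_zero, other)  # 1.0 1.0 nan
```

A null-propagating model, where any missing operand yields a missing result, would give a missing value in all three cases, which is consistent with the `NA` reported for cudf.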
More comments:
cdf.agg({'a': 'unique', 'b': 'unique'}).dtype
Would be nice to know the starting DataFrame for this, but generally I think cudf gives a better result than pandas because cudf has a native list type and pandas doesn't (pandas stores lists in its object data type).
Sounds good. The DataFrames were:
df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})
cdf = cudf.from_pandas(df)
cudf.Series(range(2)).sample(n=2, replace=True).index
Do your observations persist if you use a fixed random seed?
They do. I'm setting the seed like this:
import random
random.seed(2)
cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype
This is tracked in #14149
Thanks!
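On the seeding point: `random.seed` seeds Python's standard-library generator, which pandas' `sample` does not draw from, so it would not be expected to make the sampled index reproducible. A minimal sketch using the `random_state` parameter of `Series.sample` follows. It uses pandas only; `cudf.Series.sample` also accepts `random_state`, but whether cudf then produces the same draw as pandas for a given seed is not something this sketch checks.

```python
import pandas as pd

s = pd.Series(range(2))

# Passing random_state makes each call reproducible on its own.
first = s.sample(n=2, replace=True, random_state=2).index.tolist()
second = s.sample(n=2, replace=True, random_state=2).index.tolist()

print(first == second)  # True
```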
Describe the bug
This issue is for documenting differences found between cudf and pandas.
| cudf | pandas | Difference |
| --- | --- | --- |
| `cudf.Index([True],dtype=object)` | `pd.Index([True],dtype=object)` | `inferred_type` are different. [cudf]: `string` [pandas]: `bool` |
| `cudf.date_range('2011-01-01', '2011-01-02', freq='h')` | `pd.date_range('2011-01-01', '2011-01-02', freq='h')` | [cudf]: `24` [pandas]: `25` |
| `cdf[["a","a"]].shape # cdf = cudf.from_pandas(df)` | `df[["a","a"]].shape # df = pd.DataFrame({"a":[0,1,2], "b": [1,2,3]})` | [cudf]: `(3,1)` [pandas]: `(3,2)` |
| `cdf.agg({'a': 'unique', 'b': 'unique'}).dtype` | `df.aggregate({'a': 'unique', 'b': 'unique'}).dtype` | [cudf]: `ListDtype(int64)` [pandas]: `dtype('O')` |
| `cudf.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | `pd.date_range('2016-01-01 01:01:00', periods=5, freq='W', tz=None)` | [cudf]: Starts with `'2016-01-01 01:01:00'` [pandas]: Starts with `'2016-01-03 01:01:00'` |
| `cudf.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | `pd.IntervalIndex.from_tuples([("2017-01-03", "2017-01-04"),],dtype='interval[datetime64[ns], right]').min()` | [cudf]: `dict` [pandas]: `pd.Interval`. `dict` is the "scalar" type of a cudf struct/interval type |
| `cudf.MultiIndex.from_arrays([cudf.Index([1],name="foo"),cudf.Index([2], name="bar")])` | `pd.MultiIndex.from_arrays([pd.Index([1],name="foo"),pd.Index([2], name="bar")])` | `names` attribute is empty in cudf |
| `cudf.Series(range(2)).sample(n=2, replace=True).index` | `pd.Series(range(2)).sample(n=2, replace=True).index` | [cudf]: `Index([0, 1], dtype='int64')` [pandas]: `Index([0, 0], dtype='int64')` |
| `cudf.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | `pd.Series(range(2)).sample(n=2, replace=True).convert_dtypes().dtype` | [cudf]: `dtype('int64')` [pandas]: `Int64Dtype()` |
| `cudf.Series(range(2), index=["a", "b"]).rename(str.upper).index` | `pd.Series(range(2), index=["a", "b"]).rename(str.upper).index` | [cudf]: `Index(['a', 'b'], dtype='object')` [pandas]: `Index(['A', 'B'], dtype='object')` `NotImplementedError`. |
| `cudf.DataFrame({"A":[1,2]}).median()` | `pd.DataFrame({"A":[1,2]}).median()` | [cudf]: `np.float64` [pandas]: `pd.Series` |
| `cudf.DataFrame({"A":[1]})**cudf.Series([0])` | `pd.DataFrame({"A":[1]})**pd.Series([0])` | [cudf]: `NA` [pandas]: `1.0` |
| `cudf.interval_range(start=0, end=1).repeat(3)` | `pd.interval_range(start=0, end=1).repeat(3)` | [cudf]: `IntervalIndex([(0, 0], (0, 0], (0, 0]], dtype='interval[int64, right]')` [pandas]: `IntervalIndex([(0, 1], (0, 1], (0, 1]], dtype='interval[int64, right]')` |
Steps/Code to reproduce bug
I'll add a repro for each one I find.

Expected behavior
It should probably match pandas.
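
The pandas values in the two `date_range` entries can be checked against plain calendar arithmetic. The sketch below uses only the standard library and assumes, as in pandas, that both endpoints are included and that `freq='W'` means weeks anchored on Sunday; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Hourly steps from 2011-01-01 to 2011-01-02 with both endpoints included.
start, end = datetime(2011, 1, 1), datetime(2011, 1, 2)
hourly = []
t = start
while t <= end:
    hourly.append(t)
    t += timedelta(hours=1)
print(len(hourly))  # 25

# First Sunday on or after 2016-01-01 01:01:00 (Monday is weekday 0, Sunday is 6).
first = datetime(2016, 1, 1, 1, 1)
first_sunday = first + timedelta(days=(6 - first.weekday()) % 7)
print(first_sunday)  # 2016-01-03 01:01:00
```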