Open notwopr opened 7 months ago
I found that if I convert the date column, which holds datetime.date values, to pandas datetime format, it works. Here's the original dataframe:
date NVDA
0 1999-01-22 0.376356
1 1999-01-25 0.415730
2 1999-01-26 0.383428
3 1999-01-27 0.382281
4 1999-01-28 0.381134
... ... ...
6277 2024-01-03 475.690000
6278 2024-01-04 479.980000
6279 2024-01-05 490.970000
6280 2024-01-08 522.530000
6281 2024-01-09 531.400000
I ran the following to convert the date column to pandas datetime datatype:
df['date'] = pd.to_datetime(df['date'])
Ran datatype check again and got the following:
{
'NVDA': set([<class 'float'>]),
'date': set([<class 'pandas._libs.tslibs.timestamps.Timestamp'>]),
}
date datetime64[ns]
NVDA float64
dtype: object
And now when I run cudf.from_pandas(df), I no longer get the error:
print(cudf.from_pandas(df))
date NVDA
0 1999-01-22 0.376356
1 1999-01-25 0.415730
2 1999-01-26 0.383428
3 1999-01-27 0.382281
4 1999-01-28 0.381134
... ... ...
6277 2024-01-03 475.690000
6278 2024-01-04 479.980000
6279 2024-01-05 490.970000
6280 2024-01-08 522.530000
6281 2024-01-09 531.400000
[6282 rows x 2 columns]
It seems that even if a column contains only a single datatype, datetime.date, cudf will still throw the mixed type error. Converting the column to the pandas datetime datatype cures it. It'd be nice if cudf could accept datetime.date columns as is, without the conversion.
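The workaround above can be generalized to any dataframe. Below is a minimal sketch, assuming the affected columns are object-dtype columns whose values are all datetime.date instances; the helper name dates_to_datetime64 is hypothetical and not part of pandas or cudf.

```python
import datetime

import pandas as pd


def dates_to_datetime64(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with all-date object columns converted to datetime64.

    Hypothetical helper for illustration only. Columns containing anything
    other than datetime.date values (including None/NaN) are left untouched.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if not pd.api.types.is_object_dtype(s) or len(s) == 0:
            continue
        # datetime.datetime subclasses datetime.date, so both are matched here.
        if s.map(lambda v: isinstance(v, datetime.date)).all():
            out[col] = pd.to_datetime(s)
    return out


df = pd.DataFrame(
    {
        "date": [datetime.date(1999, 1, 22), datetime.date(1999, 1, 25)],
        "NVDA": [0.376356, 0.415730],
    }
)
fixed = dates_to_datetime64(df)
print(fixed.dtypes)
# `fixed` is what would then be passed to cudf.from_pandas (not run here,
# since that call needs a GPU environment).
```

The original dataframe is copied rather than modified in place, so the unconverted data stays available for comparison.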
Thanks for reporting and for the thorough analysis! This issue stems from the fact that pandas uses the object datatype for storing datetime.date values:
In [1]: import pandas as pd
In [2]: import datetime
In [3]: s = pd.Series([datetime.date.fromisoformat("2001-01-01"), datetime.date.fromisoformat("2001-01-02")])
In [4]: s
Out[4]:
0 2001-01-01
1 2001-01-02
Now, a pandas Series of object data type represents a collection of arbitrary Python objects (possibly of differing types). When cudf sees such a Series, it tries to interpret the values as strings, lists, or dictionaries (the data types that we support). If none of those work, we throw this error.
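To see the distinction between the column dtype and the values it holds, the following sketch uses only pandas. It relies on pandas.api.types.infer_dtype, which inspects the actual values; it illustrates the pandas side of the explanation and is not how cudf performs its own check.

```python
import datetime

import pandas as pd
from pandas.api.types import infer_dtype

s = pd.Series([datetime.date(2001, 1, 1), datetime.date(2001, 1, 2)])

# The column dtype is the generic object dtype, which says nothing
# about what the individual values are.
print(s.dtype)         # object

# infer_dtype looks at the values themselves and reports that they are dates.
print(infer_dtype(s))  # date

# Converting gives the column a real datetime64 dtype.
converted = pd.to_datetime(s)
print(converted.dtype)
```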
I think it would be less confusing if we threw a ValueError here rather than a MixedTypeError, with a more general error message like "could not convert values in column to a supported data type".
Describe the bug
I load a pandas dataframe into cudf using cudf.from_pandas(originaldataframe) and it gives me a mixed type error.
Steps/Code to reproduce bug
Original Dataframe:
I created the following function to build a dictionary of all the unique datatypes found in each column, even if there is more than one type in a single column. Here's the function:
Here's the output of the column data types:
As you can see, the function only found one datatype for each column.
Alternatively, if I use pandas' built-in dataframe.dtypes attribute, I get the following:
So by both tests, each column has only one data type, though .dtypes shows "object" as the datatype. Perhaps that's causing cudf to throw the error?
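For readers who want to repeat the two checks described above, here is a minimal sketch. It is not the function from the original report, which is not reproduced in this thread; the name column_value_types and the two-row sample data are assumptions made for illustration.

```python
import datetime

import pandas as pd


def column_value_types(df: pd.DataFrame) -> dict:
    """Map each column name to the set of Python types found among its values.

    Hypothetical stand-in for the check described above.
    """
    return {col: set(df[col].map(type)) for col in df.columns}


df = pd.DataFrame(
    {
        "date": [datetime.date(1999, 1, 22), datetime.date(1999, 1, 25)],
        "NVDA": [0.376356, 0.415730],
    }
)

# Check 1: per-value Python types. A single entry per column means the
# column is not mixed at the Python level.
value_types = column_value_types(df)
print(value_types)

# Check 2: pandas column dtypes. The date column shows up as "object"
# even though every value in it has the same type.
print(df.dtypes)
```

A single-type result from the first check together with an object dtype from the second is the combination described in this report.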
Here is another example:
Running the following:
gives you the following:
As you can see, each column has only one data type. Yet when I try to convert df to a cudf using cudf.from_pandas(df), it throws the same mixed type error.
Expected behavior
There's no apparent mixed-type column in the dataframe, so cudf should be able to load it without throwing the mixed type error.
Environment overview
CUDA 12 installed, NVIDIA GTX 1080 graphics card
Environment details
Error thrown in detail: