rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

MixedTypeError when there is no mixed type [BUG] #14793

Open notwopr opened 7 months ago

notwopr commented 7 months ago

Describe the bug

I load a pandas DataFrame into cuDF using cudf.from_pandas(originaldataframe) and it raises a MixedTypeError.

Steps/Code to reproduce bug

Original DataFrame:

     symbol                                               name STOCK_TYPE  first_date   last_date    AGE                INDUSTRY   marketcap
0     WHFBZ      WhiteHorse Finance, Inc. 6.50% Notes due 2025     common  2018-11-30  2021-12-16   1112                 Unknown    0.000000
1       ANH                 Anworth Mortgage Asset Corporation     common  1998-03-12  2021-03-19   8408                 Unknown    0.000000
2       CEE          The Central and Eastern Europe Fund, Inc.     common  1990-02-28  2024-01-09  12368        Asset Management    0.062059
3      SEMR                             SEMrush Holdings, Inc.     common  2021-03-24  2024-01-09   1021  Software - Application    1.780361
4      BWMX  Betterware de Mexico, S.A.P.I. de C.V. Ordinar...     common  2020-03-16  2024-01-09   1394        Specialty Retail    0.470934
...     ...                                                ...        ...         ...         ...    ...                     ...         ...
9281    GHI  Greystone Housing Impact Investors LP Benefici...     common  1986-04-02  2024-01-09  13796        Mortgage Finance    0.387465
9282    LMT                              Lockheed Martin Corp.     common  1977-01-03  2024-01-09  17172     Aerospace & Defense  113.205101
9283   ^DJI                                          Dow Jones      index  1970-01-02  2024-01-08  19729                   Index    0.000000
9284   ^INX                                            S&P 500      index  1970-01-02  2024-01-08  19729                   Index    0.000000
9285  ^IXIC                                             NASDAQ      index  1971-02-05  2024-01-08  19330                   Index    0.000000

I wrote the following function to build a dictionary of the unique Python types found in each column, even if a single column contains more than one type:

def get_column_data_types(dataframe):
    column_data_types = {}

    for column in dataframe.columns:
        unique_types = set(type(value) for value in dataframe[column])
        column_data_types[column] = unique_types

    return column_data_types

Here's the output of the column data types:

{
    'AGE': set([<class 'int'>]),
    'INDUSTRY': set([<class 'str'>]),
    'STOCK_TYPE': set([<class 'str'>]),
    'first_date': set([<class 'datetime.date'>]),
    'last_date': set([<class 'datetime.date'>]),
    'marketcap': set([<class 'float'>]),
    'name': set([<class 'str'>]),
    'symbol': set([<class 'str'>]),
}

As you can see, the function only found one datatype for each column.

Alternatively, if I use pandas' built-in dataframe.dtypes attribute, I get the following:

symbol         object
name           object
STOCK_TYPE     object
first_date     object
last_date      object
AGE             int64
INDUSTRY       object
marketcap     float64
dtype: object

So by both tests, each column holds only one data type, although .dtypes reports "object" for several of them. Perhaps that is what causes cudf to throw the error?
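One way to see what an object column actually holds is pandas' pd.api.types.infer_dtype, which inspects the values instead of the column dtype. The following is only a minimal sketch: the helper name describe_object_columns and the small sample frame are made up for illustration and are not part of the original report.

```python
import datetime

import pandas as pd


def describe_object_columns(dataframe):
    """Map each object-dtype column to the value kind pandas infers for it."""
    return {
        column: pd.api.types.infer_dtype(dataframe[column], skipna=True)
        for column in dataframe.columns
        if dataframe[column].dtype == object
    }


# Small made-up frame shaped like the second example in this report.
sample = pd.DataFrame(
    {
        "date": [datetime.date(1999, 1, 22), datetime.date(1999, 1, 25)],
        "NVDA": [0.376356, 0.415730],
    }
)

print(describe_object_columns(sample))
```

A result of "date" for a column means it holds datetime.date objects and not strings, even though .dtypes reports "object" in both cases.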

Here is another example:

            date        NVDA
0     1999-01-22    0.376356
1     1999-01-25    0.415730
2     1999-01-26    0.383428
3     1999-01-27    0.382281
4     1999-01-28    0.381134
...          ...         ...
6277  2024-01-03  475.690000
6278  2024-01-04  479.980000
6279  2024-01-05  490.970000
6280  2024-01-08  522.530000
6281  2024-01-09  531.400000

Running the following:

from pprint import pprint

pprint(get_column_data_types(df))
pprint(df.dtypes)

gives you the following:

{'NVDA': set([<class 'float'>]), 'date': set([<class 'datetime.date'>])}

date     object
NVDA    float64
dtype: object

As you can see, each column has only one data type. Yet when I try to convert df to a cuDF DataFrame using cudf.from_pandas(df), it throws the same MixedTypeError.

Expected behavior

No column in the dataframe appears to contain mixed types, so cudf.from_pandas should convert the dataframe without throwing a MixedTypeError.

Environment overview

CUDA 12 installed, NVIDIA GTX 1080 graphics card

Environment details

Error thrown in detail:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/notwopr/.local/lib/python3.10/site-packages/cudf/pandas/__main__.py", line 91, in <module>
    main()
  File "/home/notwopr/.local/lib/python3.10/site-packages/cudf/pandas/__main__.py", line 87, in main
    runpy.run_path(args.args[0], run_name="__main__")
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "scratchp7.py", line 29, in <module>
    stockdatastats = FileOperations().readpkl(oldfn, DirPaths().full_info_db)
  File "/home/notwopr/beluga/beluga3/file_functions.py", line 57, in readpkl
    data = cudf.from_pandas(data)
  File "/home/notwopr/.local/lib/python3.10/site-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)
  File "/home/notwopr/.local/lib/python3.10/site-packages/cudf/core/dataframe.py", line 7891, in from_pandas
    return DataFrame.from_pandas(obj, nan_as_null=nan_as_null)
  File "/home/notwopr/.local/lib/python3.10/site-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)
  File "/home/notwopr/.local/lib/python3.10/site-packages/cudf/core/dataframe.py", line 5237, in from_pandas
    data[col_name] = column.as_column(
  File "/home/notwopr/.local/lib/python3.10/site-packages/cudf/core/column/column.py", line 2279, in as_column
    raise MixedTypeError("Cannot create column with mixed types")
cudf.errors.MixedTypeError: Cannot create column with mixed types


notwopr commented 7 months ago

I found that if I convert the date column from datetime.date objects to the pandas datetime format, it works. Here's the original dataframe:

            date        NVDA
0     1999-01-22    0.376356
1     1999-01-25    0.415730
2     1999-01-26    0.383428
3     1999-01-27    0.382281
4     1999-01-28    0.381134
...          ...         ...
6277  2024-01-03  475.690000
6278  2024-01-04  479.980000
6279  2024-01-05  490.970000
6280  2024-01-08  522.530000
6281  2024-01-09  531.400000

I ran the following to convert the date column to pandas datetime datatype:

df['date'] = pd.to_datetime(df['date'])

Ran datatype check again and got the following:

{
    'NVDA': set([<class 'float'>]),
    'date': set([<class 'pandas._libs.tslibs.timestamps.Timestamp'>]),
}

date    datetime64[ns]
NVDA           float64
dtype: object

And now when I run cudf.from_pandas(df), I no longer get the error:

print(cudf.from_pandas(df))
           date        NVDA
0    1999-01-22    0.376356
1    1999-01-25    0.415730
2    1999-01-26    0.383428
3    1999-01-27    0.382281
4    1999-01-28    0.381134
...         ...         ...
6277 2024-01-03  475.690000
6278 2024-01-04  479.980000
6279 2024-01-05  490.970000
6280 2024-01-08  522.530000
6281 2024-01-09  531.400000

[6282 rows x 2 columns]

notwopr commented 7 months ago

It seems that even if every value in a column is a datetime.date, cudf throws the MixedTypeError. Converting the column to the pandas datetime dtype cures it. It'd be nice if cudf could accept datetime.date columns as is, without the conversion.
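The workaround above can be applied to every affected column at once. This is a rough sketch under a few assumptions: the helper name convert_date_columns and the sample frame are hypothetical, and it relies on pd.api.types.infer_dtype reporting "date" for columns of datetime.date values. The cudf call is left as a comment so the snippet runs with pandas alone.

```python
import datetime

import pandas as pd


def convert_date_columns(dataframe):
    """Return a copy with datetime.date object columns converted to datetime64."""
    converted = dataframe.copy()
    for column in converted.columns:
        series = converted[column]
        # infer_dtype reports "date" when the non-null values are datetime.date objects.
        if series.dtype == object and pd.api.types.infer_dtype(series, skipna=True) == "date":
            converted[column] = pd.to_datetime(series)
    return converted


# Made-up sample; the real frames in this thread have thousands of rows.
frame = pd.DataFrame(
    {
        "date": [datetime.date(2024, 1, 8), datetime.date(2024, 1, 9)],
        "NVDA": [522.53, 531.40],
    }
)
fixed = convert_date_columns(frame)
print(fixed.dtypes)
# The converted frame can then be passed to cudf.from_pandas(fixed).
```

Working on a copy leaves the original pandas frame unchanged, which helps if other code still expects datetime.date values.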

shwina commented 7 months ago

Thanks for reporting and for the thorough analysis! This issue stems from the fact that pandas uses the object datatype for storing datetime.date values:

In [1]: import pandas as pd

In [2]: import datetime

In [3]: s = pd.Series([datetime.date.fromisoformat("2001-01-01"), datetime.date.fromisoformat("2001-01-02")])

In [4]: s
Out[4]:
0    2001-01-01
1    2001-01-02
dtype: object

Now, a pandas Series of object data type represents a collection of arbitrary Python objects (possibly of differing types). When cudf sees such a Series, it tries to interpret the values as strings, lists, or dictionaries (data types that we support). If none of those work, we throw this error.

I think it would be less confusing if we threw a ValueError here instead of a MixedTypeError, with a more general error message like "could not convert values in column to a supported data type".
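To make the behavior described above concrete, here is a pure-Python illustration of that decision. It is not cuDF's actual implementation: the function name classify_object_values and its return labels are invented for this sketch, and it raises the more general ValueError proposed above.

```python
import datetime


def classify_object_values(values):
    """Return 'string', 'list', or 'dict' if all non-null values share that kind.

    Illustration only: this is NOT cuDF's code, just the decision described above.
    """
    candidates = {"string": str, "list": list, "dict": dict}
    non_null = [value for value in values if value is not None]
    for kind, python_type in candidates.items():
        if all(isinstance(value, python_type) for value in non_null):
            return kind
    # None of the supported interpretations fit, so raise the general error.
    raise ValueError("could not convert values in column to a supported data type")


print(classify_object_values(["a", "b"]))
try:
    classify_object_values([datetime.date(2001, 1, 1), datetime.date(2001, 1, 2)])
except ValueError as error:
    print(error)
```

In this sketch a column of datetime.date values fails even though every value has the same type, which is why a "mixed types" message is misleading for this case.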