ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.52k stars 5.69k forks source link

[Data] Unexpected dtype changes/errors when converting to `pandas.DataFrame` #41545

Open bdewilde opened 10 months ago

bdewilde commented 10 months ago

What happened + What you expected to happen

I recently upgraded from ray v2.3 to v2.6+, and noticed that my data's dtypes are now changing when converting from ray.data.Dataset to pd.DataFrame, or no longer work at all as they used to. I'm not sure if this is a bug or new "expected" behavior -- it's just unexpected to me.

For instance... An example dataset from the docs' "loading data from other libraries" section converts one of the columns from a "string" dtype to an "object":

>>> ds = ray.data.from_items(
    [
        {"food": "spam", "price": 9.34},
        {"food": "ham", "price": 5.37},
        {"food": "eggs", "price": 0.94}
    ]
)
>>> ds
MaterializedDataset(
   num_blocks=3,
   num_rows=3,
   schema={food: string, price: double}
)
>>> df = ds.to_pandas()
>>> df
   food  price
0  spam   9.34
1   ham   5.37
2  eggs   0.94
>>> df.dtypes
food      object
price    float64
dtype: object

These dtypes behave very differently, and it isn't always convenient to manually cast them back into the correct dtypes when, say, you're using Dataset.map_batches(..., batch_format="pandas"). I would've expected the dtypes to stay consistent, which (afaict) both pandas and pyarrow documentations suggest should be the case.

Versions / Dependencies

PY3.9.18
macOS 14.1.1
ray == 2.7.1
pandas == 2.1.3
pyarrow == 14.0.1

Reproduction script

Included above in the what happened / what expected box is one example. Here's another, with different behavior:

>>> import pandas as pd
>>> import ray.data
>>> df_orig = pd.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3]}).astype({"foo": "string", "bar": "int64"})
>>> df_orig.dtypes
foo    string[python]
bar             int64
dtype: object
>>> ds = ray.data.from_pandas(df_orig)
2023-12-01 09:33:13,601 INFO worker.py:1642 -- Started a local Ray instance.
2023-12-01 09:33:14,556 ERROR dataset.py:5285 -- Error converting dtype string to Arrow.
Traceback (most recent call last):
  File "/Users/burtondewilde/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/dataset.py", line 5281, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 5138, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'string[python]' as a data type
>>> df_new = ds.to_pandas()
>>> df_new.dtypes
foo    string[python]
bar             int64
dtype: object

In this case, going from pandas to ray and back, the string dtype seems to be preserved, but a TypeError is raised in the process. Confusing!

Issue Severity

Low: It annoys or frustrates me.

anyscalesam commented 10 months ago

Functionally looks OK; error message looks safe to ignore for now. We'll dig deeper into the root cause cc @scottjlee

bdewilde commented 10 months ago

I've run into another example with the same env setup, sharing here just for reference:

>>> import pandas as pd
>>> import ray.data
>>> foo_df = pd.DataFrame(data=pd.date_range("2023-12-01T00:00:00", "2023-12-02T00:00:00", freq="1H", tz="UTC"), columns=["dttm"])
>>> foo_df.dtypes
dttm    datetime64[ns, UTC]
dtype: object
>>> foo_ds = ray.data.from_pandas(foo_df)
>>> foo_ds
MaterializedDataset(
   num_blocks=1,
   num_rows=25,
   schema={dttm: datetime64[ns, UTC]}
)
>>> foo_ds.schema()
2023-12-07 18:01:52,585 ERROR dataset.py:5285 -- Error converting dtype datetime64[ns, UTC] to Arrow.
Traceback (most recent call last):
  File "/Users/burtondewilde/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/dataset.py", line 5281, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 5138, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'datetime64[ns, UTC]' as a data type

Column  Type
------  ----
dttm    None
>>> foo_ds.to_pandas().dtypes
dttm    datetime64[ns, UTC]
dtype: object