Open bdewilde opened 10 months ago
Functionally looks OK; error message looks safe to ignore for now. We'll dig deeper into the root cause cc @scottjlee
I've run into another example with the same env setup, sharing here just for reference:
>>> import pandas as pd
>>> import ray.data
>>> foo_df = pd.DataFrame(data=pd.date_range("2023-12-01T00:00:00", "2023-12-02T00:00:00", freq="1H", tz="UTC"), columns=["dttm"])
>>> foo_df.dtypes
dttm datetime64[ns, UTC]
dtype: object
>>> foo_ds = ray.data.from_pandas(foo_df)
>>> foo_ds
MaterializedDataset(
num_blocks=1,
num_rows=25,
schema={dttm: datetime64[ns, UTC]}
)
>>> foo_ds.schema()
2023-12-07 18:01:52,585 ERROR dataset.py:5285 -- Error converting dtype datetime64[ns, UTC] to Arrow.
Traceback (most recent call last):
  File "/Users/burtondewilde/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/dataset.py", line 5281, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 5138, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'datetime64[ns, UTC]' as a data type
Column Type
------ ----
dttm None
>>> foo_ds.to_pandas().dtypes
dttm datetime64[ns, UTC]
dtype: object
What happened + What you expected to happen
I recently upgraded from ray v2.3 to v2.6+, and noticed that my data's dtypes are now changing when converting from ray.data.Dataset to pd.DataFrame, or no longer work at all as they used to. I'm not sure if this is a bug or new "expected" behavior -- it's just unexpected to me.
For instance... An example dataset from the docs' "loading data from other libraries" section converts one of the columns from a "string" dtype to an "object":
These dtypes behave very differently, and it isn't always convenient to manually cast them back into the correct dtypes when, say, you're using Dataset.map_batches(..., batch_format="pandas"). I would've expected the dtypes to stay consistent, which (afaict) both the pandas and pyarrow documentation suggest should be the case.
Versions / Dependencies
Reproduction script
Included above in the what happened / what expected box is one example. Here's another, with different behavior:
In this case, going from pandas to ray and back, the string dtype seems to be preserved, but a TypeError is raised in the process. Confusing!
Issue Severity
Low: It annoys or frustrates me.
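To make the "manually cast them back" workaround from the description concrete, here is a minimal sketch of it using pandas alone. `restore_dtypes` is a hypothetical helper written for this example, not part of ray or pandas. The string-to-object change is simulated with `astype` rather than produced by an actual ray round trip.

```python
import pandas as pd


def restore_dtypes(batch: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast columns back to the dtypes recorded before the round trip.

    Hypothetical helper: columns missing from the batch are skipped.
    """
    present = {col: dtype for col, dtype in dtypes.items() if col in batch.columns}
    return batch.astype(present)


original = pd.DataFrame(
    {"name": pd.array(["a", "b", None], dtype="string"), "n": [1, 2, 3]}
)
expected = original.dtypes.to_dict()  # record dtypes before handing data off

# Simulate the reported change: the "string" column comes back as "object".
degraded = original.astype({"name": "object"})

restored = restore_dtypes(degraded, expected)
print(restored.dtypes)
```

The same helper could be called at the top of a function passed to `Dataset.map_batches(..., batch_format="pandas")`, at the cost of one extra cast per batch, which is the inconvenience the description is about.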