ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] Error converting dtype category to Arrow #41974

Open Taurus-Le opened 9 months ago

Taurus-Le commented 9 months ago

What happened + What you expected to happen

  1. I'm running the Categorizer example from the official docs, but I get an exception indicating that CategoricalDtype cannot be interpreted as a data type.
  2. Expected behaviour: The types should be printed out in console.
    [CategoricalDtype(categories=['female', 'male'], ordered=False),
    CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)]
  3. Here are the logs:

    D:\Work\Python\RayDemo3.8\venv\Scripts\python.exe D:\Work\Python\RayDemo3.8\t.py
    2023-12-18 08:42:21,357 INFO worker.py:1664 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
    2023-12-18 08:42:23,285 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(get_pd_value_counts)]
    2023-12-18 08:42:23,285 INFO streaming_executor.py:105 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
    2023-12-18 08:42:23,285 INFO streaming_executor.py:107 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
    2023-12-18 08:42:23,331 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(Categorizer._transform_pandas)] -> LimitOperator[limit=1]
    2023-12-18 08:42:23,332 INFO streaming_executor.py:105 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
    2023-12-18 08:42:23,332 INFO streaming_executor.py:107 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
    2023-12-18 08:42:23,398 ERROR dataset.py:5034 -- Error converting dtype category to Arrow.
    Traceback (most recent call last):
      File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\data\dataset.py", line 5030, in types
        arrow_types.append(pa.from_numpy_dtype(dtype))
      File "pyarrow\types.pxi", line 4909, in pyarrow.lib.from_numpy_dtype
    TypeError: Cannot interpret 'CategoricalDtype(categories=['female', 'male'], ordered=False)' as a data type
    2023-12-18 08:42:23,399 ERROR dataset.py:5034 -- Error converting dtype category to Arrow.
    Traceback (most recent call last):
      File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\data\dataset.py", line 5030, in types
        arrow_types.append(pa.from_numpy_dtype(dtype))
      File "pyarrow\types.pxi", line 4909, in pyarrow.lib.from_numpy_dtype
    TypeError: Cannot interpret 'CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)' as a data type
    
    Process finished with exit code 0

Versions / Dependencies

Package Version
aiohttp 3.9.1
aiohttp-cors 0.7.0
aiorwlock 1.3.0
aiosignal 1.3.1
ansicon 1.89.0
anyio 3.7.1
arrow 1.3.0
async-timeout 4.0.3
attrs 23.1.0
backoff 2.2.1
blessed 1.20.0
cachetools 5.3.2
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorama 0.4.6
colorful 0.5.5
dask 2023.5.0
Deprecated 1.2.14
distlib 0.3.7
dm-tree 0.1.8
exceptiongroup 1.2.0
Farama-Notifications 0.0.4
fastapi 0.104.1
filelock 3.13.1
frozenlist 1.4.0
fsspec 2023.10.0
google-api-core 2.14.0
google-auth 2.23.4
googleapis-common-protos 1.61.0
gpustat 1.1.1
grpcio 1.59.3
gymnasium 0.28.1
h11 0.14.0
httptools 0.6.1
idna 3.6
imageio 2.33.0
importlib-metadata 6.8.0
importlib-resources 6.1.1
jax-jumpy 1.0.0
Jinja2 3.1.2
jinxed 1.2.0
joblib 1.3.2
jsonschema 4.20.0
jsonschema-specifications 2023.11.1
lazy_loader 0.3
locket 1.0.0
lz4 4.3.2
markdown-it-py 3.0.0
MarkupSafe 2.1.3
mdurl 0.1.2
modin 0.23.1.post0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
mysql-connector-python 8.0.31
netifaces 0.11.0
networkx 3.1
numpy 1.24.4
nvidia-ml-py 12.535.133
opencensus 0.11.3
opencensus-context 0.1.3
opentelemetry-api 1.21.0
opentelemetry-exporter-otlp 1.21.0
opentelemetry-exporter-otlp-proto-common 1.21.0
opentelemetry-exporter-otlp-proto-grpc 1.21.0
opentelemetry-exporter-otlp-proto-http 1.21.0
opentelemetry-proto 1.21.0
opentelemetry-sdk 1.21.0
opentelemetry-semantic-conventions 0.42b0
packaging 23.2
pandas 2.0.3
partd 1.4.1
Pillow 10.1.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.11.0
prometheus-client 0.19.0
protobuf 3.19.6
psutil 5.9.6
py-spy 0.3.14
py4j 0.10.9.5
pyarrow 14.0.1
pyasn1 0.5.1
pyasn1-modules 0.3.0
pydantic 1.10.13
Pygments 2.17.2
pyspark 3.3.2
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2023.3.post1
PyWavelets 1.4.1
PyYAML 6.0.1
ray 2.8.0
ray-cpp 2.8.0
raydp 1.6.0
referencing 0.31.1
requests 2.31.0
rich 13.7.0
rpds-py 0.13.2
rsa 4.9
scikit-image 0.21.0
scikit-learn 1.1.3
scipy 1.10.1
setuptools 68.2.0
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
threadpoolctl 3.2.0
tifffile 2023.7.10
toolz 0.12.0
torch 2.1.1
tqdm 4.66.1
typer 0.9.0
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
tzdata 2023.3
urllib3 2.1.0
uvicorn 0.24.0.post1
virtualenv 20.21.0
watchfiles 0.21.0
wcwidth 0.2.12
websockets 12.0
wheel 0.41.2
wrapt 1.16.0
yarl 1.9.3
zipp 3.17.0

Reproduction script

import pandas as pd
import ray
from ray.data.preprocessors import Categorizer

df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "level": ["L4", "L5", "L3", "L4"],
    }
)
ds = ray.data.from_pandas(df)
categorizer = Categorizer(columns=["sex", "level"])
print(categorizer.fit_transform(ds).schema().types)

Issue Severity

Low: It annoys or frustrates me.

kylebeni commented 2 months ago

I'm also facing this error on Databricks Runtime 14.3 LTS ML. Pretty frustrating since these are the official docs and a basic example doesn't work. Using ray 2.31.0.

!pip install ray

import pandas as pd
import ray
from ray.data.preprocessors import Categorizer
df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "level": ["L4", "L5", "L3", "L4"],
    }
)
ds = ray.data.from_pandas(df)
categorizer = Categorizer(columns=["sex", "level"])
categorizer.fit_transform(ds).schema().types

2024-07-08 19:39:52,260 INFO worker.py:1771 -- Started a local Ray instance.
2024-07-08 19:39:54,591 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /local_disk0/tmp/ray/session_2024-07-08_19-39-50_338959_26580/logs/ray-data
2024-07-08 19:39:54,592 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(get_pd_value_counts)]

2024-07-08 19:39:54,666 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /local_disk0/tmp/ray/session_2024-07-08_19-39-50_338959_26580/logs/ray-data
2024-07-08 19:39:54,667 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Categorizer] -> LimitOperator[limit=1]

2024-07-08 19:39:54,729 ERROR dataset.py:5042 -- Error converting dtype category to Arrow.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-ae44d647-2f80-4aa0-ae4b-d2345a655f23/lib/python3.10/site-packages/ray/data/dataset.py", line 5038, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 3243, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'CategoricalDtype(categories=['female', 'male'], ordered=False)' as a data type
2024-07-08 19:39:54,731 ERROR dataset.py:5042 -- Error converting dtype category to Arrow.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-ae44d647-2f80-4aa0-ae4b-d2345a655f23/lib/python3.10/site-packages/ray/data/dataset.py", line 5038, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 3243, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)' as a data type
[None, None]

dvmorris commented 1 month ago

Did you make any progress on this? I'm experiencing the same issue.

kylebeni commented 1 month ago

No, I gave up. @anyscalesam @raulchen any thoughts?