Open Taurus-Le opened 10 months ago
I'm also facing this error on Databricks Runtime 14.3 LTS ML. Pretty frustrating since these are the official docs and a basic example doesn't work. Using ray 2.31.0.
!pip install ray
import pandas as pd
import ray
from ray.data.preprocessors import Categorizer
df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "level": ["L4", "L5", "L3", "L4"],
    }
)
ds = ray.data.from_pandas(df)
categorizer = Categorizer(columns=["sex", "level"])
categorizer.fit_transform(ds).schema().types
2024-07-08 19:39:52,260 INFO worker.py:1771 -- Started a local Ray instance.
2024-07-08 19:39:54,591 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /local_disk0/tmp/ray/session_2024-07-08_19-39-50_338959_26580/logs/ray-data
2024-07-08 19:39:54,592 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(get_pd_value_counts)]
2024-07-08 19:39:54,666 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /local_disk0/tmp/ray/session_2024-07-08_19-39-50_338959_26580/logs/ray-data
2024-07-08 19:39:54,667 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Categorizer] -> LimitOperator[limit=1]
2024-07-08 19:39:54,729 ERROR dataset.py:5042 -- Error converting dtype category to Arrow.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-ae44d647-2f80-4aa0-ae4b-d2345a655f23/lib/python3.10/site-packages/ray/data/dataset.py", line 5038, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 3243, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'CategoricalDtype(categories=['female', 'male'], ordered=False)' as a data type
2024-07-08 19:39:54,731 ERROR dataset.py:5042 -- Error converting dtype category to Arrow.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-ae44d647-2f80-4aa0-ae4b-d2345a655f23/lib/python3.10/site-packages/ray/data/dataset.py", line 5038, in types
    arrow_types.append(pa.from_numpy_dtype(dtype))
  File "pyarrow/types.pxi", line 3243, in pyarrow.lib.from_numpy_dtype
TypeError: Cannot interpret 'CategoricalDtype(categories=['L3', 'L4', 'L5'], ordered=False)' as a data type
[None, None]
Did you make any progress on this? I'm experiencing the same issue.
No, I gave up. @anyscalesam @raulchen any thoughts?
What happened + What you expected to happen
CategoricalDtype cannot be interpreted as a data type. Here are the logs:
Versions / Dependencies
Package Version
aiohttp 3.9.1
aiohttp-cors 0.7.0
aiorwlock 1.3.0
aiosignal 1.3.1
ansicon 1.89.0
anyio 3.7.1
arrow 1.3.0
async-timeout 4.0.3
attrs 23.1.0
backoff 2.2.1
blessed 1.20.0
cachetools 5.3.2
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorama 0.4.6
colorful 0.5.5
dask 2023.5.0
Deprecated 1.2.14
distlib 0.3.7
dm-tree 0.1.8
exceptiongroup 1.2.0
Farama-Notifications 0.0.4
fastapi 0.104.1
filelock 3.13.1
frozenlist 1.4.0
fsspec 2023.10.0
google-api-core 2.14.0
google-auth 2.23.4
googleapis-common-protos 1.61.0
gpustat 1.1.1
grpcio 1.59.3
gymnasium 0.28.1
h11 0.14.0
httptools 0.6.1
idna 3.6
imageio 2.33.0
importlib-metadata 6.8.0
importlib-resources 6.1.1
jax-jumpy 1.0.0
Jinja2 3.1.2
jinxed 1.2.0
joblib 1.3.2
jsonschema 4.20.0
jsonschema-specifications 2023.11.1
lazy_loader 0.3
locket 1.0.0
lz4 4.3.2
markdown-it-py 3.0.0
MarkupSafe 2.1.3
mdurl 0.1.2
modin 0.23.1.post0
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
mysql-connector-python 8.0.31
netifaces 0.11.0
networkx 3.1
numpy 1.24.4
nvidia-ml-py 12.535.133
opencensus 0.11.3
opencensus-context 0.1.3
opentelemetry-api 1.21.0
opentelemetry-exporter-otlp 1.21.0
opentelemetry-exporter-otlp-proto-common 1.21.0
opentelemetry-exporter-otlp-proto-grpc 1.21.0
opentelemetry-exporter-otlp-proto-http 1.21.0
opentelemetry-proto 1.21.0
opentelemetry-sdk 1.21.0
opentelemetry-semantic-conventions 0.42b0
packaging 23.2
pandas 2.0.3
partd 1.4.1
Pillow 10.1.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.11.0
prometheus-client 0.19.0
protobuf 3.19.6
psutil 5.9.6
py-spy 0.3.14
py4j 0.10.9.5
pyarrow 14.0.1
pyasn1 0.5.1
pyasn1-modules 0.3.0
pydantic 1.10.13
Pygments 2.17.2
pyspark 3.3.2
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2023.3.post1
PyWavelets 1.4.1
PyYAML 6.0.1
ray 2.8.0
ray-cpp 2.8.0
raydp 1.6.0
referencing 0.31.1
requests 2.31.0
rich 13.7.0
rpds-py 0.13.2
rsa 4.9
scikit-image 0.21.0
scikit-learn 1.1.3
scipy 1.10.1
setuptools 68.2.0
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
threadpoolctl 3.2.0
tifffile 2023.7.10
toolz 0.12.0
torch 2.1.1
tqdm 4.66.1
typer 0.9.0
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
tzdata 2023.3
urllib3 2.1.0
uvicorn 0.24.0.post1
virtualenv 20.21.0
watchfiles 0.21.0
wcwidth 0.2.12
websockets 12.0
wheel 0.41.2
wrapt 1.16.0
yarl 1.9.3
zipp 3.17.0
Reproduction script
Issue Severity
Low: It annoys or frustrates me.