ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Cannot use the tokenizer for ChatGLM3-6b in Ray #42731

Open lixiaojun2914 opened 9 months ago

lixiaojun2914 commented 9 months ago

What happened + What you expected to happen

When I use the ChatGLM3 tokenizer in Ray's data processing, it throws an error: ModuleNotFoundError: No module named 'transformers_modules'

Versions / Dependencies

accelerate==0.25.0 aiobotocore @ file:///croot/aiobotocore_1701291493089/work aiofiles==23.2.1 aiohttp @ file:///croot/aiohttp_1701112538292/work aiohttp-cors==0.7.0 aioitertools @ file:///tmp/build/80754af9/aioitertools_1607109665762/work aiorwlock==1.3.0 aiosignal @ file:///tmp/build/80754af9/aiosignal_1637843061372/work altair==5.2.0 annotated-types==0.6.0 anyio==3.7.1 async-timeout @ file:///croot/async-timeout_1703096998144/work attrs @ file:///croot/attrs_1695717823297/work backoff==2.2.1 bitsandbytes==0.41.3.post2 blessed==1.20.0 botocore @ file:///croot/botocore_1701286451219/work Brotli @ file:///tmp/abs_ecyw11_7ze/croots/recipe/brotli-split_1659616059936/work cachetools==5.3.2 certifi @ file:///croot/certifi_1700501669400/work/certifi cffi @ file:///croot/cffi_1700254295673/work charset-normalizer==3.3.2 click==8.1.7 cloudpickle==3.0.0 colorful==0.5.5 contourpy==1.1.1 cryptography @ file:///croot/cryptography_1702070282333/work cycler==0.12.1 datasets==2.16.1 debugpy==1.8.0 deepspeed==0.12.2 Deprecated==1.2.14 dill==0.3.7 distlib==0.3.7 dm-tree==0.1.8 docstring-parser==0.15 einops==0.7.0 evaluate==0.4.1 exceptiongroup==1.2.0 Farama-Notifications==0.0.4 fastapi==0.104.1 ffmpy==0.3.1 filelock==3.13.1 fonttools==4.46.0 frozenlist @ file:///croot/frozenlist_1698702560391/work fsspec==2023.6.0 google-api-core==2.15.0 google-auth==2.25.2 googleapis-common-protos==1.62.0 gpustat==1.1.1 gradio==3.50.2 gradio_client==0.6.1 grpcio==1.60.0 gymnasium==0.28.1 h11==0.14.0 hjson==3.1.0 httpcore==1.0.2 httptools==0.6.1 httpx==0.25.2 huggingface-hub==0.20.3 idna @ file:///croot/idna_1666125576474/work imageio==2.33.1 importlib-metadata==6.11.0 importlib-resources==6.1.1 jax-jumpy==1.0.0 jieba==0.42.1 Jinja2==3.1.2 jmespath @ file:///croot/jmespath_1700144569655/work joblib==1.3.2 jsonschema==4.20.0 jsonschema-specifications==2023.11.2 kiwisolver==1.4.5 lazy_loader==0.3 lz4==4.3.2 markdown-it-py==3.0.0 MarkupSafe==2.1.3 matplotlib==3.7.4 mdurl==0.1.2 mpi4py @ 
file:///croot/mpi4py_1671223370575/work mpmath==1.3.0 msgpack==1.0.7 multidict @ file:///croot/multidict_1701096859099/work multiprocess==0.70.15 networkx==3.1 ninja==1.11.1.1 nltk==3.8.1 numpy==1.24.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.535.133 nvidia-nccl-cu12==2.18.1 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 opencensus==0.11.3 opencensus-context==0.1.3 opentelemetry-api==1.21.0 opentelemetry-exporter-otlp==1.21.0 opentelemetry-exporter-otlp-proto-common==1.21.0 opentelemetry-exporter-otlp-proto-grpc==1.21.0 opentelemetry-exporter-otlp-proto-http==1.21.0 opentelemetry-proto==1.21.0 opentelemetry-sdk==1.21.0 opentelemetry-semantic-conventions==0.42b0 orjson==3.9.10 packaging==23.2 pandas==2.0.3 peft==0.6.0 Pillow==10.1.0 pkgutil_resolve_name==1.3.10 platformdirs==3.11.0 prometheus-client==0.19.0 protobuf==3.20.3 psutil==5.9.6 py-cpuinfo==9.0.0 py-spy==0.3.14 pyarrow==14.0.1 pyarrow-hotfix==0.6 pyasn1==0.5.1 pyasn1-modules==0.3.0 pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work pydantic==1.10.13 pydantic_core==2.14.5 pydub==0.25.1 Pygments==2.17.2 pynvml==11.5.0 pyOpenSSL @ file:///croot/pyopenssl_1690223430423/work pyparsing==3.1.1 PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work python-dotenv==1.0.0 python-multipart==0.0.6 python-snappy==0.6.1 pytz==2023.3.post1 PyWavelets==1.4.1 PyYAML==6.0.1 ray==2.8.1 ray-cpp==2.8.1 referencing==0.32.0 regex==2023.10.3 requests==2.31.0 responses==0.18.0 rich==13.7.0 rouge-chinese==1.0.3 rpds-py==0.13.2 rsa==4.9 s3fs @ file:///croot/s3fs_1701294169021/work safetensors==0.4.1 scikit-image==0.21.0 scipy==1.10.1 semantic-version==2.10.0 
sentencepiece==0.1.99 shtab==1.6.5 six @ file:///tmp/build/80754af9/six_1644875935023/work smart-open==6.4.0 sniffio==1.3.0 sse-starlette==1.8.2 starlette==0.27.0 sympy==1.12 tensorboardX==2.6.2.2 tifffile==2023.7.10 tiktoken==0.5.2 tokenizers==0.15.1 toolz==0.12.0 torch==2.1.0 tqdm==4.66.1 transformers==4.37.0 transformers-stream-generator==0.0.4 triton==2.1.0 trl==0.7.4 typer==0.9.0 typing_extensions==4.8.0 tyro==0.6.0 tzdata==2023.3 urllib3 @ file:///croot/urllib3_1698257533958/work uvicorn==0.24.0.post1 uvloop==0.19.0 virtualenv==20.21.0 watchfiles==0.21.0 wcwidth==0.2.12 websockets==11.0.3 wrapt @ file:///tmp/abs_c335821b-6e43-4504-9816-b1a52d3d3e1eel6uae8l/croots/recipe/wrapt_1657814400492/work xxhash==3.4.1 yarl @ file:///croot/yarl_1701105127787/work zipp==3.17.0

Reproduction script

import ray
import datasets
import ray.data
from ray.data import Dataset
from transformers import AutoTokenizer

ray.init()

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

data = ray.data.read_csv("test.csv")

def test_ray(batch, tokenizer):
    # map_batches passes a batch of rows, not a Dataset; the tokenizer
    # arrives via fn_args after the batch argument.
    t = tokenizer.encode("test")
    print(tokenizer.decode(t))
    return batch

data = data.map_batches(test_ray, fn_args=[tokenizer])
print(data.take(1))

Issue Severity

High: It blocks me from completing my task.
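For context (not stated in the thread): ChatGLM3's trust_remote_code tokenizer class lives in a dynamically generated transformers_modules package that exists only on the driver, so when Ray pickles the tokenizer and ships it to a worker, unpickling fails because the worker cannot import that package. A minimal, self-contained sketch of this failure mode, using a hypothetical stand-in module name (no Ray or transformers needed):

```python
import pickle
import sys
import types

# Simulate a dynamically generated module, like the 'transformers_modules'
# package that transformers creates for trust_remote_code models.
mod = types.ModuleType("transformers_modules_demo")
exec("class Tok:\n    pass", mod.__dict__)
sys.modules["transformers_modules_demo"] = mod

# Pickling succeeds on the "driver", where the module is importable.
blob = pickle.dumps(mod.Tok())

# A Ray worker is a fresh process without that module; deleting it here
# simulates deserialization on such a worker.
del sys.modules["transformers_modules_demo"]
try:
    pickle.loads(blob)
except ModuleNotFoundError as e:
    print(e)  # No module named 'transformers_modules_demo'
```

Pickle stores instances by module and class name, not by value, which is why the traceback below surfaces inside Ray's deserialization path rather than in user code.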

lixiaojun2914 commented 9 months ago

Reproduction script

import ray
import datasets
import ray.data
from ray.data import Dataset
from transformers import AutoTokenizer

ray.init()

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

data = ray.data.read_csv("test.csv")

def test_ray(batch):
    # map_batches passes a batch of rows, not a Dataset; the tokenizer
    # is captured from the driver's scope.
    t = tokenizer.encode("test")
    print(tokenizer.decode(t))
    return batch

data = data.map_batches(test_ray)
print(data.take(1))

JarHMJ commented 7 months ago

mark

gujingit commented 6 months ago

same error

(RayWorkerWrapper pid=577, ip=192.168.10.55) No module named 'transformers_modules'
(RayWorkerWrapper pid=577, ip=192.168.10.55) Traceback (most recent call last):
(RayWorkerWrapper pid=577, ip=192.168.10.55)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 412, in deserialize_objects
(RayWorkerWrapper pid=577, ip=192.168.10.55)     obj = self._deserialize_object(data, metadata, object_ref)
(RayWorkerWrapper pid=577, ip=192.168.10.55)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 271, in _deserialize_object
(RayWorkerWrapper pid=577, ip=192.168.10.55)     return self._deserialize_msgpack_data(data, metadata_fields)
(RayWorkerWrapper pid=577, ip=192.168.10.55)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 226, in _deserialize_msgpack_data
(RayWorkerWrapper pid=577, ip=192.168.10.55)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(RayWorkerWrapper pid=577, ip=192.168.10.55)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 216, in _deserialize_pickle5_data
(RayWorkerWrapper pid=577, ip=192.168.10.55)     obj = pickle.loads(in_band)
(RayWorkerWrapper pid=577, ip=192.168.10.55) ModuleNotFoundError: No module named 'transformers_modules'

LSX-Sneakerprogrammer commented 6 months ago


I have hit the same error. Strangely, I did not change the code, and it worked a day ago. Has anyone solved this problem?
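A workaround worth trying (a sketch, not a confirmed fix from the maintainers): avoid capturing the tokenizer on the driver, and instead pass a callable class to map_batches so the tokenizer is constructed inside each worker, where trust_remote_code can regenerate the transformers_modules package locally. The DummyTokenizer below is a hypothetical stand-in so the sketch runs without downloading the model; the Ray call in the trailing comment assumes Ray Data's actor-based map_batches API.

```python
class DummyTokenizer:
    """Hypothetical stand-in for the real ChatGLM3 tokenizer."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


class TokenizeBatch:
    def __init__(self):
        # In the real fix, load the tokenizer HERE: __init__ runs on the
        # worker process, where trust_remote_code generates the
        # 'transformers_modules' package locally, e.g.:
        #   from transformers import AutoTokenizer
        #   self.tokenizer = AutoTokenizer.from_pretrained(
        #       "THUDM/chatglm3-6b", trust_remote_code=True)
        self.tokenizer = DummyTokenizer()

    def __call__(self, batch):
        batch["tokens"] = [self.tokenizer.encode(t) for t in batch["text"]]
        return batch


# With Ray Data, passing the class (not an instance) makes Ray construct
# one instance per worker actor instead of pickling a driver-side object:
#   data = data.map_batches(
#       TokenizeBatch,
#       compute=ray.data.ActorPoolStrategy(size=2),
#   )
if __name__ == "__main__":
    fn = TokenizeBatch()
    print(fn({"text": ["hi"]})["tokens"])  # [[104, 105]]
```

Because nothing defined in transformers_modules crosses the process boundary, the pickling step that fails in the tracebacks above never happens.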