ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0
1.2k stars 87 forks

Error when `serve run` #144

Open andakai opened 3 months ago

andakai commented 3 months ago

I built the image and ran the container:

docker run -d -it --gpus all --shm-size 1g -p 8000:8000 -e HF_HOME=~/data -v $cache_dir:/home/ray/data anyscale/ray-llm:latest

But in the container, when I run the command:

serve run ~/serve_configs/amazon--LightGPT.yaml

The error is:

2024-03-25 02:40:33,460 INFO scripts.py:411 -- Running config file: '/home/ray/serve_configs/amazon--LightGPT.yaml'.
2024-03-25 02:40:35,709 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 1073741824 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-03-25 02:40:36,866 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(ServeController pid=22583) WARNING 2024-03-25 02:40:39,253 controller 22583 logging_utils.py:247 - 'RAY_SERVE_ENABLE_JSON_LOGGING' is deprecated, please use 'LoggingConfig' to enable json format.
(ProxyActor pid=22664) WARNING 2024-03-25 02:40:40,795 proxy 172.17.0.2 logging_utils.py:247 - 'RAY_SERVE_ENABLE_JSON_LOGGING' is deprecated, please use 'LoggingConfig' to enable json format.
(ProxyActor pid=22664) INFO 2024-03-25 02:40:40,795 proxy 172.17.0.2 proxy.py:1141 - Proxy actor 28469263dc5907e200fa9fe201000000 starting on node 87617dc5750c5f36331a1ea5935a849259fee4d4a42262c695d9e0ca.
(ProxyActor pid=22664) INFO 2024-03-25 02:40:40,801 proxy 172.17.0.2 proxy.py:1346 - Starting HTTP server on node: 87617dc5750c5f36331a1ea5935a849259fee4d4a42262c695d9e0ca listening on port 8000
(ProxyActor pid=22664) INFO:     Started server process [22664]
(ProxyActor pid=22664) WARNING 2024-03-25 02:40:40,824 proxy 172.17.0.2 logging_utils.py:247 - 'RAY_SERVE_ENABLE_JSON_LOGGING' is deprecated, please use 'LoggingConfig' to enable json format.
2024-03-25 02:40:40,848 SUCC scripts.py:480 -- Submitted deploy config successfully.
(ServeController pid=22583) INFO 2024-03-25 02:40:40,841 controller 22583 application_state.py:414 - Building application 'ray-llm'.
(build_serve_application pid=21125) There was a problem when trying to write in your cache folder (/home/adk/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(ServeController pid=22583) WARNING 2024-03-25 02:40:48,038 controller 22583 application_state.py:742 - Deploying app 'ray-llm' failed with exception:
(ServeController pid=22583) Traceback (most recent call last):
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/_private/application_state.py", line 994, in build_serve_application
(ServeController pid=22583)     app = call_app_builder_with_args_if_necessary(import_attr(import_path), args)
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/utils.py", line 1182, in import_attr
(ServeController pid=22583)     module = importlib.import_module(module_name)
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/importlib/__init__.py", line 127, in import_module
(ServeController pid=22583)     return _bootstrap._gcd_import(name[level:], package, level)
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
(ServeController pid=22583)   File "<frozen importlib._bootstrap_external>", line 850, in exec_module
(ServeController pid=22583)   File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/__init__.py", line 1, in <module>
(ServeController pid=22583)     from rayllm.backend.observability.tracing import setup_tracing
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/backend/__init__.py", line 1, in <module>
(ServeController pid=22583)     from rayllm.backend.server.run import router_application
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/backend/server/run.py", line 10, in <module>
(ServeController pid=22583)     from rayllm.backend.llm.vllm.vllm_engine import VLLMEngine
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/backend/llm/vllm/vllm_engine.py", line 15, in <module>
(ServeController pid=22583)     from rayllm.backend.llm.vllm.vllm_compatibility import AviaryAsyncLLMEngine
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/rayllm/backend/llm/vllm/vllm_compatibility.py", line 31, in <module>
(ServeController pid=22583)     init_hf_modules()
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 52, in init_hf_modules
(ServeController pid=22583)     os.makedirs(HF_MODULES_CACHE, exist_ok=True)
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=22583)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 215, in makedirs
(ServeController pid=22583)     makedirs(head, exist_ok=exist_ok)
(ServeController pid=22583)   File "/home/ray/anaconda3/lib/python3.9/os.py", line 225, in makedirs
(ServeController pid=22583)     mkdir(name, mode)
(ServeController pid=22583) PermissionError: [Errno 13] Permission denied: '/home/adk'
(ServeController pid=22583) 
(build_serve_application pid=21125) [8b5cff370a16:21125] [[48252,1],0] ORTE_ERROR_LOG: Unreachable in file runtime/ompi_mpi_finalize.c at line 262

kunalchamoli commented 3 months ago

Hello @darrenglow, I faced the same issue. Just change ~/data to /data in the docker run command and the issue will be resolved.
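
For anyone hitting the same traceback: the log shows the cache path as /home/adk/data/hub and the failure at '/home/adk', so the ~ in HF_HOME=~/data appears to have been expanded by the host shell to a home directory that does not exist inside the container. Below is a minimal sketch, not part of RayLLM, for checking inside the container whether the directory named by HF_HOME can be created before running serve run. The helper names first_existing_ancestor and can_create are hypothetical, and the fallback path assumes the usual Hugging Face default of ~/.cache/huggingface.

```python
import os
from pathlib import Path


def first_existing_ancestor(path: str) -> Path:
    """Return the deepest ancestor of `path` (or `path` itself) that exists."""
    p = Path(path).expanduser().resolve()
    # Walk upward until we reach something that is actually on disk;
    # the filesystem root always exists, so the loop terminates.
    while not p.exists():
        p = p.parent
    return p


def can_create(path: str) -> bool:
    """True if `path` is a writable directory or could be created by os.makedirs."""
    anchor = first_existing_ancestor(path)
    # os.makedirs needs write and search permission on the deepest existing
    # directory; in the issue above that directory is /home, which the
    # container user apparently cannot write to.
    return anchor.is_dir() and os.access(anchor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Assumption: fall back to the usual default when HF_HOME is unset.
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    print(f"HF_HOME={hf_home!r} usable: {can_create(hf_home)}")
```

If this reports that the path is not usable, pointing HF_HOME at a path that exists and is writable inside the container, as suggested above, should avoid the PermissionError.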