ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

ray failed to register worker when I used vllm #39618

Open: Amanda-Barbara opened this issue 1 year ago

Amanda-Barbara commented 1 year ago

What happened + What you expected to happen

the error log of terminal:

[2023-09-13 02:28:19,539 E 119771 119771] core_worker.cc:201: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory,

the background log of ray raylet:

[2023-09-13 02:59:15,635 I 123317 123317] (raylet) worker_pool.cc:489: Started worker process with pid 123567, the token is 126
[2023-09-13 02:59:15,637 I 123317 123317] (raylet) worker_pool.cc:489: Started worker process with pid 123568, the token is 127
[2023-09-13 02:59:15,637 W 123317 123317] (raylet) client_connection.cc:528: [worker]ProcessMessage with type RegisterClientRequest took 6982 ms.
[2023-09-13 02:59:18,671 W 123317 123322] (raylet) metric_exporter.cc:212: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-09-13 02:59:38,633 W 123317 123317] (raylet) agent_manager.cc:115: Agent process expected id 424238335 timed out before registering. ip , id 0
[2023-09-13 02:59:38,818 I 123317 123377] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip . id 0
[2023-09-13 02:59:38,818 E 123317 123377] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2023-09-13 02:59:38,818 I 123317 123317] (raylet) main.cc:334: Raylet received SIGTERM, shutting down...
[2023-09-13 02:59:38,818 I 123317 123317] (raylet) accessor.cc:435: Unregistering node info, node id = ddc4ed607a52c8df9d960c04aa6765a66b43cc3fd5e078768e19957b
[2023-09-13 02:59:38,819 I 123317 123317] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
[2023-09-13 02:59:38,888 I 123317 123317] (raylet) stats.h:128: Stats module has shutdown.
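
The raylet log above already names the two checks that matter. A minimal diagnostic sketch following those hints (the log path assumes the default Ray session directory under /tmp/ray):

# verify the installed grpcio version against Ray's requirement
pip freeze | grep grpcio

# read the dashboard agent log for the actual failure reason
cat /tmp/ray/session_latest/logs/dashboard_agent.log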

Versions / Dependencies

ray 2.6.3, grpcio 1.57.0, grpcio-reflection 1.57.0, grpcio-status 1.57.0, grpcio-tools 1.51.1

Reproduction script

#!/bin/bash

model=/home/user/models/Llama-2-13b-hf
host=127.0.0.1
port=5679
tokenizer=$model
tensor_parallel_size=8
gpu_memory_utilization=0.9
swap_space=16

echo $tensor_parallel_size
python -m vllm.entrypoints.api_server \
        --host=$host \
        --port=$port \
        --model=$model \
        --tokenizer=$tokenizer  \
        --tensor-parallel-size=$tensor_parallel_size \
        --gpu-memory-utilization=$gpu_memory_utilization \
        --swap-space=$swap_space \
        --engine-use-ray
#        --disable-log-requests \
#        --max-num-batched-tokens 8192
#        --disable-log-stats \

Issue Severity

High: It blocks me from completing my task.

Amanda-Barbara commented 1 year ago

@LorrinWWW I have found the cause of the problem: I had not started the Ray cluster runtime. It can be started with commands like these (a quick sanity check is sketched after the commands):

# start the ray runtime
ray start --head --port port_number

# add the node to this ray cluster
ray start --address='10.104.8.83:port_number'

# the host set below is different from the 10.104.8.83 address mentioned above
python -m vllm.entrypoints.api_server \
        --host=$host \
        --port=$port \
        --model=$model \
        --tokenizer=$tokenizer  \
        --tensor-parallel-size=$tensor_parallel_size

# terminate the ray runtime when not in use
ray stop
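
To confirm the cluster is actually up before launching the vllm server, a quick sanity check (run on any node that has already joined the cluster):

# should list the head node plus every worker added with `ray start --address=...`
ray status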

rkooo567 commented 1 year ago

From the logs it looks like the dashboard agent died. Can you give us the log from dashboard_agent.log when this happens?

imperio-wxm commented 1 year ago

I have the same problem with ray==2.5.0 and grpcio==1.48. Any progress?

rkooo567 commented 1 year ago

Can you try with ray 2.7? We removed the grpcio requirement from the dashboard agent, which is highly likely the root cause of this issue.
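
A hedged sketch of that upgrade (the exact version pin is illustrative; the `ray[default]` extra installs the dashboard/agent components):

pip install -U "ray[default]==2.7.0"
pip freeze | grep -E "ray|grpcio"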

rkooo567 commented 1 year ago

I will assign P2 until it is followed up. Please try the following (a command sketch follows the list); I couldn't repro it with ray 2.6.3 + grpcio 1.57.

  1. update the grpcio version (1.48 has been yanked)
  2. share the updated dashboard_agent.log file contents
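
A sketch of those two steps, assuming a pip-managed environment (pick a grpcio version that satisfies your Ray release's pin):

# 1. move off the yanked grpcio 1.48 release
pip install -U grpcio

# 2. reproduce the failure again, then attach the agent log
cat /tmp/ray/session_latest/logs/dashboard_agent.log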

anyscalesam commented 9 months ago

Tagging @Amanda-Barbara.