Open Amanda-Barbara opened 1 year ago
@LorrinWWW I have found the cause of the problem: I had not started the Ray cluster runtime. Just run these commands:
# start the ray runtime
ray start --head --port port_number
# add the node to this ray cluster
ray start --address='10.104.8.83:port_number'
# the host set below is different from the 10.104.8.83 address used above
python -m vllm.entrypoints.api_server \
--host=$host \
--port=$port \
--model=$model \
--tokenizer=$tokenizer \
--tensor-parallel-size=$tensor_parallel_size
# terminate the ray runtime when not in use
ray stop
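A minimal Python sketch of the same idea: check that a Ray runtime is reachable before launching the API server. It assumes that `ray status` exits with a non-zero code when no runtime is running; the helper names `ray_runtime_is_up` and `build_api_server_cmd` are hypothetical and not part of Ray or vLLM. Only the flags from the commands above are used.

```python
import subprocess
import sys


def ray_runtime_is_up(run=subprocess.run):
    """Return True if `ray status` exits with code 0.

    Assumption: a non-zero exit code means no Ray runtime is reachable.
    `run` is injectable so the check can be exercised without Ray installed.
    """
    try:
        result = run(["ray", "status"], capture_output=True, text=True)
    except FileNotFoundError:
        # The `ray` CLI is not on PATH at all.
        return False
    return result.returncode == 0


def build_api_server_cmd(host, port, model, tokenizer, tensor_parallel_size):
    """Build the vllm api_server command line shown above as an argv list."""
    return [
        sys.executable, "-m", "vllm.entrypoints.api_server",
        f"--host={host}",
        f"--port={port}",
        f"--model={model}",
        f"--tokenizer={tokenizer}",
        f"--tensor-parallel-size={tensor_parallel_size}",
    ]
```

A caller could run `subprocess.run(build_api_server_cmd(...), check=True)` only when `ray_runtime_is_up()` returns True, and otherwise print a reminder to run `ray start --head` first.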
From the logs, it looks like the dashboard agent died. Can you share the contents of dashboard_agent.log
from when this happens?
I have the same problem with ray==2.5.0 and grpcio==1.48. Any progress?
Can you try with Ray 2.7? We removed the grpcio requirement from the dashboard agent, which is very likely the root cause of this issue.
I will assign P2 until there is a follow-up. Please try it. I couldn't reproduce it with ray 2.6.3 + grpcio 1.57.
tag @Amanda-Barbara
What happened + What you expected to happen
the error log of terminal:
the background log of ray raylet:
Versions / Dependencies
ray 2.6.3
grpcio 1.57.0
grpcio-reflection 1.57.0
grpcio-status 1.57.0
grpcio-tools 1.51.1
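For anyone comparing environments, a version list like the one above can be collected with the standard library's `importlib.metadata`. This is only a sketch; the function name `collect_versions` is hypothetical, and the package list simply mirrors the packages named in this report.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in this report; extend as needed.
PACKAGES = ["ray", "grpcio", "grpcio-reflection", "grpcio-status", "grpcio-tools"]


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name} {ver}")
```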
Reproduction script
Issue Severity
High: It blocks me from completing my task.