Open bks5881 opened 6 months ago
Any solutions? I ran into the same problem. When I check the logs, the worker node shows this error:
2024-06-11 09:46:55,039 C 510 510] (raylet) node_manager.cc:1028: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
StackTrace Information
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbc2f9a) [0x5643c763af9a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xbc52b1) [0x5643c763d2b1] ray::RayLog::~RayLog()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x2fdaf1) [0x5643c6d75af1] ray::raylet::NodeManager::NodeRemoved()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x4d0424) [0x5643c6f48424] ray::gcs::NodeInfoAccessor::HandleNotification()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x5ed47c) [0x5643c706547c] EventTracker::RecordExecution()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x5e905e) [0x5643c706105e] std::_Function_handler<>::_M_invoke()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x5e94d6) [0x5643c70614d6] boost::asio::detail::completion_handler<>::do_complete()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xca51fb) [0x5643c771d1fb] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xca7789) [0x5643c771f789] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0xca7ca2) [0x5643c771fca2] boost::asio::io_context::run()
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x1d6e31) [0x5643c6c4ee31] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f7acf4d1d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f7acf4d1e40] __libc_start_main
/usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet(+0x22d847) [0x5643c6ca5847]
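Since the log says the GCS marked this node as dead, it may help to confirm which nodes Ray currently considers dead. Here is a minimal sketch, assuming it runs on a machine that is already part of the cluster; the helper names dead_nodes and report_dead_nodes are made up for illustration, and it only reports node state, it does not fix the underlying overload:

```python
from typing import Any, Dict, List


def dead_nodes(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the addresses of nodes that are reported as not alive.

    `nodes` is expected to look like the list returned by ray.nodes(),
    where each entry has an "Alive" flag and a "NodeManagerAddress".
    """
    return [
        n.get("NodeManagerAddress", "unknown")
        for n in nodes
        if not n.get("Alive", False)
    ]


def report_dead_nodes() -> List[str]:
    """Connect to the running cluster and list nodes marked as dead."""
    import ray  # imported lazily so dead_nodes() is usable without Ray

    ray.init(address="auto")  # attach to the existing cluster
    try:
        return dead_nodes(ray.nodes())
    finally:
        ray.shutdown()
```

Calling report_dead_nodes() on the head node right after the vLLM launch fails should show whether the worker's raylet was dropped, which would match the GPU count falling from 8 to 4.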
Your current environment
🐛 Describe the bug
I am running vLLM with Ray on two machines, each with 4 A100 79 GB GPUs. I ran ray start --head on the head node and ray start --address on the child node. When I run ray status, I see 8 GPUs. In the next step, when I launch vLLM with tp 8, I get the following error:
When I check ray status again, I see only 4 GPUs. I am not sure why Ray can no longer see all 8 GPUs once I try to launch vLLM, when they were clearly visible before. I use the following command to launch:
/python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-34b-code-instruct --worker-use-ray --tensor-parallel-size 8 --trust-remote-code --port 40023 --host 0.0.0.0 --gpu-memory-utilization .65 --tokenizer ibm-granite/granite-34b-code-instruct --worker-use-ray
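To narrow down when the GPUs disappear, one option is to check the GPU count Ray reports immediately before (and again after) running the launch command. This is a rough sketch, not part of vLLM; gpus_short and check_cluster are hypothetical helper names, and it assumes the script runs on a node that has already joined the cluster:

```python
from typing import Dict


def gpus_short(resources: Dict[str, float], tensor_parallel_size: int) -> int:
    """Return how many GPUs are missing for the requested tensor-parallel size.

    `resources` is expected to look like the dict from ray.cluster_resources(),
    e.g. {"GPU": 8.0, "CPU": 64.0}. Returns 0 when enough GPUs are present.
    """
    available = int(resources.get("GPU", 0))
    return max(0, tensor_parallel_size - available)


def check_cluster(tensor_parallel_size: int) -> int:
    """Connect to the running cluster and report the GPU shortfall."""
    import ray  # imported lazily so gpus_short() is usable without Ray

    ray.init(address="auto")  # attach to the existing cluster
    try:
        return gpus_short(ray.cluster_resources(), tensor_parallel_size)
    finally:
        ray.shutdown()
```

For the setup described here, check_cluster(8) would be expected to return 0 while ray status shows 8 GPUs, and 4 once the worker node has dropped out.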