vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
29.85k stars 4.51k forks

Ray worker out of memory #2669

Open tristan279 opened 9 months ago

tristan279 commented 9 months ago

Trying to spin up a server with an AsyncLLMEngine with 'use_ray' set to true. After a few hours, I get the following error:

Memory on the node (IP: 169.254.181.2, ID: 708c7baf966d59aa3f08299830c349ca055293ebb1c33d8e72cd3336) where the task (actor ID: 0dab4ab45f6c947201afac6d01000000, name=`RayWorkerVllm.__init__`, pid=308, memory used=11.15GB) was running was 12.49GB / 13.15GB (0.950003), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 79a553ea91fe46f95e8384ddf8a8f0a01e3418a975ecd0af983c7bb2) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 169.254.181.2`. To see the logs of the worker, use `ray logs worker-79a553ea91fe46f95e8384ddf8a8f0a01e3418a975ecd0af983c7bb2*out -ip 169.254.181.2`. Top 10 memory users:

... Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.
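The error message above points at two documented Ray environment variables for tuning the OOM killer. A minimal sketch of that workaround, assuming the variables are exported before Ray starts (e.g. before `ray start` or the first `ray.init()` in the vLLM server process); the threshold value here is illustrative, and note this only changes when Ray kills workers, not the underlying memory growth:

```shell
# Raise the kill threshold from the default 0.95 (fraction of node memory).
export RAY_memory_usage_threshold=0.98

# Or disable the memory monitor (and thus worker killing) entirely by
# setting the refresh interval to zero. Use with care: the node can then
# hit the kernel OOM killer instead.
export RAY_memory_monitor_refresh_ms=0

echo "threshold=$RAY_memory_usage_threshold monitor=$RAY_memory_monitor_refresh_ms"
```

If the workers' memory use genuinely grows over hours, this only delays the kill; the other suggestions in the message (more memory per node, fewer concurrent tasks via higher CPU requests, `max_restarts`/`max_task_retries` for retry-on-OOM) address the failure mode more directly.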

... vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

tacacs1101-debug commented 9 months ago

I am also getting the same error.

meichangsu1 commented 8 months ago

same error in 0.2.7 @zhuohan123

A-Posthuman commented 8 months ago

Also hit this sometimes on 0.3.0.

WangxuP commented 4 months ago

+1

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!