Open tristan279 opened 9 months ago
I am also getting the same error.
Same error in 0.2.7 @zhuohan123
Also hitting this occasionally on 0.3.0.
+1
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Trying to spin up a server with an async engine, with `use_ray` set to true. After a few hours, I get the following error:
```
Memory on the node (IP: 169.254.181.2, ID: 708c7baf966d59aa3f08299830c349ca055293ebb1c33d8e72cd3336) where the task (actor ID: 0dab4ab45f6c947201afac6d01000000, name=RayWorkerVllm.__init__, pid=308, memory used=11.15GB) was running was 12.49GB / 13.15GB (0.950003), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 79a553ea91fe46f95e8384ddf8a8f0a01e3418a975ecd0af983c7bb2) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 169.254.181.2`. To see the logs of the worker, use `ray logs worker-79a553ea91fe46f95e8384ddf8a8f0a01e3418a975ecd0af983c7bb2*out -ip 169.254.181.2`. Top 10 memory users:... Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
...
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
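For reference, a minimal sketch of the kind of setup described above, assuming the `AsyncEngineArgs` / `AsyncLLMEngine` API from vLLM 0.2.x/0.3.x (field names may differ in other versions); the model name and flag values below are placeholders, not taken from the report:

```python
# Minimal sketch of an async engine with Ray enabled (assumed vLLM 0.2.x/0.3.x API).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",  # placeholder model, not the one from the report
    worker_use_ray=True,        # run the workers (RayWorkerVllm) as Ray actors
    engine_use_ray=True,        # run the engine loop itself in a Ray actor
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```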
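The Ray message itself points at two environment variables. A sketch of setting them, assuming Ray is started from the same Python process (if the cluster is started with `ray start`, export them in that shell instead); the values are purely illustrative:

```python
import os

# Ray reads these when it starts, so set them before Ray is launched
# (illustrative values, not a recommendation).
os.environ["RAY_memory_usage_threshold"] = "0.99"   # raise the kill threshold above the 0.95 in the message
os.environ["RAY_memory_monitor_refresh_ms"] = "0"   # 0 disables the OOM worker killer entirely

import ray
ray.init()
```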