Desc
On a single server with one Tesla T4 and 32GiB of CPU memory, the server always OOMs while idle. I can't even get a single benchmark run to complete.
Env
Bare metal server
Driver Version: 545.23.06 CUDA Version: 12.3
OS: Ubuntu 22.04.4 LTS x86_64
Host: NUC9VXQNX K47173-406
Kernel: 6.5.0-41-generic
ray 2.34.0
Traceback stack by Python
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/vattention/sarathi-lean/sarathi/entrypoints/openai_server/api_server.py", line 125, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/root/vattention/sarathi-lean/sarathi/engine/async_llm_engine.py", line 274, in from_engine_args
engine = super().from_engine_args(**kwargs)
File "/root/vattention/sarathi-lean/sarathi/engine/llm_engine.py", line 17, in from_engine_args
engine = BaseLLMEngine(*engine_configs)
File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 110, in __init__
self._init_cache()
File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 226, in _init_cache
output_all = self._run_workers(
File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 425, in _run_workers
all_outputs = ray.get(all_outputs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 873, in get_objects
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.16.251.111, ID: 9c2b341f18d9991990f509490a5d70abf52bf71f36aefd5f7669e122) where the task (actor ID: 7a14d0306d676302c532723201000000, name=RayWorker.__init__, pid=90411, memory used=13.79GB) was running was 29.48GB / 31.01GB (0.950528), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.251.111`. To see the logs of the worker, use `ray logs worker-6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80*out -ip 172.16.251.111. Top 10 memory users:
PID MEM(GB) COMMAND
90411 13.79 ray::RayWorker.execute_method
89330 0.36 python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k --model_tensor_...
89353 0.10 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
40794 0.07 /usr/libexec/fwupd/fwupd
89426 0.06 /usr/bin/python /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 ...
89540 0.05 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=...
89425 0.04 /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs...
89566 0.04 ray::IDLE
89568 0.04 ray::IDLE
89570 0.04 ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
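For reference, a minimal sketch of how the thresholds mentioned in that message could be set before Ray starts (the 0.98 value is just an example I picked, not a verified fix):

# Minimal sketch: relax or disable Ray's memory monitor before the cluster starts.
# The environment variables must be set before ray.init() so Ray picks them up.
import os

os.environ["RAY_memory_usage_threshold"] = "0.98"    # default kill threshold is 0.95
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables worker killing entirely

import ray

ray.init()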
More details
Running command: python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k ... (the full invocation is truncated in the Ray memory report above).
The sarathi serve process took over 25GiB of CPU memory on a single Ray worker.
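To confirm where the CPU memory actually goes while the server sits idle, something like the following sketch could be used (psutil and the name filters are my own assumption, not part of sarathi or Ray):

# Hypothetical check: print the resident memory of Ray / sarathi processes.
import psutil

for proc in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
    name = proc.info["name"] or ""
    cmdline = " ".join(proc.info["cmdline"] or [])
    mem = proc.info["memory_info"]
    if mem and ("ray::" in name or "sarathi" in cmdline):
        print(f"{proc.info['pid']:>7} {mem.rss / 1024**3:6.2f} GiB  {cmdline[:80]}")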