microsoft / vattention

Dynamic Memory Management for Serving LLMs without PagedAttention

CPU memory leak? #14

Open JasonHe-WQ opened 3 months ago

JasonHe-WQ commented 3 months ago

Description

On a single server with one Tesla T4 and 32 GiB of CPU memory, the server always runs out of CPU memory while idle. I can't even get a single benchmark to complete.

Environment

Bare metal server
Driver Version: 545.23.06
CUDA Version: 12.3
OS: Ubuntu 22.04.4 LTS x86_64
Host: NUC9VXQNX K47173-406
Kernel: 6.5.0-41-generic
ray: 2.34.0

More details

Running command:

python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k --model_tensor_parallel_degree 1 --model_attention_backend fi_vattn --model_block_size 16


The sarathi-serve process took over 25 GiB of CPU memory on a ray worker.
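For reference, a minimal way to watch per-process resident memory while the server starts, using standard Linux tools only (the process name "ray::RayWorker" is the one that shows up in the Ray OOM report below):

```bash
# Refresh the top memory consumers every 2 seconds (RSS is reported in KiB).
watch -n 2 'ps -eo pid,rss,comm --sort=-rss | head -n 10'
```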

Python traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/vattention/sarathi-lean/sarathi/entrypoints/openai_server/api_server.py", line 125, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/root/vattention/sarathi-lean/sarathi/engine/async_llm_engine.py", line 274, in from_engine_args
    engine = super().from_engine_args(**kwargs)
  File "/root/vattention/sarathi-lean/sarathi/engine/llm_engine.py", line 17, in from_engine_args
    engine = BaseLLMEngine(*engine_configs)
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 110, in __init__
    self._init_cache()
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 226, in _init_cache
    output_all = self._run_workers(
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 425, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 873, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.16.251.111, ID: 9c2b341f18d9991990f509490a5d70abf52bf71f36aefd5f7669e122) where the task (actor ID: 7a14d0306d676302c532723201000000, name=RayWorker.__init__, pid=90411, memory used=13.79GB) was running was 29.48GB / 31.01GB (0.950528), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.251.111`. To see the logs of the worker, use `ray logs worker-6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80*out -ip 172.16.251.111. Top 10 memory users:
PID     MEM(GB) COMMAND
90411   13.79   ray::RayWorker.execute_method
89330   0.36    python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k --model_tensor_...
89353   0.10    /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
40794   0.07    /usr/libexec/fwupd/fwupd
89426   0.06    /usr/bin/python /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 ...
89540   0.05    /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=...
89425   0.04    /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs...
89566   0.04    ray::IDLE
89568   0.04    ray::IDLE
89570   0.04    ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
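The workarounds listed at the end of the Ray message can be applied when launching the server. A minimal sketch, reusing the launch command from above; the threshold value 0.98 is illustrative, and note that this only delays or disables Ray's worker killing rather than fixing the underlying memory growth:

```bash
# Raise the kill threshold (default 0.95), as suggested by the Ray OOM message.
export RAY_memory_usage_threshold=0.98
# export RAY_memory_monitor_refresh_ms=0   # uncomment to disable worker killing entirely

python -m sarathi.entrypoints.openai_server.api_server \
    --model_name 01-ai/Yi-6B-200k \
    --model_tensor_parallel_degree 1 \
    --model_attention_backend fi_vattn \
    --model_block_size 16
```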