vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

does cpu memory have to be greater than model size when using distributed inference? #738

Closed: lynnleelhl closed this issue 1 year ago

lynnleelhl commented 1 year ago

I have 2 hosts, each with 16 GB of CPU memory and 24 GB of GPU memory. When I tried to load vicuna-13b, it got OOM. Here's the error message:

(raylet, ip=10.0.6.140) [2023-08-11 10:13:54,613 E 30 30] (raylet) node_manager.cc:3084: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 73fcbc20ed501941efd210b9f60a78dbec0cf0f75fec41d65f19b505, IP: 10.0.6.140) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.0.6.140`
(raylet, ip=10.0.6.140)
(raylet, ip=10.0.6.140) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/api_server.py", line 78, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 99, in __init__
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 170, in _init_workers_ray
    self._run_workers(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 474, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2522, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

The config is:

INFO 08-11 10:13:01 llm_engine.py:70] Initializing an LLM engine with config: model='lmsys/vicuna-13b-v1.3', tokenizer='lmsys/vicuna-13b-v1.3', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)

Does each host's CPU memory have to be greater than the model size, even when using distributed inference?
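
For scale, here is my own rough back-of-the-envelope estimate of the fp16 weight size (approximate numbers, not taken from the docs):

# Rough estimate of the vicuna-13b fp16 weight footprint; the real checkpoint may differ slightly.
params = 13e9           # ~13 billion parameters
bytes_per_param = 2     # fp16

total_gb = params * bytes_per_param / 1e9
per_shard_gb = total_gb / 2  # tensor_parallel_size=2 splits the weights across 2 hosts

print(f"full model: ~{total_gb:.0f} GB, per-host shard: ~{per_shard_gb:.0f} GB")
# full model: ~26 GB, per-host shard: ~13 GB -- the full model is larger than the 16 GB of CPU RAM per host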

Gregory-Ledray commented 1 year ago

I am not a maintainer and I do not know the answer.

One way to test your hypothesis would be to set the --gpu-memory-utilization parameter so that GPU memory utilization stays just below 16 GB. If that prevents the problem from happening, that would imply you're right.
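
Something along these lines (an untested sketch on my part, using the offline LLM API rather than the api_server entrypoint you're running; 16/24 ≈ 0.66 on a 24 GB card):

# Untested sketch: cap vLLM's GPU memory use at roughly 16 GB on a 24 GB GPU (16/24 ≈ 0.66).
# gpu_memory_utilization is the fraction of each GPU's memory that vLLM is allowed to use.
from vllm import LLM

llm = LLM(
    model="lmsys/vicuna-13b-v1.3",
    tensor_parallel_size=2,        # same two-worker setup as in your config
    trust_remote_code=True,
    dtype="float16",
    gpu_memory_utilization=0.66,   # ~16 GB out of 24 GB per GPU
)

The same knob should be available on your api_server command line as --gpu-memory-utilization 0.66.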

lynnleelhl commented 1 year ago

Closing, as I found the error disappeared with the newest master code.