[Open] xllrun opened this issue 1 week ago
Command: vllm serve /home/basemodels/01-ai/Yi-1.5-34B-Chat-16K --served-model-name "yimodel" --enable-lora --lora-modules task1=/home/lora_model_files/sft_task1 task2=/home/lora_model_files/sft_task2 task3=/home/lora_model_files/sft_task3 --port 8776 --disable-log-requests --tensor_parallel_size=4 --gpu-memory-utilization 0.95 --max-loras 4
Your current environment
vLLM version: 0.6.2. GPUs: RTX 4500 Ada, 24 GB VRAM × 4 (also an A100 with 80 GB VRAM). Model: Yi-1.5-34B-Chat-16K.
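For scale, a rough back-of-envelope sketch (assuming fp16/bf16 weights, ignoring KV cache, activations, CUDA-graph pools, and LoRA overhead) shows the base model weights alone occupy most of each 24 GB card under `--tensor_parallel_size=4`:

```shell
# Approximate per-GPU weight footprint of a 34B-parameter model
# sharded across 4 GPUs in fp16/bf16 (2 bytes per parameter).
PARAMS_B=34          # parameters, in billions (Yi-1.5-34B)
BYTES_PER_PARAM=2    # fp16 / bf16
TP=4                 # --tensor_parallel_size
PER_GPU_GIB=$(( PARAMS_B * BYTES_PER_PARAM * 1000000000 / TP / 1024 / 1024 / 1024 ))
echo "~${PER_GPU_GIB} GiB of weights per GPU (integer GiB, actual ~15.8)"
```

That leaves only a few GiB per 24 GB card for the 16K-context KV cache, LoRA weights, and activations, which is consistent with the OOM below.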
Model Input Dumps
err_execute_model_input_20241021-023236.zip
🐛 Describe the bug
INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=232) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=230) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=231) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=230) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
(VllmWorkerProcess pid=232) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
(VllmWorkerProcess pid=231) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
WARNING 10-21 02:32:36 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
(VllmWorkerProcess pid=232) ERROR 10-21 02:32:36 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: Error in model execution (input dumped to /tmp/err_execute_model_input_20241021-023236.pkl): CUDA out of memory. Tried to allocate 220.00 MiB. GPU 3 has a total capacity of 23.65 GiB of which 100.62 MiB is free. Process 132987 has 23.53 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, with 30.64 MiB allocated in private pools (e.g., CUDA Graphs), and 376.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1590, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 448, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 261, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 89, in forward
    x, _ = self.down_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 983, in forward
    output_parallel = self.apply(input_parallel)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 955, in apply
    output = self.base_layer.quant_method.apply(self.base_layer, x)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply
    return F.linear(x, layer.weight, bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 220.00 MiB. GPU 3 has a total capacity of 23.65 GiB of which 100.62 MiB is free. Process 132987 has 23.53 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, with 30.64 MiB allocated in private pools (e.g., CUDA Graphs), and 376.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:
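For reference, a mitigation sketch along the lines the error message itself suggests: reduce what competes with the KV cache on the 24 GB cards. The flag values below are illustrative, not tuned for this setup:

```shell
# 1. Let the CUDA caching allocator grow segments instead of fragmenting
#    (suggested directly by the OOM message).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# 2. Re-launch with a smaller context window (the KV cache for the full 16K
#    context may not fit next to ~16 GiB of weights per GPU), a slightly
#    lower memory-utilization target, and eager mode to skip the CUDA-graph
#    private pools seen in the report.
vllm serve /home/basemodels/01-ai/Yi-1.5-34B-Chat-16K \
    --served-model-name "yimodel" \
    --enable-lora \
    --lora-modules task1=/home/lora_model_files/sft_task1 \
                   task2=/home/lora_model_files/sft_task2 \
                   task3=/home/lora_model_files/sft_task3 \
    --port 8776 --disable-log-requests \
    --tensor_parallel_size=4 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --max-loras 4 \
    --enforce-eager
```

Whether 8192 tokens and 0.90 utilization are acceptable depends on the workload; this is only a starting point, not a verified fix.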