[Open] xllrun opened this issue 1 week ago
Command: vllm serve /home/basemodels/01-ai/Yi-1.5-34B-Chat-16K --served-model-name "yimodel" --enable-lora --lora-modules task1=/home/lora_model_files/sft_task1 task2=/home/lora_model_files/sft_task2 task3=/home/lora_model_files/sft_task3 --port 8776 --disable-log-requests --tensor_parallel_size=4 --gpu-memory-utilization 0.95 --max-loras 4
Your current environment
vLLM version: 0.6.2. GPUs: RTX 4500 Ada, 24 GB VRAM × 4 (also an A100 with 80 GB VRAM). Model: Yi-1.5-34B-Chat-16K.
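For scale, a rough back-of-envelope sketch (assuming fp16/bf16 weights, ignoring KV cache, activations, CUDA-graph pools, and LoRA overhead) shows the base model weights alone occupy most of each 24 GB card under `--tensor_parallel_size=4`:

```shell
# Approximate per-GPU weight footprint of a 34B-parameter model
# sharded across 4 GPUs in fp16/bf16 (2 bytes per parameter).
PARAMS_B=34          # parameters, in billions (Yi-1.5-34B)
BYTES_PER_PARAM=2    # fp16 / bf16
TP=4                 # --tensor_parallel_size
PER_GPU_GIB=$(( PARAMS_B * BYTES_PER_PARAM * 1000000000 / TP / 1024 / 1024 / 1024 ))
echo "~${PER_GPU_GIB} GiB of weights per GPU (integer GiB, actual ~15.8)"
```

That leaves only a few GiB per 24 GB card for the 16K-context KV cache, LoRA weights, and activations, which is consistent with the OOM below.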
Model Input Dumps
err_execute_model_input_20241021-023236.zip
🐛 Describe the bug
INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=232) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=230) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=231) INFO 10-21 02:32:36 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl...
(VllmWorkerProcess pid=230) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
(VllmWorkerProcess pid=232) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
(VllmWorkerProcess pid=231) INFO 10-21 02:32:36 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241021-023236.pkl.
WARNING 10-21 02:32:36 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
(VllmWorkerProcess pid=232) ERROR 10-21 02:32:36 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: Error in model execution (input dumped to /tmp/err_execute_model_input_20241021-023236.pkl): CUDA out of memory. Tried to allocate 220.00 MiB. GPU 3 has a total capacity of 23.65 GiB of which 100.62 MiB is free. Process 132987 has 23.53 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, with 30.64 MiB allocated in private pools (e.g., CUDA Graphs), and 376.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1590, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 448, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 261, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 89, in forward
    x, _ = self.down_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 983, in forward
    output_parallel = self.apply(input_parallel)
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 955, in apply
    output = self.base_layer.quant_method.apply(self.base_layer, x)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 135, in apply
    return F.linear(x, layer.weight, bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 220.00 MiB. GPU 3 has a total capacity of 23.65 GiB of which 100.62 MiB is free. Process 132987 has 23.53 GiB memory in use. Of the allocated memory 21.80 GiB is allocated by PyTorch, with 30.64 MiB allocated in private pools (e.g., CUDA Graphs), and 376.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:
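For reference, a mitigation sketch along the lines the error message itself suggests: reduce what competes with the KV cache on the 24 GB cards. The flag values below are illustrative, not tuned for this setup:

```shell
# 1. Let the CUDA caching allocator grow segments instead of fragmenting
#    (suggested directly by the OOM message).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# 2. Re-launch with a smaller context window (the KV cache for the full 16K
#    context may not fit next to ~16 GiB of weights per GPU), a slightly
#    lower memory-utilization target, and eager mode to skip the CUDA-graph
#    private pools seen in the report.
vllm serve /home/basemodels/01-ai/Yi-1.5-34B-Chat-16K \
    --served-model-name "yimodel" \
    --enable-lora \
    --lora-modules task1=/home/lora_model_files/sft_task1 \
                   task2=/home/lora_model_files/sft_task2 \
                   task3=/home/lora_model_files/sft_task3 \
    --port 8776 --disable-log-requests \
    --tensor_parallel_size=4 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --max-loras 4 \
    --enforce-eager
```

Whether 8192 tokens and 0.90 utilization are acceptable depends on the workload; this is only a starting point, not a verified fix.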