aflah02 opened this issue 2 weeks ago
Are you using `--tensor-parallel-size 8`? 17840 seems too small (at least for 70B).
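For reference, a minimal sketch of such a launch with the offline `LLM` entrypoint (the model id and the memory/length values below are assumptions to tune, not a known-good config):

```python
from vllm import LLM

# Shard the 70B weights across all 8 GPUs of one node so more VRAM per
# GPU is left over for the KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed model id
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95,  # fraction of each GPU handed to vLLM
    max_model_len=131072,         # request the full 128K context
)
```

The same knobs exist as `--tensor-parallel-size`, `--gpu-memory-utilization`, and `--max-model-len` on the OpenAI-compatible server.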
I basically have the same question.
I was getting ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (118208) when trying to run Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf on 80 or 96 GB, so I tried 160 GB and it stopped complaining. But when I send a really large request, vLLM crashes:
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: The following operation failed in the TorchScript interpreter.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Traceback of TorchScript (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 153, in get_masked_input_and_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] vocab_mask = org_vocab_mask | added_vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] input_ = vocab_mask * (input_ - valid_offset)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return input_, ~vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] ~~~~~~~~~~~ <--- HERE
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: CUDA error: invalid configuration argument
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Traceback (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 69, in start_worker_execution_loop
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] output = self.model_runner.execute_model(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1450, in execute_model
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 429, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] model_output = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 320, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] hidden_states = self.get_input_embeddings(input_ids)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 305, in get_input_embeddings
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return self.embed_tokens(input_ids)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 391, in forward
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] masked_input, input_mask = get_masked_input_and_mask(
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: The following operation failed in the TorchScript interpreter.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Traceback of TorchScript (most recent call last):
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] File "/root/vllm_env/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 153, in get_masked_input_and_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] vocab_mask = org_vocab_mask | added_vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] input_ = vocab_mask * (input_ - valid_offset)
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] return input_, ~vocab_mask
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] ~~~~~~~~~~~ <--- HERE
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] RuntimeError: CUDA error: invalid configuration argument
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=1572) ERROR 09-10 11:08:30 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Workers pid=1570 and pid=1571 logged the identical traceback.)
INFO 09-10 11:08:33 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:08:43 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:08:53 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:03 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:13 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
INFO 09-10 11:09:23 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 25.5%, CPU KV cache usage: 0.0%.
ERROR 09-10 11:09:29 async_llm_engine.py:960] Engine iteration timed out. This should never happen!
ERROR 09-10 11:09:29 async_llm_engine.py:63] Engine background task failed
ERROR 09-10 11:09:29 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
ERROR 09-10 11:09:29 async_llm_engine.py:63] done, _ = await asyncio.wait(
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 09-10 11:09:29 async_llm_engine.py:63] return await _wait(fs, timeout, return_when, loop)
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 495, in _wait
ERROR 09-10 11:09:29 async_llm_engine.py:63] await waiter
ERROR 09-10 11:09:29 async_llm_engine.py:63] asyncio.exceptions.CancelledError
ERROR 09-10 11:09:29 async_llm_engine.py:63]
ERROR 09-10 11:09:29 async_llm_engine.py:63] During handling of the above exception, another exception occurred:
ERROR 09-10 11:09:29 async_llm_engine.py:63]
ERROR 09-10 11:09:29 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-10 11:09:29 async_llm_engine.py:63] return_value = task.result()
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
ERROR 09-10 11:09:29 async_llm_engine.py:63] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 09-10 11:09:29 async_llm_engine.py:63] self._do_exit(exc_type)
ERROR 09-10 11:09:29 async_llm_engine.py:63] File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 09-10 11:09:29 async_llm_engine.py:63] raise asyncio.TimeoutError
ERROR 09-10 11:09:29 async_llm_engine.py:63] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x7fb22bd31a20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa8341295a0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fb22bd31a20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fa8341295a0>>)>
Traceback (most recent call last):
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
done, _ = await asyncio.wait(
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 495, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/root/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-10 11:09:29 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-10 11:09:29 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-10 11:09:29 client.py:412] Traceback (most recent call last):
ERROR 09-10 11:09:29 client.py:412] File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-10 11:09:29 client.py:412] await self.check_health(socket=socket)
ERROR 09-10 11:09:29 client.py:412] File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-10 11:09:29 client.py:412] await self._send_one_way_rpc_request(
ERROR 09-10 11:09:29 client.py:412] File "/root/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-10 11:09:29 client.py:412] raise response
ERROR 09-10 11:09:29 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 257, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 253, in wrap
await func()
File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
message = await receive()
File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
await self.message_event.wait()
File "/opt/conda/lib/python3.10/asyncio/locks.py", line 213, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f7266215ae0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/root/vllm_env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/root/vllm_env/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
await response(scope, receive, send)
File "/root/vllm_env/lib/python3.10/site-packages/starlette/responses.py", line 250, in __call__
async with anyio.create_task_group() as task_group:
File "/root/vllm_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
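One hedged suggestion for the crash above (not a confirmed fix): the kernel failure fires while embedding a very long prompt, so capping how many tokens vLLM processes per scheduler step with chunked prefill may avoid the oversized kernel launch. A sketch with illustrative values; `enable_chunked_prefill` and `max_num_batched_tokens` are real engine arguments, while the model/tokenizer paths and the GPU count are assumptions:

```python
from vllm import LLM

llm = LLM(
    model="Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",     # assumed local GGUF path
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",  # HF tokenizer for the GGUF
    tensor_parallel_size=4,        # assumed; match your GPU count
    enable_chunked_prefill=True,   # split a 100K+ token prompt into chunks
    max_num_batched_tokens=8192,   # cap tokens handled per step
)
```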
Your current environment
Libraries Installed -
How would you like to use vllm
Hi, I want to run Llama 3.1 70B and 405B with 120K context length. I have access to several 8xH100 nodes, but most tutorial code snippets give errors of this style:
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (17840). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
I want an estimate of how many 8xH100 nodes I need per model so that each has enough VRAM to run at full context length.
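As a back-of-envelope estimate (a sketch using the public Llama 3.1 configs: 80 layers for 70B and 126 for 405B, 8 KV heads, head dim 128, fp16 KV cache; vLLM's reported numbers will differ somewhat):

```python
# Rough KV-cache and weight sizing for one full-context sequence.
def kv_bytes_per_token(layers: int, kv_heads: int = 8, head_dim: int = 128,
                       kv_dtype_bytes: int = 2) -> int:
    return layers * kv_heads * head_dim * 2 * kv_dtype_bytes  # 2 = K and V

CTX = 131_072
for name, layers, weights_gb in [("70B", 80, 140), ("405B", 126, 810)]:
    per_tok = kv_bytes_per_token(layers)
    print(f"{name}: {per_tok // 1024} KiB/token, "
          f"~{per_tok * CTX / 2**30:.0f} GiB KV per {CTX}-token sequence, "
          f"~{weights_gb} GB fp16 weights")
# 70B:  320 KiB/token -> ~40 GiB KV; 140 GB weights + 40 GiB KV fits one
#       8xH100 node (8 x 80 GB = 640 GB) with headroom for more sequences.
# 405B: 504 KiB/token -> ~63 GiB KV; 810 GB fp16 weights alone exceed one
#       node, so you need at least two nodes (pipeline parallel across
#       nodes, tensor parallel within) or the FP8 weights (~405 GB).
```

So, roughly: one 8xH100 node serves the 70B at full context, while the 405B needs two or more nodes in fp16, or FP8 weights to fit on a single node.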