I also run into this problem when the number of generated tokens exceeds 4096. The model also starts outputting gibberish, and that gibberish output might be what makes the kernel unstable. For now I limit max_tokens to 4096 and the error no longer occurs.
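For reference, here is the workaround as a minimal Python sketch; the model path and prompt are placeholders, and only the max_tokens cap reflects the actual change:

from vllm import LLM, SamplingParams

# Cap generation at 4096 tokens; longer generations were where the
# gibberish output and the illegal memory access started to appear.
llm = LLM(model="meta-llama/Llama-2-13b-hf")  # placeholder model path
params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Write a short story about a robot."], params)
print(outputs[0].outputs[0].text)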
We have this same issue and we're only trying to generate 1024 tokens. It's extremely frustrating. @WoosukKwon
I am seeing this error when running:
After I get this error the first time, it throws the same error on small prompts as well, until I restart. So I am forced to set --max-num-batched-tokens to 8129.
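In case it helps anyone, here is a minimal sketch of applying that limit programmatically; the CLI flag corresponds to the max_num_batched_tokens engine argument, and the model path and value here are illustrative:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Equivalent of passing --max-num-batched-tokens on the command line;
# keep it at or below the value that is stable on your GPU.
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-2-13b-hf",  # placeholder model path
    max_num_batched_tokens=8192,        # illustrative value
)
engine = AsyncLLMEngine.from_engine_args(engine_args)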
Any ideas how to work around this error?
Any updates @WoosukKwon? This bug is causing us problems in production.
I am encountering a similar issue on an A100 80G, and I believe it has something to do with --max-num-batched-tokens. The stack trace is a bit different:
INFO 09-27 23:24:07 llm_engine.py:72] Initializing an LLM engine with config: model='/storage/vllm/models/Xwin-LM-70B-V0.1-AWQ', tokenizer='/storage/vllm/models/Xwin-LM-70B-V0.1-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
File "/local-llm-server/other/vllm/vllm_api_server.py", line 103, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
engine = cls(engine_args.worker_use_ray,
File "/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
return engine_class(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
self._init_cache()
File "/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 188, in _init_cache
num_blocks = self._run_workers(
File "/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 688, in _run_workers
output = executor(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 108, in profile_num_available_blocks
self.model(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 293, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 253, in forward
hidden_states = layer(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 200, in forward
hidden_states = self.self_attn(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 151, in forward
attn_output = self.attn(positions, q, k, v, k_cache, v_cache,
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 330, in forward
return super().forward(
File "/venv/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 205, in forward
self.multi_query_kv_attention(
File "/venv/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 109, in multi_query_kv_attention
key = torch.repeat_interleave(key, self.num_queries_per_kv, dim=1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Setting --max-num-batched-tokens higher than 5120 seems to cause this exception. If I remember correctly, I got a similar issue on the A6000, but there max-num-batched-tokens could be set over 8000. I don't think I've ever encountered this issue on my A4000, and IIRC I had it at something like 9999.
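As the PyTorch message above suggests, running with synchronous kernel launches usually makes the failing call show up at the correct frame. A minimal sketch, assuming the environment variable is set before anything initializes CUDA:

import os

# Must be set before the first CUDA call; setting it before importing
# vLLM (and torch) is the safe way to guarantee that.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# Model path and settings are illustrative; reuse whatever configuration
# triggers the crash so the synchronous trace points at the real kernel.
llm = LLM(model="/storage/vllm/models/Xwin-LM-70B-V0.1-AWQ", quantization="awq")
outputs = llm.generate(["hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)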
I believe this should have been fixed in the latest 0.2.0 release.
same bug
Ran into this problem in 0.2.5 on A4500 card.
@robcaulk Could you share a reproducible script? Thanks.
Still happened in version 0.6.1.post2
While serving the CodeLlama 13B base model (CodeLlama-13b-hf) with the v1/completions API on 1 A100, I encountered the following CUDA memory issue. The same thing happened with the 34B base model (CodeLlama-34b-hf), too. However, I did not encounter such an issue with any of the CodeLlama instruct series (with the same starting config). To make it easier to debug, I attached the complete log here (it is too big, so I had to upload it somewhere else).

The error log:

Here is the script and the docker container (with vllm==0.1.5) I used to spin up the server.
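For anyone trying to reproduce: a request against that v1/completions endpoint would look roughly like this (host, port, model name, and sampling values are placeholders, not taken from the attached script):

import requests

# Minimal completion request against the locally running vLLM server;
# adjust the model name and payload to match your deployment.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "CodeLlama-13b-hf",
        "prompt": "def fibonacci(n):",
        "max_tokens": 512,
        "temperature": 0.2,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])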