Closed shanshanpt closed 8 months ago
The llama model's max_model_len is 2048, so I forced max_model_len to 60000 in 'vllm/config.py'.
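If the goal is only to raise the context limit, a less invasive route than patching the installed package is to pass the limit through the engine arguments, assuming your vLLM build exposes max_model_len (newer releases do; I have not verified it on 0.2.1). A minimal sketch, reusing the model path from the serving command later in this thread:

# Sketch only: override the context limit via an argument instead of editing
# vllm/config.py. Newer vLLM versions may refuse a value above the model's
# derived maximum unless RoPE scaling is configured, and going far past the
# trained context can still trigger the kind of crash reported below.
from vllm import LLM

llm = LLM(
    model="/mnt/disk2/llama-2-13b-chat-hf",  # placeholder path taken from this thread
    max_model_len=60000,
)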
It happened to me too when I tried to apply dynamic NTK RoPE scaling.
Environment:
Hardware: single A100-80GB
Model: Falcon-7b
Rope scaling in config.json is:
"max_position_embeddings": 2048 rope_scaling": { "factor": 4.0, "type": "dynamic" },
vLLM version 0.2.1.post1, pytorch==2.0.1 cu117
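As a sanity check that the rope_scaling block above is actually being picked up, something like the following prints what the HF config loader sees (the path is a placeholder); vLLM derives its maximum length from these same fields:

# Sketch: confirm the edited config.json is parsed as intended.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/path/to/falcon-7b", trust_remote_code=True)  # placeholder path
print(getattr(cfg, "max_position_embeddings", None))  # expect 2048
print(getattr(cfg, "rope_scaling", None))             # expect {'factor': 4.0, 'type': 'dynamic'}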
The error is shown below:
.........312, 402, 2101, 272, 248, 2132, 4436, 25, 193].
INFO 11-19 21:25:21 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f0db2301990>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0da2545ff0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f0db2301990>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0da2545ff0>)>
Traceback (most recent call last):
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 330, in engine_step
request_outputs = await self.engine.step_async()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 191, in step_async
output = await self._run_workers_async(
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 216, in _run_workers_async
output = executor(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 323, in execute_model
output = self.model(
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/models/falcon.py", line 413, in forward
next_tokens = self.sampler(self.lm_head.weight, hidden_states,
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 44, in forward
hidden_states = _prune_hidden_states(hidden_states, input_metadata)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 129, in _prune_hidden_states
selected_token_indices = torch.tensor(selected_token_indices,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 11-19 21:25:24 async_llm_engine.py:134] Aborted request cmpl-c009662d0f1d48e7a8ff8fb0cb9f0135.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
task.result()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
has_requests_in_progress = await self.engine_step()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 330, in engine_step
request_outputs = await self.engine.step_async()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 191, in step_async
output = await self._run_workers_async(
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 216, in _run_workers_async
output = executor(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 323, in execute_model
output = self.model(
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/models/falcon.py", line 413, in forward
next_tokens = self.sampler(self.lm_head.weight, hidden_states,
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 44, in forward
hidden_states = _prune_hidden_states(hidden_states, input_metadata)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 129, in _prune_hidden_states
selected_token_indices = torch.tensor(selected_token_indices,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
await super().__call__(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
await func()
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 475, in completion_stream_generator
async for res in result_generator:
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 436, in generate
raise e
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 430, in generate
async for request_output in stream:
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 70, in __anext__
raise result
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
raise exc
File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Long prompts cannot be used at all. I have run into the same problem as the commenters above, and it is a serious blocker for me. Please help resolve it.
Side note: with HF transformers, a single A100 80GB is enough to run 12k-token inference with falcon-7b. With vLLM, however, I only use a 4k-token prompt, which is much smaller and should easily fit in 80GB of GPU RAM. So this is not an OOM problem.
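For what it's worth, the failure on my side reproduces with an offline sketch along these lines (placeholder path and a crude prompt construction, not my exact request), assuming the modified config.json with rope_scaling from above is in place so the engine accepts prompts beyond 2048 tokens:

# Rough repro sketch: a ~4k-token prompt against falcon-7b on one A100 80GB.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/falcon-7b", trust_remote_code=True)  # placeholder path
prompt = "hello " * 4000  # crude way to build a prompt in the ~4k-token range
params = SamplingParams(temperature=0.0, max_tokens=128)
out = llm.generate([prompt], params)
print(out[0].outputs[0].text)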
Prompt length: 6495, max_tokens: 21000. Commands used:

Benchmark client:
python benchmark_serving.py --backend=vllm --host=localhost --port=8888 --dataset=/mnt/vllm/benchmarks/fake_data --tokenizer=/mnt/disk2/lama-tokenizer --num-prompts=1

Server:
python -m vllm.entrypoints.api_server --model=/mnt/disk2/llama-2-13b-chat-hf/ --tokenizer=/mnt/disk2/lama-tokenizer --tensor-parallel-size=2 --swap-space=64 --engine-use-ray --worker-use-ray --max-num-batched-tokens=60000