Closed: pseudotensor closed this issue 7 months ago
When I try to use a different model, I get other errors:
export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git --upgrade
export CUDA_VISIBLE_DEVICES=1
python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model casperhansen/mixtral-instruct-awq --quantization awq --dtype auto --seed 1234 --tensor-parallel-size=1 --max-num-batched-tokens=66560 --max-log-len=100
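For context, this is roughly the kind of client call that hits the failing /v1/completions route below. It is only a sketch: the host/port come from the launch command above, and the prompt and max_tokens are arbitrary placeholders, not what my client actually sends.

# Minimal reproduction sketch (assumptions: the server launched above is
# reachable on localhost:5002; prompt/max_tokens are placeholders).
import requests

payload = {
    "model": "casperhansen/mixtral-instruct-awq",
    "prompt": "Hello, world",
    "max_tokens": 16,
}
resp = requests.post("http://localhost:5002/v1/completions", json=payload, timeout=60)
print(resp.status_code)  # 500 when the request hits the ErrorResponse path in the traceback below
print(resp.text)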
INFO: 52.0.25.199:53634 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
await self.asgi_callable(scope, receive, wrapped_send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
await self.middleware_stack(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
await route.handle(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/ephemeral/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in create_completion
return JSONResponse(content=generator.model_dump(),
AttributeError: 'ErrorResponse' object has no attribute 'model_dump'
Same exception for https://huggingface.co/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
INFO 01-27 01:46:22 llm_engine.py:316] # GPU blocks: 18691, # CPU blocks: 2048
INFO 01-27 01:46:23 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-27 01:46:23 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decreas>
INFO 01-27 01:46:36 model_runner.py:689] Graph capturing finished in 13 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-27 01:46:36 serving_chat.py:260] Using default chat template:
INFO 01-27 01:46:36 serving_chat.py:260] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 01-27 01:46:36 serving_chat.py:260] ' + message['content'] + '<|im_end|>' + '
INFO 01-27 01:46:36 serving_chat.py:260] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 01-27 01:46:36 serving_chat.py:260] ' }}{% endif %}
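(For readers unfamiliar with ChatML, a small sketch of what the default template printed above renders, assuming jinja2 is installed and using a placeholder message.)

# Sketch only: renders the default chat template from the log above.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

print(Template(CHAT_TEMPLATE).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
))
# Output:
# <|im_start|>user
# Hello<|im_end|>
# <|im_start|>assistant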
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [277174]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
INFO 01-27 01:46:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
[... the same idle-throughput log line repeats every ~10 seconds from 01:46:56 through 01:51:06 ...]
INFO 01-27 01:51:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:51:22 async_llm_engine.py:433] Received request cmpl-f9ebefcdb0b940ff9edb641a0cc03634-0: prompt: None, prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.14000000000000012, frequency_penalty=0.0, repetition_penalt>
INFO: 24.4.148.180:50012 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-27 01:51:22 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in wrap
await func()
File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
message = await receive()
File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
await self.message_event.wait()
File "/ephemeral/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f318c3b1900
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
On the 0.2.7 release, I never get any answer back even though it seems to generate.
TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ on 0.2.7 just hangs during generation and never shows any output.
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ/discussions/3
I encountered the same problem using the Qwen/Qwen-7B-Chat model. It seems that vllm.entrypoints.openai.api_server.py is not compatible. When the service first starts, there is also an error: AttributeError: 'AsyncLLMEngine' object has no attribute 'do_log_stats'. I am already using the latest 0.2.7 release.
The error AttributeError: 'ErrorResponse' object has no attribute 'model_dump' seems related to the version of pydantic, which was recently upgraded in #2531.
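As a quick illustration of that pydantic v1/v2 difference (a sketch only; the class below is a placeholder, not vLLM's actual ErrorResponse schema):

# model_dump() exists only on pydantic v2 models; on v1 the equivalent is .dict(),
# so calling generator.model_dump() under pydantic v1 raises this AttributeError.
import pydantic
from pydantic import BaseModel

class ErrorResponse(BaseModel):  # placeholder stand-in, not vLLM's actual schema
    message: str
    code: int = 500

resp = ErrorResponse(message="invalid request")
if pydantic.VERSION.startswith("1."):
    print(resp.dict())        # pydantic v1 API
else:
    print(resp.model_dump())  # pydantic v2 API, what newer vLLM calls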
@pseudotensor are you still experiencing this issue?
Not lately. Can close.
@pseudotensor When using a Qwen1.5 AWQ model, I still have this problem.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f8387440e50
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
This doesn't seem to be model- or quantization-specific; I got the same error just now using TechxGenus/starcoder2-7b-GPTQ.
I'm using the latest vllm docker image with the following arguments:
--model "TechxGenus/starcoder2-7b-GPTQ" --revision 48dc06e6a6df8a8e8567694ead23f59204fa0d26 -q marlin --enable-chunked-prefill --max-model-len 4096 --gpu-memory-utilization 0.3
Last week it still worked, so I'm not sure what the reason is.
tgi-scripts-vllm-starcoder-2-7b-gptq-1 | INFO 05-28 09:47:25 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='TechxGenus/starcoder2-7b-GPTQ', speculative_config=None, tokenizer='TechxGenus/starcoder2-7b-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=48dc06e6a6df8a8e8567694ead23f59204fa0d26, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TechxGenus/starcoder2-7b-GPTQ)
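(For anyone running outside Docker, a rough sketch of the equivalent direct invocation with the same arguments; this assumes a local vLLM ~0.4.x install and is not the exact entrypoint the image uses.)

# Launches the OpenAI-compatible server with the same flags passed to the container.
import subprocess, sys

subprocess.run([
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", "TechxGenus/starcoder2-7b-GPTQ",
    "--revision", "48dc06e6a6df8a8e8567694ead23f59204fa0d26",
    "-q", "marlin",
    "--enable-chunked-prefill",
    "--max-model-len", "4096",
    "--gpu-memory-utilization", "0.3",
], check=True)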
FWIW, I had this CancelledError: Cancelled by cancel scope error when sending several concurrent requests, and I solved it by changing --gpu-memory-utilization 0.99 to --gpu-memory-utilization 0.96, so it might have to do with some sort of silent allocation failure that is helped by having a bit of a "buffer". In my case I had two 4090s on Runpod using vllm/vllm-openai:v0.4.3, running a 70B GPTQ model. These were the other command-line options that I had:
--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2 --max-model-len 4096 --kv-cache-dtype fp8
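The load pattern was roughly this kind of fan-out (a sketch, not my exact client; the URL, model name, and prompt are placeholders):

# Fires several completion requests concurrently; with --gpu-memory-utilization 0.99
# some of these failed with the cancel-scope error, with 0.96 they all came back fine.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {"model": "some-70b-gptq-model", "prompt": "Hello", "max_tokens": 64}

def one_request(i: int) -> int:
    return requests.post(URL, json=PAYLOAD, timeout=120).status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(one_request, range(8))))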
@hmellor @pseudotensor Maybe this could be re-opened for the benefit of others, since two other people have commented with similar issues since it was closed.
Potentially related:
The same generic error in starlette does not mean the problem is the same, especially when the original error was reported 4 months ago using a (now) old version of vLLM.
The different traces shared in the thread comments are not that similar.
If someone is experiencing the error, they should open an issue with instructions on how to reproduce it.
Any request, even a simple one, leads to: