mark
+1. And the HTTP status code is 200. It's weird.
+1
In my case, I've determined that the issue is likely not due to vLLM: I can call my backend API using the server IP address without any aborted requests, but if I call the same API through my domain, I hit the aborted-request issue.
My suspicion is that nginx is interfering with the connection, but I haven't found the fix yet.
Same here with vLLM v0.3.0:
INFO 02-25 12:13:05 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.5%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:10 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.7%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:11 async_llm_engine.py:133] Aborted request cmpl-4ed3ef5554c44cd98158a194a4439744.
INFO 02-25 12:13:12 async_llm_engine.py:431] Received request cmpl-5ed32b9b9ff645718b4794cfe2cb37b0: prompt
Same here on vLLM 0.3.3, when sending 4 requests to a Mistral Instruct 7B v0.2 on a single NVIDIA A40. It occurs both with the default swap-size and with swap-size=0. Both vLLM and the client run on the same host, and requests are sent over the loopback IP 127.0.0.1.
Confirmed - this happens for me as well, with the latest vLLM, this time serving Mixtral 8x7B with AWQ. It doesn't happen for all prompts, and it seems to happen more often when hitting completions rather than chat/completions.
Ok so - @simon-mo just tagging you here to investigate.
cc @njhill
+1, the same here on vLLM 0.3.3, on request 105 (the exact number differs each time) to "BASE_URL/v1/chat/completions". Also got a 200 OK response and nothing to see in the vLLM log except: INFO 04-16 12:11:09 async_llm_engine.py:133] Aborted request cmpl-c039d6b5cc084840bab5a439c4e2144c.
I can make the request again and it works; it usually happens in the middle of a long-running test. The model used is casperhansen/mixtral-instruct-awq.
Any help is much appreciated.
Is there any indication of an exception in the response? There are a few places in async_llm_engine.py that call this method, and the exceptions are placed into the request stream but not logged by the server.
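For context, the pattern looks roughly like the simplified sketch below (not the actual vLLM code): the exception is put onto the per-request stream and only surfaces if whoever consumes that stream re-raises it, so nothing shows up in the server log.

```python
# Simplified illustration of the pattern described above; NOT vLLM's actual code.
import asyncio

class RequestStream:
    """A per-request stream that can carry outputs or an exception."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    def put(self, item) -> None:
        # `item` may be a normal output chunk or an Exception instance
        self._queue.put_nowait(item)

    async def __aiter__(self):
        while True:
            item = await self._queue.get()
            if isinstance(item, Exception):
                # Nothing is logged server-side here; the error only surfaces
                # if the consumer of the stream re-raises or reports it.
                raise item
            yield item
```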
Did some more testing and debugging; I can recreate the error quickly by running multiple prompts asynchronously. The calling code never receives a response body:
2024-04-17 14:11:30,767 [INFO] [_client.py:1729] HTTP Request: POST https://[BASE_URL]/v1/chat/completions "HTTP/1.1 200 OK"
2024-04-17 14:11:30,768 [DEBUG] [_trace.py:85] receive_response_body.started request=<Request [b'POST']>
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] receive_response_body.failed exception=ReadError(BrokenResourceError())
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] response_closed.started
2024-04-17 14:11:47,341 [DEBUG] [_trace.py:85] response_closed.complete
ConnectionResetError: [Errno 54] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
return await self._stream.receive(max_bytes=max_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 196, in receive
data = await self._call_sslobject_method(self._ssl_object.read, max_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 138, in _call_sslobject_method
data = await self.transport_stream.receive()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
raise self._protocol.exception
anyio.BrokenResourceError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
yield
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in __aiter__
async for part in self._httpcore_stream:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 361, in __aiter__
async for part in self._stream:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 337, in __aiter__
raise exc
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 329, in __aiter__
async for chunk in self._connection._receive_response_body(**kwargs):
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 198, in _receive_response_body
event = await self._receive_event(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 212, in _receive_event
data = await self._network_stream.read(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
with map_exceptions(exc_map):
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ReadError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 125, in _request
async for line in response.aiter_lines():
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 966, in aiter_lines
async for text in self.aiter_text():
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 953, in aiter_text
async for byte_content in self.aiter_bytes():
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 932, in aiter_bytes
async for raw_bytes in self.aiter_raw():
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 990, in aiter_raw
async for raw_stream_bytes in self.stream:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_client.py", line 146, in __aiter__
async for chunk in self._stream:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 248, in __aiter__
with map_httpcore_exceptions():
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 147, in __acall__
async for event in handler_method(request):
File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 235, in create_message
async for token in stream:
File "/Users/sand/Capacit/seek/seek-api-telia/src/infrastructure/resolvers/prompt_resolver.py", line 73, in astreamer
async for message in generator:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 305, in astream
raise e
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 287, in astream
async for chunk in self._astream(
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_mistralai/chat_models.py", line 369, in _astream
async for chunk in await acompletion_with_retry(
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 261, in chat_stream
async for json_response in async_response:
File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 143, in _request
raise MistralException(f"Unexpected exception ({e.__class__.__name__}): {e}") from e
mistralai.exceptions.MistralException: Unexpected exception (ReadError):
500: Internal Server Error during event handling.
vLLM arguments:
INFO 04-17 12:21:55 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/root/mixtral-instruct-awq', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=24000, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.75, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
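Roughly, the kind of concurrent load I'm running looks like the sketch below (endpoint URL, model name, and request count are placeholders, not my exact setup):

```python
# Rough reproduction sketch: fire several chat-completion requests concurrently
# at an OpenAI-compatible vLLM server and report which ones fail mid-response.
import asyncio
import httpx

BASE_URL = "http://localhost:8000"              # adjust to your deployment
MODEL = "casperhansen/mixtral-instruct-awq"     # adjust to the served model

async def one_request(client: httpx.AsyncClient, i: int) -> str:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a long story, variant {i}."}],
        "max_tokens": 1024,
    }
    try:
        resp = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=600)
        resp.raise_for_status()
        return f"request {i}: ok ({len(resp.text)} bytes)"
    except httpx.HTTPError as exc:
        # ReadError / connection resets show up here when the request is aborted
        return f"request {i}: failed with {exc!r}"

async def main(n: int = 16) -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(one_request(client, i) for i in range(n)))
    print("\n".join(results))

if __name__ == "__main__":
    asyncio.run(main())
```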
I also got the same error. This is what I ran:
python3 -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model casperhansen/mixtral-instruct-awq \
--tensor-parallel-size 4 \
--enforce-eager \
--quantization awq \
--gpu-memory-utilization 0.96
Any solution?
Having the same issue with more than 100 concurrent requests at a time.
Running the OpenAI-compatible server and continuously hitting it with more than 100 concurrent requests leads to many requests simply getting aborted with no reason given. I have tried setting the swap space to 0, to no avail.
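A client-side mitigation (not a fix for the underlying issue) is to cap the number of in-flight requests; a minimal sketch, with an arbitrary limit of 32:

```python
# Client-side mitigation sketch: bound in-flight requests with a semaphore so the
# server never sees the full burst at once. The limit of 32 is an arbitrary choice.
import asyncio
import httpx

MAX_IN_FLIGHT = asyncio.Semaphore(32)

async def bounded_post(client: httpx.AsyncClient, url: str, payload: dict) -> httpx.Response:
    async with MAX_IN_FLIGHT:
        return await client.post(url, json=payload, timeout=600)
```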
No explanation? Could it be a lack of GPU memory? Or is it because I increased --max-model-len to 32768?
It seems to be much worse when increasing the load. I don't think it's GPU memory, because I needed to reduce max-model-len to avoid out-of-memory errors running on 2 NVIDIA A10 GPUs (24 GB memory each).
It runs very slowly (unusable, I would say) under my load tests, where I simulate up to 40 humans prompting at 40 words per minute.
If anyone knows approximately when vLLM is getting optimized for AWQ models and how big a performance increase to expect, I would be very happy to know!
Let's recap the issue discussion so far: it happens with different models, with different GPUs, with different quantization methods, and under high load.
@prakashsanker's hypothesis is that this is an issue with the server and not the core engine. I believe he is right.
Maybe someone here is knowledgeable about asynchronous web serving in Python and has time to investigate?
I suppose the expected behaviour would be for the server to start returning 503 once it can't handle more requests.
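To illustrate that suggested behaviour, here is a rough sketch of a load-shedding guard written as FastAPI middleware. This is not vLLM's actual server code, and the limit of 256 is an arbitrary assumption:

```python
# Sketch of load shedding: return 503 once a fixed number of requests are in flight.
import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
_in_flight = asyncio.Semaphore(256)  # arbitrary capacity assumption

@app.middleware("http")
async def shed_load(request: Request, call_next):
    if _in_flight.locked():  # all slots taken: refuse instead of queueing indefinitely
        return JSONResponse({"error": "server overloaded, try again later"}, status_code=503)
    async with _in_flight:
        return await call_next(request)
```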
I think the problem I was running into may be different from what you folks have. I figured out mine, and it's more of a client configuration problem with HTTP calls to FastAPI. I will draft a PR and a new issue specific to the problem I ran into; to be honest, it's more that the logging is not great when a request is aborted than anything else. I also noticed that if a request is aborted, it seems like the LLM still tries to run the request anyway? I could be wrong.
Running into the same problem: vLLM will randomly abort some requests even when the batch size is relatively small, say 10. Is anyone working on this?
Same problem, but I encountered it when the prompts are long (>16000 tokens) and the number of concurrent requests is relatively small, like 4. So what is the cause here, high workload? Many thanks.
I got this in two different scenarios. It happened when load testing my setup with 1000 concurrent requests: a bunch of requests got aborted after the first couple of hundred. The other setup I saw it in was running a KEDA HTTP scaler add-on proxy in front of a k8s service I had created for my vLLM pods; in that case, even a single request caused an abort. I am not sure if something is wrong with my HTTP setup, but I thought it might be worth sharing.
@erjieyong I am facing the same issue: with nginx it fails with the same error, but without nginx it works fine. Did you find any solution?
+1
+1, because I also get this for longer prompts. Confirmed that there's no error if the prompt is just a few words, and that the error occurs if the prompt is about 500 words. Using Qwen2 72B Instruct on 8x3090.
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct --tensor-parallel-size 8 --max-model-len 12591 --swap 1
Found the problem in my case: the request timeout was too short.
Setting a larger timeout will solve it.
No, timeout=1 does not solve it.
I am experiencing a similar issue. When the prompt is short, the response is returned normally. However, if the prompt is relatively long, the server logs the following: INFO 08-19 11:15:47 async_llm_engine.py:181] Aborted request chat-12fa08eea52d4308ab46efbdeb7a300a, and the client displays Error: read ECONNRESET.
@carljones3000, have you found a solution to this problem? Is there a way to adjust the request timeout?
In my tests using Postman, where the timeout is disabled, the request still fails. This suggests that the timeout issue may be occurring on the server side.
After some attempts, I found a temporary workaround for the issue caused by the timeout: use streaming by setting "stream": true. This allows the server to return the result continuously, preventing the request from being aborted.
For more information on using stream, refer to this example.
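A minimal sketch of the workaround using the OpenAI Python client (base URL, API key, and model name are placeholders to adjust):

```python
# Sketch of the "stream": true workaround against an OpenAI-compatible vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize this long document ..."}],
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental delta; the connection stays active throughout
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```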
We use "stream": true, still with this problem in vllm 0.5.2 We updated to vllm 0.5.4, the same errors occur randomly, with Qwen2 7B Instruct.
Still an error in 0.5.5 and 0.6.0!
Still Error in 0.6.0
Also getting this error
Also getting this error
Could you report your version?
Version is 0.6.0, lowering my parallelism and deploying on more GPUs fixed things for me.
Same error, version 0.5.1
Still seeing this issue on 0.6.2. It works on some prompts when "stream": true.
It works fine for very small prompts, but for anything else we see the "Aborted request" INFO log.
GPU: 2x Tesla V100
Command used to start the server:
VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 VLLM_TRACE_FUNCTION=1 NCCL_DEBUG=TRACE NCCL_SHM_DISABLE=1 DISKCACHE_DIRECTORY=/data/ VLLM_CONFIG_ROOT=/data/ VLLM_CACHE_ROOT=/data/ XDG_CACHE_HOME=/data/ XDG_CONFIG_HOME=/data/ NUMBA_CACHE_DIR=/data/ OUTLINES_CACHE_DIR=/data/.outlines HF_HOME=/data/ TRANSFORMERS_CACHE=/data/ python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Nemo-Instruct-2407 --dtype=half --tensor-parallel-size 2 --max-model-len 1024 --enforce-eager --ssl-certfile $tls_cert --ssl-keyfile $tls_key --port 8080 --download-dir /data/ --trust-remote-code
Model - mistralai/Mistral-Nemo-Instruct-2407
We had a similar problem with long-running requests. In our case, it appears that each request is aborted almost exactly 5 minutes after it was added, e.g.:
...
INFO 10-09 11:52:56 async_llm_engine.py:208] Added request cmpl-e2fc958c1054455d9b2b6f8ce440a384-0.
...
INFO 10-09 11:57:57 async_llm_engine.py:220] Aborted request cmpl-e2fc958c1054455d9b2b6f8ce440a384-0.
...
(using the vLLM 0.5.5 Docker image)
This happens to coincide with the default torch distributed timeout, but explicitly specifying a timeout in init_process_group in vllm/distributed/parallel_state.py didn't help.
I also encountered this issue. It occurs when there are many requests and the input length is relatively long (3000-4000 tokens).
I'm having this problem even with a single request and a short prompt, if it ends up generating more than about 585 tokens.
I tried different LLMs and asked them to generate a very long answer to some arbitrary question.
This is most likely a client-side timeout, or some proxy between your client and vLLM that has a connection timeout. The request will be cancelled with this message printed in the logs when the caller closes the connection. It happens for larger number of tokens because those are the requests that take longer.
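If the abort is caused by a client-side timeout, raising the timeout explicitly on the client usually helps; a minimal sketch with the OpenAI Python client and httpx (the values are arbitrary choices, and any proxy in between needs its idle/connection timeout raised as well):

```python
# Sketch: configure a generous client-side timeout so long generations can finish.
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    timeout=httpx.Timeout(600.0, connect=10.0),  # 10 min total, 10 s to connect
)
```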
After further investigation, I can confirm that was the issue in my case. We were running vLLM (under KServe) in one Kubernetes cluster and accessing it through LiteLLM, which ran on another cluster, and there was a policy capping cross-cluster connections at 30 seconds. Thanks for answering!
Closing this since it appears to have been resolved / due to external causes.
Even if it is an external cause, I'm not sure this should be closed, as the messages logged are unclear. Would it be hard to somehow indicate in the log message that the request was aborted because the connection was closed and not due to some internal error in vLLM?
For me, with an ALB, I changed the following and one of these things fixed it:
One of these settings made our 504s go away. Perhaps someone can tell us which one 🦙
Hi, I am trying to load test vLLM on a single GPU with 20 concurrent requests. Each request passes through the LLM engine twice: once to change the prompt, and once to generate the output.
However, normally only 6 to 10 requests finish successfully. The remaining requests are usually just reported as "Aborted request" without any error logs or trace. GPU KV cache usage can reach 95% for the sequences that do finish.
I have been trying to find out the cause of the aborts, without success. Appreciate any help. Thanks!
My code to load the engine is as follows:
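The original snippet isn't reproduced above, so the following is only a hypothetical illustration of loading the async engine directly; the model name and parameters are placeholders, not the poster's actual configuration:

```python
# Hypothetical reconstruction only; the poster's actual snippet is not shown above.
# Generic way to load vLLM's async engine and run one generation per request.
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
        gpu_memory_utilization=0.9,
    )
)

async def generate(prompt: str) -> str:
    params = SamplingParams(temperature=0.7, max_tokens=512)
    final = None
    # engine.generate yields incremental RequestOutput objects for this request id
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final = output
    return final.outputs[0].text
```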