vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Aborted request without reason #2484

Closed erjieyong closed 3 weeks ago

erjieyong commented 9 months ago

Hi, I am trying to load test vLLM on a single GPU with 20 concurrent requests. Each request passes through the LLM engine twice: once to change the prompt, and once to generate the output.

However, typically only 6 to 10 of the requests finish successfully. The remaining requests are simply logged as "Aborted request" with no error message or traceback. GPU KV cache usage can reach up to 95% while the successful requests run.

INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 0e89f1d2d94c4a039f868222c100cc8a.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request be67046b843244b5bf1ed3d2ff2f5a02.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request b532ed57647945869a4cae499fe54f23.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 6c56897bbc9d4a808b8e056c39baf91b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 75b645c69d7449509f68ca23b34f1048.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request eb87d6473a9d4b3699ca0cc236900248.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request ca15a251849c45329825ca95a2fce96b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request c42bbea2f781469e89e576f98e618243.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 14b94d4ffd6646d69d4c2ad36d7dfd50.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 83c7dd9cbe9d4f6481b26403f46f1731.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 3e98245d88534c53be230aa25c56d99a.
INFO 01-17 03:11:26 async_llm_engine.py:111] Finished request 49b84ef96af44f069056b2fc43526cdd.

I have been trying to find out what causes these aborts, without success. Any help is appreciated. Thanks!

My code to load the engine is as follows:

from vllm import AsyncEngineArgs, AsyncLLMEngine

# LLM_BASEMODEL is the model name/path, defined elsewhere in my code
engine_args = AsyncEngineArgs(
    model=LLM_BASEMODEL,
    tokenizer=LLM_BASEMODEL,
    tokenizer_mode="auto",
    trust_remote_code=True,
    download_dir=None,
    load_format="auto",
    dtype="auto",
    seed=0,
    max_model_len=None,
    worker_use_ray=False,
    pipeline_parallel_size=1,
    tensor_parallel_size=1,
    max_parallel_loading_workers=None,
    block_size=16,
    swap_space=4,
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=None,
    max_num_seqs=256,
    max_paddings=256,
    disable_log_stats=False,
    revision=None,
    tokenizer_revision=None,
    quantization="gptq",
    enforce_eager=False,
    max_context_len_to_capture=8192,
    engine_use_ray=False,
    disable_log_requests=False,
    max_log_len=None,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
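
For reference, here is a minimal sketch (not my actual load-test code) of how each request makes two passes through the engine. The prompt-rewriting template, sampling parameters, and helper names are illustrative assumptions, using AsyncLLMEngine.generate from this vLLM version:

import asyncio
import uuid

from vllm import SamplingParams

async def run_request(engine: AsyncLLMEngine, user_prompt: str) -> str:
    sampling_params = SamplingParams(max_tokens=512)

    async def generate(prompt: str) -> str:
        request_id = str(uuid.uuid4())
        final_output = None
        # AsyncLLMEngine.generate yields incremental RequestOutput objects;
        # the last one holds the full generated text.
        async for output in engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return final_output.outputs[0].text

    # First pass: rewrite the prompt. Second pass: generate the answer.
    rewritten = await generate(f"Rewrite this prompt: {user_prompt}")
    return await generate(rewritten)

async def load_test(engine: AsyncLLMEngine, prompts: list[str]) -> list[str]:
    # Fire all 20 requests concurrently, as in the load test described above.
    return await asyncio.gather(*(run_request(engine, p) for p in prompts))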
ARayOfsunshine commented 9 months ago

mark

HermitSun commented 9 months ago

+1. And the HTTP status code is 200. It's weird.

CoLdIcE42 commented 9 months ago

+1

erjieyong commented 9 months ago

For my case, I've determined that the issue is likely not due to vLLM. My reasoning: I can call my backend API via the server IP address without any aborted requests, whereas calling the same API through my domain triggers the aborted-request issue.

My suspicion is that nginx is interfering with the connection, but I haven't found the fix yet.

vecorro commented 8 months ago

Same here with vLLM v0.3.0:

INFO 02-25 12:13:05 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.5%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:10 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.7%, CPU KV cache usage: 0.0%
INFO 02-25 12:13:11 async_llm_engine.py:133] Aborted request cmpl-4ed3ef5554c44cd98158a194a4439744.
INFO 02-25 12:13:12 async_llm_engine.py:431] Received request cmpl-5ed32b9b9ff645718b4794cfe2cb37b0: prompt
ghpu commented 8 months ago

Same here on vLLM 0.3.3, when sending 4 requests to a Mistral Instruct 7B v0.2 on a single NVIDIA A40. It occurs both with the default swap-space and with swap-space=0. Both vLLM and the client run on the same host, using the loopback IP 127.0.0.1 to send requests.

prakashsanker commented 7 months ago

Confirmed, this happens for me as well with the latest vLLM, this time serving Mixtral 8x7B with AWQ. It doesn't happen for all prompts, and it seems to happen more often when hitting completions rather than chat/completions.

prakashsanker commented 7 months ago

Ok, so:

  1. This is a clear problem with the implementation of the inference server rather than the core vLLM engine. I had a prompt that was consistently failing when hitting the OpenAI-compatible server.
  2. I copied/pasted the prompt and ran it with offline inference, and the prompt worked.
prakashsanker commented 7 months ago

@simon-mo just tagging you here to investigate.

simon-mo commented 7 months ago

cc @njhill

ptsand commented 6 months ago

+1, the same here on vLLM 0.3.3, on request 105 (the exact number differs from run to run) to "BASE_URL/v1/chat/completions". I also got a 200 OK response and nothing to see in the vLLM log except: INFO 04-16 12:11:09 async_llm_engine.py:133] Aborted request cmpl-c039d6b5cc084840bab5a439c4e2144c.

I can make the request again and it works; it usually happens in the middle of a long-running test. The model used is casperhansen/mixtral-instruct-awq.

Any help is much appreciated.

br3no commented 6 months ago

Is there any indication of an exception in the response?

There are a few places in async_llm_engine.py that call this abort method, and the exceptions are placed into the request stream but not logged by the server.

ptsand commented 6 months ago

I did some more testing and debugging; I can recreate the error quickly by running multiple prompts asynchronously. The calling code never receives a response body:

2024-04-17 14:11:30,767 [INFO] [_client.py:1729] HTTP Request: POST https://[BASE_URL]/v1/chat/completions "HTTP/1.1 200 OK"
2024-04-17 14:11:30,768 [DEBUG] [_trace.py:85] receive_response_body.started request=<Request [b'POST']>
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] receive_response_body.failed exception=ReadError(BrokenResourceError())
2024-04-17 14:11:47,340 [DEBUG] [_trace.py:85] response_closed.started
2024-04-17 14:11:47,341 [DEBUG] [_trace.py:85] response_closed.complete
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 34, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 196, in receive
    data = await self._call_sslobject_method(self._ssl_object.read, max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/streams/tls.py", line 138, in _call_sslobject_method
    data = await self.transport_stream.receive()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1212, in receive
    raise self._protocol.exception
anyio.BrokenResourceError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 66, in map_httpcore_exceptions
    yield
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in __aiter__
    async for part in self._httpcore_stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 361, in __aiter__
    async for part in self._stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 337, in __aiter__
    raise exc
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 329, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 198, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_async/http11.py", line 212, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 31, in read
    with map_exceptions(exc_map):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 125, in _request
    async for line in response.aiter_lines():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 966, in aiter_lines
    async for text in self.aiter_text():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 953, in aiter_text
    async for byte_content in self.aiter_bytes():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 932, in aiter_bytes
    async for raw_bytes in self.aiter_raw():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_models.py", line 990, in aiter_raw
    async for raw_stream_bytes in self.stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_client.py", line 146, in __aiter__
    async for chunk in self._stream:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 248, in __aiter__
    with map_httpcore_exceptions():
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 147, in __acall__
    async for event in handler_method(request):
  File "/Users/sand/Capacit/seek/seek-api-telia/src/api/controllers/v1/conversation.py", line 235, in create_message
    async for token in stream:
  File "/Users/sand/Capacit/seek/seek-api-telia/src/infrastructure/resolvers/prompt_resolver.py", line 73, in astreamer
    async for message in generator:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 305, in astream
    raise e
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 287, in astream
    async for chunk in self._astream(
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/langchain_mistralai/chat_models.py", line 369, in _astream
    async for chunk in await acompletion_with_retry(
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 261, in chat_stream
    async for json_response in async_response:
  File "/opt/homebrew/anaconda3/envs/seek/lib/python3.11/site-packages/mistralai/async_client.py", line 143, in _request
    raise MistralException(f"Unexpected exception ({e.__class__.__name__}): {e}") from e
mistralai.exceptions.MistralException: Unexpected exception (ReadError):

500: Internal Server Error during event handling.

vLLM arguments:

INFO 04-17 12:21:55 api_server.py:228] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/root/mixtral-instruct-awq', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=24000, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.75, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
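
For what it's worth, here is a minimal sketch of the same call made directly with httpx and all client timeouts disabled, to rule out a client-side timeout (which later comments suggest is the likely cause). The URL is a placeholder and the payload is illustrative:

import httpx

payload = {
    "model": "/root/mixtral-instruct-awq",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 512,
}

# httpx.Timeout(None) disables connect/read/write/pool timeouts entirely,
# so any abort seen in the vLLM log cannot come from this client.
with httpx.Client(timeout=httpx.Timeout(None)) as client:
    r = client.post("https://BASE_URL/v1/chat/completions", json=payload)
    r.raise_for_status()
    print(r.json()["choices"][0]["message"]["content"])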
cocoza4 commented 6 months ago

I also got the same error. This is what I ran:

python3 -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model casperhansen/mixtral-instruct-awq \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --quantization awq \
    --gpu-memory-utilization 0.96

Any solution?

davidgxue commented 6 months ago

Having the same issue with more than 100 concurrent requests at a time.

Running the OpenAI-compatible server and continuously hitting it with more than 100 concurrent requests leads to many requests simply being aborted with no specified reason. I have tried setting swap space to 0, to no avail.

ArnaudHureaux commented 6 months ago

No explanation? Could it be a lack of GPU memory? Or is it because I increased --max-model-len to 32768?

ptsand commented 6 months ago

It seems to be much worse when increasing the load. I don't think it's GPU memory, because I already had to reduce max-model-len to avoid out-of-memory errors running on 2 NVIDIA A10 GPUs (24 GB memory each).

It runs very slowly (unusably so, I would say) under my load tests, where I simulate up to 40 humans prompting at 40 words per minute.

If anyone knows approximately when vLLM will be optimized for AWQ models and how big a performance increase to expect, I would be very happy to know!

br3no commented 6 months ago

Let's recap the discussion so far: it happens with different models, with different GPUs, and with different quantization methods, and it happens under high load.

@prakashsanker’s hypothesis is that this is an issue with the server and not the core engine. I believe he is right.

Maybe someone here is knowledgeable about asynchronous webserving in Python and has time to investigate?

I suppose the expected behaviour would be for the server to start returning 503 once it can’t handle more requests.
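
For illustration, here is a rough sketch of that suggested behaviour (this is not how the vLLM OpenAI server currently works): a concurrency cap enforced in FastAPI middleware that returns 503 instead of letting requests queue until something times out. The limit value is an arbitrary assumption:

import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

MAX_IN_FLIGHT = 256  # arbitrary cap; tune to the deployment
app = FastAPI()
in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.middleware("http")
async def reject_when_overloaded(request: Request, call_next):
    if in_flight.locked():
        # All slots taken: fail fast with 503 rather than letting the
        # request sit until a proxy or client timeout aborts it.
        return JSONResponse({"error": "server overloaded"}, status_code=503)
    async with in_flight:
        return await call_next(request)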

davidgxue commented 6 months ago

I think the problem I was running into may be different from what you folks have. I figured out mine, and it's more of a client configuration problem with HTTP calls to FastAPI. I will draft a PR and a new issue specific to the problem I ran into; to be honest, it's more that the logging is not great when a request is aborted than anything else. I also noticed that when a request is aborted, the LLM seems to keep running it anyway, though I could be wrong.

Penglikai commented 6 months ago

Running into the same problem: vLLM will randomly abort some requests even when the batch size is relatively small, say 10. Is anyone working on this?

senbinyu commented 5 months ago

Same problem, but I encountered it when the prompts are long (>16000 tokens) and the number of concurrent requests is relatively small, like 4. So what is the reason here, high workload? Many thanks.

yosefmih commented 4 months ago

I got this in two different scenarios. It happened when load testing my setup with 1000 concurrent requests: a bunch of requests got aborted after the first couple of hundred. The other setup I saw this in was running a KEDA HTTP scaler add-on proxy in front of a k8s service I had created for my vLLM pods; in that case, even a single request gets aborted. I am not sure whether something is wrong with my HTTP setup, but I thought it might be worth sharing.

MBhartiya commented 4 months ago

@erjieyong I am facing the same issue: with nginx it fails with the same error, but without nginx it works fine. Did you find a solution?

ayrnb commented 3 months ago

+1

carljones3000 commented 3 months ago

+1 because I also get this for longer prompts. Confirmed that there's no error if the prompt is just a few words, and that the error occurs if the prompt is about 500 words. Using Qwen2 72B Instruct on 8x3090.

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct --tensor-parallel-size 8 --max-model-len 12591 --swap 1

carljones3000 commented 3 months ago

+1 because I also get this for longer prompts. Confirmed that there's no error if the prompt is just a few words, and that the error occurs if the prompt is about 500 words. Using Qwen2 72B Instruct on 8x3090.

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct --tensor-parallel-size 8 --max-model-len 12591 --swap 1

Found the problem in my case: the request timeout was too short.
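
A sketch of what raising the client-side request timeout can look like with the openai Python client against the OpenAI-compatible server; the base URL, API key, and timeout value are placeholders:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder for the vLLM server
    api_key="EMPTY",
    timeout=600.0,  # seconds; generous enough for long prompts and generations
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2-72B-Instruct",
    messages=[{"role": "user", "content": "Summarize the following 500-word text ..."}],
)
print(resp.choices[0].message.content)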

ayrnb commented 3 months ago

Setting the timeout larger will solve it (screenshot attached).

coderchem commented 3 months ago

No, timeout=1 does not solve it.

46319943 commented 2 months ago

I am experiencing a similar issue. When the prompt is short, the response is returned normally. However, if the prompt is relatively long, the server logs the following error: INFO 08-19 11:15:47 async_llm_engine.py:181] Aborted request chat-12fa08eea52d4308ab46efbdeb7a300a., and the client displays Error: read ECONNRESET.

@carljones3000, have you found a solution to this problem? Is there a way to adjust the request timeout?

In my tests using Postman, where the timeout is disabled, the request still fails. This suggests that the timeout issue may be occurring on the server side.

46319943 commented 2 months ago

After some attempts, I found a temporary solution to the issue caused by the timeout. You can use the stream method by setting "stream": true. This allows the server to return the result continuously, preventing the request from being aborted.

For more information on using stream, refer to this example.
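
A sketch of what the workaround looks like with the openai Python client (the model name and server URL are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="casperhansen/mixtral-instruct-awq",
    messages=[{"role": "user", "content": "Write a long answer to ..."}],
    stream=True,  # tokens arrive continuously, so idle-connection timeouts are less likely to fire
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)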

rocke2020 commented 2 months ago

After some attempts, I found a temporary solution to the issue caused by the timeout. You can use the stream method by setting "stream": true. This allows the server to return the result continuously, preventing the request from being aborted.

For more information on using stream, refer to this example.

We use "stream": true and still hit this problem in vLLM 0.5.2. We updated to vLLM 0.5.4 and the same errors occur randomly, with Qwen2 7B Instruct.

TangJiakai commented 2 months ago

Still an error in 0.5.5 and 0.6.0!

cicicji commented 2 months ago

Still Error in 0.6.0

kjain14 commented 2 months ago

Also getting this error

rocke2020 commented 1 month ago

Also getting this error

Could you report your version?

kjain14 commented 1 month ago

Version is 0.6.0; lowering my parallelism and deploying on more GPUs fixed things for me.

naturomics commented 1 month ago

Same error, version 0.5.1

nikhil05k commented 1 month ago

Still seeing this issue on 0.6.2. It works on some prompts when "stream": true.

It works fine for very small prompts, but for anything else we see the "Aborted request" INFO log.

GPU: 2xV100 Tesla

Command used to start the server

VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1 VLLM_TRACE_FUNCTION=1 NCCL_DEBUG=TRACE NCCL_SHM_DISABLE=1 DISKCACHE_DIRECTORY=/data/ VLLM_CONFIG_ROOT=/data/ VLLM_CACHE_ROOT=/data/ XDG_CACHE_HOME=/data/ XDG_CONFIG_HOME=/data/ NUMBA_CACHE_DIR=/data/ OUTLINES_CACHE_DIR=/data/.outlines HF_HOME=/data/ TRANSFORMERS_CACHE=/data/ python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Nemo-Instruct-2407 --dtype=half --tensor-parallel-size 2 --max-model-len 1024 --enforce-eager --ssl-certfile $tls_cert --ssl-keyfile $tls_key --port 8080 --download-dir /data/ --trust-remote-code

Model - mistralai/Mistral-Nemo-Instruct-2407

stadlerb commented 1 month ago

We had a similar problem with long-running requests. In our case, it appears that each request is aborted almost exactly 5 minutes after it was added, e.g.:

...
INFO 10-09 11:52:56 async_llm_engine.py:208] Added request cmpl-e2fc958c1054455d9b2b6f8ce440a384-0.
...
INFO 10-09 11:57:57 async_llm_engine.py:220] Aborted request cmpl-e2fc958c1054455d9b2b6f8ce440a384-0.
...

(using the vLLM 0.5.5 Docker image)

This happens to coincide with the default torch distributed timeout, but explicitly specifying a timeout in init_process_group in vllm/distributed/parallel_state.py didn't help.
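
For reference, a sketch of what explicitly passing a timeout to init_process_group looks like (this is roughly what was tried; it did not help in this case), assuming the usual environment-variable rendezvous:

from datetime import timedelta

import torch.distributed as dist

# Requires MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE to be set in the environment.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=1),  # arbitrary value; the default depends on backend and version
)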

wfnian commented 4 weeks ago

I also encountered this issue. It occurs when there are many requests and the input length is relatively long (3000-4000 tokens).

erickrf commented 4 weeks ago

I'm having this problem even with a single request and a short prompt, if that ends up generating something above 585 tokens.

I tried different LLMs and asked them to generate a very long answer to some arbitrary question.

njhill commented 3 weeks ago

This is most likely a client-side timeout, or some proxy between your client and vLLM that has a connection timeout. The request will be cancelled, with this message printed in the logs, when the caller closes the connection. It happens for larger numbers of tokens because those are the requests that take longer.

erickrf commented 3 weeks ago

This is most likely a client-side timeout, or some proxy between your client and vLLM that has a connection timeout. The request will be cancelled, with this message printed in the logs, when the caller closes the connection. It happens for larger numbers of tokens because those are the requests that take longer.

After further investigation, I can confirm that was the issue in my case. We were running vLLM (under KServe) in one Kubernetes cluster and accessing it through LiteLLM, which ran on another cluster, and there was a policy capping cross-cluster connections at 30 seconds. Thanks for answering!

njhill commented 3 weeks ago

Closing this since it appears to have been resolved / due to external causes.

stadlerb commented 1 week ago

Even if it is an external cause, I'm not sure this should be closed, as the messages logged are unclear. Would it be hard to somehow indicate in the log message that the request was aborted because the connection was closed and not due to some internal error in vLLM?

danktec commented 1 week ago

For me, with an ALB, I changed the following and one of these things fixed it:

One of these settings made our 504s go away. Perhaps someone can tell us which one 🦙