Error: 'Not enough data for satisfy transfer length header.' during vllm serving on Arc

Scripts and Params

# start server
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server --served-model-name qwen-32b --port 8000 --model /opt/Qwen1.5-32B-Chat --trust-remote-code --gpu-memory-utilization 0.8 --device xpu --dtype float16 --enforce-eager --load-in-low-bit sym_int4 --max-model-len 9408 --max-num-batched-tokens 9408 --tensor-parallel-size 4

# start client
python benchmark_serving.py --model qwen-32b --tokenizer /opt/Qwen1.5-32B-Chat --random-input-len 7408 --random-output-len 900 --num-prompts 1 --dataset-name random

Server API Error

ERROR 08-16 18:23:54 async_llm_engine.py:470] Engine iteration timed out. This should never happen!
ERROR 08-16 18:23:54 async_llm_engine.py:41] Engine background task failed
ERROR 08-16 18:23:54 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
ERROR 08-16 18:23:54 async_llm_engine.py:41]     return fut.result()
ERROR 08-16 18:23:54 async_llm_engine.py:41]            ^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 441, in engine_step
ERROR 08-16 18:23:54 async_llm_engine.py:41]     request_outputs = await self.engine.step_async()
ERROR 08-16 18:23:54 async_llm_engine.py:41]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 211, in step_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     output = await self.model_executor.execute_model_async(
ERROR 08-16 18:23:54 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     all_outputs = await self._run_workers_async(
ERROR 08-16 18:23:54 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     all_outputs = await asyncio.gather(*coros)
ERROR 08-16 18:23:54 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41] asyncio.exceptions.CancelledError
ERROR 08-16 18:23:54 async_llm_engine.py:41]
ERROR 08-16 18:23:54 async_llm_engine.py:41] The above exception was the direct cause of the following exception:
ERROR 08-16 18:23:54 async_llm_engine.py:41]
ERROR 08-16 18:23:54 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
ERROR 08-16 18:23:54 async_llm_engine.py:41]     task.result()
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
ERROR 08-16 18:23:54 async_llm_engine.py:41]     has_requests_in_progress = await asyncio.wait_for(
ERROR 08-16 18:23:54 async_llm_engine.py:41]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
ERROR 08-16 18:23:54 async_llm_engine.py:41]     raise exceptions.TimeoutError() from exc
ERROR 08-16 18:23:54 async_llm_engine.py:41] TimeoutError
2024-08-16 18:23:54,093 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f42b3835b20>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f42aa1a07d0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f42b3835b20>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f42aa1a07d0>>)>

Client side error log

ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/client_proto.py", line 94, in connection_lost
    uncompleted = self._parser.feed_eof()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 507, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
  Not enough data for satisfy transfer length header.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/llm/backend_request_func.py", line 256, in async_request_openai_completions
    async for chunk_bytes in response.content:
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 50, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 317, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 351, in readuntil
    await self._wait("readuntil")
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 312, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>

intel-analytics / ipex-llm

Error: 'Not enough data for satisfy transfer length header.' during vllm serving on Arc #11826

Scripts and Params

Server API Error

Client side error log