intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Error: 'Not enough data for satisfy transfer length header.' during vllm serving on Arc #11826

Open aprilhu01 opened 3 weeks ago

aprilhu01 commented 3 weeks ago

Scripts and Params

# start server
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server --served-model-name qwen-32b --port 8000 --model /opt/Qwen1.5-32B-Chat --trust-remote-code --gpu-memory-utilization 0.8 --device xpu --dtype float16 --enforce-eager --load-in-low-bit sym_int4 --max-model-len 9408 --max-num-batched-tokens 9408 --tensor-parallel-size 4

# start client
python benchmark_serving.py --model qwen-32b --tokenizer /opt/Qwen1.5-32B-Chat --random-input-len 7408 --random-output-len 900 --num-prompts 1 --dataset-name random

Server API Error

ERROR 08-16 18:23:54 async_llm_engine.py:470] Engine iteration timed out. This should never happen!
ERROR 08-16 18:23:54 async_llm_engine.py:41] Engine background task failed
ERROR 08-16 18:23:54 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
ERROR 08-16 18:23:54 async_llm_engine.py:41]     return fut.result()
ERROR 08-16 18:23:54 async_llm_engine.py:41]            ^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 441, in engine_step
ERROR 08-16 18:23:54 async_llm_engine.py:41]     request_outputs = await self.engine.step_async()
ERROR 08-16 18:23:54 async_llm_engine.py:41]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 211, in step_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     output = await self.model_executor.execute_model_async(
ERROR 08-16 18:23:54 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     all_outputs = await self._run_workers_async(
ERROR 08-16 18:23:54 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
ERROR 08-16 18:23:54 async_llm_engine.py:41]     all_outputs = await asyncio.gather(*coros)
ERROR 08-16 18:23:54 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41] asyncio.exceptions.CancelledError
ERROR 08-16 18:23:54 async_llm_engine.py:41]
ERROR 08-16 18:23:54 async_llm_engine.py:41] The above exception was the direct cause of the following exception:
ERROR 08-16 18:23:54 async_llm_engine.py:41]
ERROR 08-16 18:23:54 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
ERROR 08-16 18:23:54 async_llm_engine.py:41]     task.result()
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/llm/vllm/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
ERROR 08-16 18:23:54 async_llm_engine.py:41]     has_requests_in_progress = await asyncio.wait_for(
ERROR 08-16 18:23:54 async_llm_engine.py:41]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 18:23:54 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
ERROR 08-16 18:23:54 async_llm_engine.py:41]     raise exceptions.TimeoutError() from exc
ERROR 08-16 18:23:54 async_llm_engine.py:41] TimeoutError
2024-08-16 18:23:54,093 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f42b3835b20>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f42aa1a07d0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f42b3835b20>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f42aa1a07d0>>)>

Client side error log

ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/client_proto.py", line 94, in connection_lost
    uncompleted = self._parser.feed_eof()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 507, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
  Not enough data for satisfy transfer length header.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/llm/backend_request_func.py", line 256, in async_request_openai_completions
    async for chunk_bytes in response.content:
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 50, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 317, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 351, in readuntil
    await self._wait("readuntil")
  File "/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py", line 312, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>
gc-fu commented 3 weeks ago

This problem is due to OOM. Try reducing `--gpu-memory-utilization` or `--max-num-batched-tokens`.
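As a concrete illustration of this suggestion, the server could be relaunched with a lower memory-utilization cap and a smaller batched-token budget. The specific values below (0.7 and 8320) are assumptions to be tuned for the actual Arc setup, not verified settings:

```shell
# Hypothetical relaunch with reduced GPU memory pressure:
# --gpu-memory-utilization lowered from 0.8 to 0.7, and
# --max-num-batched-tokens reduced below the original 9408.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name qwen-32b --port 8000 \
  --model /opt/Qwen1.5-32B-Chat --trust-remote-code \
  --gpu-memory-utilization 0.7 --device xpu --dtype float16 \
  --enforce-eager --load-in-low-bit sym_int4 \
  --max-model-len 9408 --max-num-batched-tokens 8320 \
  --tensor-parallel-size 4
```

Note that `--max-num-batched-tokens` should stay at or above the longest expected request; the benchmark above uses 7408 input + 900 output = 8308 tokens, so 8320 still accommodates it.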

aprilhu01 commented 3 weeks ago

Thanks. It is indeed due to OOM. I will apply the "enable split_qkv" patch and try again.