intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

GLM4-9B inference output error (ISSUE #12146)

Open jessie-zhao opened 2 months ago

jessie-zhao commented 2 months ago

When verifying the output of the glm-4-9b-chat model with the request below, the serving side reports an error.

    curl --request POST \
      --url http://127.0.0.1:8000/v1/chat/completions \
      --header 'content-type: application/json' \
      --data '{
        "model": "glm-4-9b-chat",
        "temperature": 0.7,
        "top_p": 0.8,
        "messages": [
          {
            "role": "system",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n"
          },
          {
            "role": "user",
            "content": "你是谁"
          }
        ],
        "max_tokens": 1024,
        "repetition_penalty": 1.0
      }'
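For convenience, the same request can also be issued from Python. Below is a minimal sketch using the `requests` library (assuming the server from the startup script below is reachable at `http://127.0.0.1:8000`):

```python
# Minimal sketch: the same chat-completion request as the curl command above,
# sent with the `requests` library (assumes the ipex-llm vLLM server started
# by the script below is listening on http://127.0.0.1:8000).
import requests

payload = {
    "model": "glm-4-9b-chat",
    "temperature": 0.7,
    "top_p": 0.8,
    "messages": [
        {
            "role": "system",
            "content": "Below is an instruction that describes a task. "
                       "Write a response that appropriately completes the request.\n",
        },
        {"role": "user", "content": "你是谁"},
    ],
    "max_tokens": 1024,
    "repetition_penalty": 1.0,
}

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json=payload,
    timeout=300,
)
print(resp.status_code)
print(resp.json())
```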

Serving-side startup script:

    #!/bin/bash

    model="/llm/models/glm-4-9b-chat"
    served_model_name="glm-4-9b-chat"

    export USE_XETLA=OFF
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
    export SYCL_CACHE_PERSISTENT=1
    export TORCH_LLM_ALLREDUCE=0
    export CCL_DG2_ALLREDUCE=1

    # Tensor parallel related arguments:
    export CCL_WORKER_COUNT=1
    export FI_PROVIDER=shm
    export CCL_ATL_TRANSPORT=ofi
    export CCL_ZE_IPC_EXCHANGE=sockets
    export CCL_ATL_SHM=1

    source /opt/intel/oneapi/setvars.sh
    source /opt/intel/1ccl-wks/setvars.sh

    python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
      --served-model-name $served_model_name \
      --port 8000 \
      --model $model \
      --trust-remote-code \
      --gpu-memory-utilization 0.9 \
      --device xpu \
      --dtype float16 \
      --enforce-eager \
      --load-in-low-bit fp8 \
      --max-model-len 2048 \
      --max-num-batched-tokens 4000 \
      --tensor-parallel-size 1

Serving-side error:

    INFO:     127.0.0.1:58348 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
    ERROR:    Exception in ASGI application
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
        result = await app(  # type: ignore[func-returns-value]
      File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
        return await self.app(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 1054, in __call__
        await super().__call__(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/applications.py", line 113, in __call__
        await self.middleware_stack(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 187, in __call__
        raise exc
      File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 165, in __call__
        await self.app(scope, receive, _send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/cors.py", line 85, in __call__
        await self.app(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
        await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
        raise exc
      File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
        await app(scope, receive, sender)
      File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 715, in __call__
        await self.middleware_stack(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 735, in app
        await route.handle(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 288, in handle
        await self.app(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 76, in app
        await wrap_app_handling_exceptions(app, request)(scope, receive, send)
      File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
        raise exc
      File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
        await app(scope, receive, sender)
      File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 73, in app
        response = await f(request)
      File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 301, in app
        raw_response = await run_endpoint_function(
      File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
        return await dependant.call(**values)
      File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 191, in create_chat_completion
        generator = await openai_serving_chat.create_chat_completion(
      File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 132, in create_chat_completion
        prompt_inputs = self._tokenize_prompt_input(
      File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 291, in _tokenize_prompt_input
        return next(
      File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 314, in _tokenize_prompt_inputs
        yield self._normalize_prompt_text_to_input(
      File "/usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/entrypoints/openai/serving_engine.py", line 206, in _normalize_prompt_text_to_input
        encoded = tokenizer(prompt, add_special_tokens=add_special_tokens)
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3024, in __call__
        encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3134, in _call_one
        return self.encode_plus(
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3210, in encode_plus
        return self._encode_plus(
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils.py", line 801, in _encode_plus
        return self.prepare_for_model(
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3706, in prepare_for_model
        encoded_inputs = self.pad(
      File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 3508, in pad
        encoded_inputs = self._pad(
    TypeError: ChatGLM4Tokenizer._pad() got an unexpected keyword argument 'padding_side'
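The traceback shows the failure happening inside the tokenizer rather than in vLLM or the server: transformers' `pad()` forwards a `padding_side` keyword to `self._pad()`, and the custom `ChatGLM4Tokenizer` bundled with glm-4-9b-chat does not accept it. A minimal sketch that should reproduce the same `TypeError` outside the serving stack (assuming the model path from the startup script and the same transformers version installed in the container):

```python
# Standalone repro sketch (assumes /llm/models/glm-4-9b-chat and the same
# transformers version that the serving container uses).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/llm/models/glm-4-9b-chat", trust_remote_code=True
)

# Per the traceback, __call__ -> encode_plus -> prepare_for_model -> pad -> _pad,
# and pad() passes `padding_side` down to the custom ChatGLM4Tokenizer._pad(),
# which raises:
#   TypeError: ChatGLM4Tokenizer._pad() got an unexpected keyword argument 'padding_side'
tokenizer("你是谁", add_special_tokens=True)
```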

hzjane commented 2 months ago

Refer to this issue. It seems that transformers 4.45.0 hits this error when running GLM models. You can use transformers 4.37.0 for now:

pip install transformers==4.37.0
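After downgrading, a quick sanity check before restarting the server is to tokenize a prompt directly (a sketch, assuming the same model path as above):

```python
# Sanity check after the downgrade (assumes /llm/models/glm-4-9b-chat).
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)  # expect 4.37.0

tokenizer = AutoTokenizer.from_pretrained(
    "/llm/models/glm-4-9b-chat", trust_remote_code=True
)
# With 4.37.0, pad() does not forward `padding_side`, so the custom
# ChatGLM4Tokenizer._pad() receives only arguments it accepts.
print(tokenizer("你是谁")["input_ids"])
```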