dukelee111 opened 1 day ago
I couldn't reproduce this error. Did you encounter this issue when starting vLLM, or while benchmarking?

It's not encountered during benchmarking, and starting vLLM succeeds.
Cannot reproduce. Steps:
1. start docker container:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220

docker rm -f $CONTAINER_NAME
sudo docker run -itd \
    --net=host \
    --device=/dev/dri \
    --name=$CONTAINER_NAME \
    -v /home/intel/LLM:/llm/models/ \
    -v /home/intel/junwang:/workspace \
    -e no_proxy=localhost,127.0.0.1 \
    --shm-size="16g" \
    $DOCKER_IMAGE
```
2. start serve:
```bash
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"
export no_proxy=localhost,127.0.0.1
source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8001 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4096 \
-tp 1 \
--max-num-seqs 64
#-tp 2 #--enable-prefix-caching --enable-chunked-prefill #--tokenizer-pool-size 8 --swap-space 8
```

curl script:

```bash
curl http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen1.5-14B-Chat",
  "prompt": "San Francisco is a",
  "max_tokens": 128
}'
```
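For repeated runs it may help to keep the request body in a variable and sanity-check it locally before sending. This is a minimal sketch, assuming `python3` is available in the container for JSON validation; the endpoint, model name, and payload are copied from the curl command above:

```shell
#!/bin/bash
# Request body, copied verbatim from the curl command above.
BODY='{
  "model": "Qwen1.5-14B-Chat",
  "prompt": "San Francisco is a",
  "max_tokens": 128
}'

# Validate the JSON locally first (no server needed for this step).
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"
# → payload OK

# Send once the server from step 2 is up:
# curl http://localhost:8001/v1/completions \
#   -H "Content-Type: application/json" -d "$BODY"
```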
Result (screenshots): offline, online
Environment:
- Platform: 6548N + 1 ARC770
- Docker Image:
- serving script:
Error info:
1. Fails with compression weight SYM_INT4.
2. Tried gpu-memory-utilization from 0.65 to 0.95 in steps of 0.05; no value worked.
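The sweep described in point 2 above can be scripted. A minimal sketch, assuming GNU `seq`; the serve command itself is the one from step 2 and is elided here:

```shell
#!/bin/bash
# Try gpu-memory-utilization from 0.65 to 0.95 in steps of 0.05,
# restarting the server each time (serve command from step 2 elided).
for util in $(seq 0.65 0.05 0.95); do
    echo "retrying with --gpu-memory-utilization ${util}"
    # python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    #     ... --gpu-memory-utilization ${util} ...
done
```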
Error log:
1. Serving-side error log: