intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4). #12087

Open dukelee111 opened 1 day ago

dukelee111 commented 1 day ago

Environment:

- Platform: 6548N + 1x ARC770
- Docker image: (image)
- Serving script: (image)

Error info:

1. Fails with compression weight SYM_INT4.
2. Tried the "gpu-memory-utilization" parameter from 0.65 to 0.95 in steps of 0.05; none of the values worked.

Error log:

1. Serving-side error log: (image)
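
For reference, the sweep described above could be scripted roughly as follows (a minimal sketch; `start_vllm.sh` and `run_benchmark.sh` are hypothetical stand-ins for the serving and benchmark scripts that are only shown as screenshots here):

```bash
#!/bin/bash
# Hypothetical sweep of --gpu-memory-utilization from 0.65 to 0.95 in 0.05 steps.
# start_vllm.sh / run_benchmark.sh are placeholders for the screenshot-only scripts above.
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
  echo ">>> gpu-memory-utilization=$util"
  ./start_vllm.sh "$util" &            # launch the vLLM API server with this utilization value
  SERVER_PID=$!
  sleep 180                            # crude wait for the 14B model to load; adjust as needed
  ./run_benchmark.sh || echo "benchmark failed at utilization $util"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
```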

hzjane commented 1 day ago

I couldn't reproduce this error. Did you encounter this issue when you started vLLM, or when you ran the benchmark?

dukelee111 commented 1 day ago

It's encountered when running the benchmark; starting vLLM succeeds.
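
In case it helps with reproduction, benchmark-like load can be approximated by firing a handful of concurrent requests at the OpenAI-compatible endpoint (a minimal sketch; port 8001 and the served model name match the serving script later in this thread, while the concurrency of 8 is an arbitrary assumption, not the actual benchmark setting):

```bash
#!/bin/bash
# Send 8 concurrent completion requests to approximate benchmark concurrency.
# Port 8001 and the model name match the serving script below; concurrency is arbitrary.
for i in $(seq 1 8); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
    > "response_$i.json" &
done
wait
```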

ACupofAir commented 1 day ago

Cannot reproduce. Steps:

1. start docker:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220

docker rm -f $CONTAINER_NAME
sudo docker run -itd \
  --net=host \
  --device=/dev/dri \
  --name=$CONTAINER_NAME \
  -v /home/intel/LLM:/llm/models/ \
  -v /home/intel/junwang:/workspace \
  -e no_proxy=localhost,127.0.0.1 \
  --shm-size="16g" \
  $DOCKER_IMAGE
```

2. start serve:
```bash
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"

export no_proxy=localhost,127.0.0.1

source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8001 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  -tp 1 \
  --max-num-seqs 64
  # -tp 2  # --enable-prefix-caching --enable-chunked-prefill  # --tokenizer-pool-size 8 --swap-space 8
```

  1. curl script:

    curl http://localhost:8001/v1/completions                 -H "Content-Type: application/json"             -d '{
                  "model": "Qwen1.5-14B-Chat",
                  "prompt": "San Francisco is a",
                  "max_tokens": 128
    }'
4. result

    1. offline: (image)

    2. online: (image)