intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Running vLLM service benchmark (1x ARC770) with Qwen1.5-14B-Chat model failed (compression weight: SYM_INT4). #12087

Open dukelee111 opened 1 day ago

dukelee111 commented 1 day ago

Environment:

- Platform: 6548N + 1x ARC770
- Docker image: (image)
- Serving script: (image)

Error info:

1. Fails with compression weight SYM_INT4.
2. Tried the "gpu-memory-utilization" parameter from 0.65 to 0.95 in steps of 0.05; none of the values worked.

Error log:

1. Serving-side error log: (image)
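
For reference, the sweep described above could be scripted roughly as follows (a minimal sketch; `start_vllm.sh` and `run_benchmark.sh` are hypothetical stand-ins for the serving and benchmark scripts that are only shown as screenshots here):

```bash
#!/bin/bash
# Hypothetical sweep of --gpu-memory-utilization from 0.65 to 0.95 in 0.05 steps.
# start_vllm.sh / run_benchmark.sh are placeholders for the screenshot-only scripts above.
for util in 0.65 0.70 0.75 0.80 0.85 0.90 0.95; do
  echo ">>> gpu-memory-utilization=$util"
  ./start_vllm.sh "$util" &            # launch the vLLM API server with this utilization value
  SERVER_PID=$!
  sleep 180                            # crude wait for the 14B model to load; adjust as needed
  ./run_benchmark.sh || echo "benchmark failed at utilization $util"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done
```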

hzjane commented 1 day ago

I couldn't reproduce this error. Did you encounter this issue when you started vLLM, or when you ran the benchmark?

dukelee111 commented 1 day ago

It's encountered when running the benchmark; starting vLLM succeeds.
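
In case it helps with reproduction, benchmark-like load can be approximated by firing a handful of concurrent requests at the OpenAI-compatible endpoint (a minimal sketch; port 8001 and the served model name match the serving script later in this thread, while the concurrency of 8 is an arbitrary assumption, not the actual benchmark setting):

```bash
#!/bin/bash
# Send 8 concurrent completion requests to approximate benchmark concurrency.
# Port 8001 and the model name match the serving script below; concurrency is arbitrary.
for i in $(seq 1 8); do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen1.5-14B-Chat", "prompt": "San Francisco is a", "max_tokens": 128}' \
    > "response_$i.json" &
done
wait
```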

ACupofAir commented 1 day ago

Cannot reproduce. Steps:

1. start docker:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
export CONTAINER_NAME=junwang-vllm54-issue220

docker rm -f $CONTAINER_NAME
sudo docker run -itd \
  --net=host \
  --device=/dev/dri \
  --name=$CONTAINER_NAME \
  -v /home/intel/LLM:/llm/models/ \
  -v /home/intel/junwang:/workspace \
  -e no_proxy=localhost,127.0.0.1 \
  --shm-size="16g" \
  $DOCKER_IMAGE
```

2. start serve:
```bash
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat/"
served_model_name="Qwen1.5-14B-Chat"

export no_proxy=localhost,127.0.0.1

source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8001 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  -tp 1 \
  --max-num-seqs 64
  # -tp 2  # --enable-prefix-caching --enable-chunked-prefill  # --tokenizer-pool-size 8 --swap-space 8
```

  1. curl script:

    curl http://localhost:8001/v1/completions                 -H "Content-Type: application/json"             -d '{
                  "model": "Qwen1.5-14B-Chat",
                  "prompt": "San Francisco is a",
                  "max_tokens": 128
    }'
4. result

    1. offline: (image)

    2. online: (image)