intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Failure to launch codegeex4-all-9b using vLLM #11910

Open YongZhuIntel opened 3 weeks ago

YongZhuIntel commented 3 weeks ago

We are trying to launch codegeex4-all-9b using vLLM, following the CodeGeeX4 GitHub instructions: https://github.com/THUDM/CodeGeeX4?tab=readme-ov-file#vllm

The scripts are as follows. codegeex_offline_example.py:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# CodeGeeX4-ALL-9B
# max_model_len, tp_size = 1048576, 4
# If OOM, please reduce max_model_len or increase tp_size
max_model_len, tp_size = 2048, 4
model_name = "/llm/models/codegeex4-all-9b"
prompt = [{"role": "user", "content": "Hello"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # If OOM, try using the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

codegeex_offline_example.sh:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
python codegeex_offline_example.py

When running codegeex_offline_example.sh in Docker, we got the following error:

  File "/llm/vllm/vllm/model_executor/layers/attention/backends/torch_sdpa.py", line 112, in for
ward
    output = PagedAttentionImpl.forward_decode(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm/vllm/vllm/model_executor/layers/attention/ops/paged_attn.py", line 66, in forward_d
ecode
    ops.paged_attention_v1(
RuntimeError: "paged_attention_xpu_v1_impl" not implemented for 'BFloat16'

Error log: codegeex_offline_example_error.log

gc-fu commented 3 weeks ago

Try adding torch_dtype="float16".

For instance:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    torch_dtype="float16", # adding this
    # If OOM,try using follong parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
YongZhuIntel commented 3 weeks ago

Unable to recognize torch_dtype

Traceback (most recent call last):
  File "/llm/zhuyong/vllm/codegeex_offline_example.py", line 13, in <module>
    llm = LLM(
          ^^^^
  File "/llm/vllm/vllm/entrypoints/llm.py", line 91, in __init__
    engine_args = EngineArgs(
                  ^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'torch_dtype'
gc-fu commented 3 weeks ago

Sorry, it should be dtype="float16".
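
For reference, the corrected constructor call from the earlier example would then look like this (dtype is the keyword that vLLM's LLM/EngineArgs accept):

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",  # float16 avoids the unimplemented BFloat16 paged-attention kernel
)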

Uxito-Ada commented 2 weeks ago

Hi @YongZhuIntel ,

I successfully ran codegeex4-all-9b with vLLM on a single A770 card as well as on two cards. Note that for a single card, max-model-len should be decreased to no more than 6048, which is the size of the KV cache store.
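
A minimal sketch of how that single-card setup maps onto the offline example above (assuming the same model path and combining it with the dtype fix; 2048 is just one choice below the 6048 limit):

# Single-card variant of codegeex_offline_example.py: no tensor parallelism,
# and max_model_len kept at or below 6048 (the stated KV cache capacity).
max_model_len, tp_size = 2048, 1

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,   # 1 = single A770 card
    max_model_len=max_model_len,    # must stay <= 6048 on one card
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",
)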

YongZhuIntel commented 2 weeks ago

@Uxito-Ada I ran codegeex4-all-9b with vLLM on a single card in int4 format:

model="/llm/models/codegeex4-all-9b"
served_model_name="codegeex4-all-9b"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/oneapi/setvars.sh --force
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1

but got an OOM error when running "python vllm_online_benchmark.py codegeex4-all-9b 2":

    |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 769, in f
orward
    |     result = xe_linear.forward_new(x_2d, self.weight.data,
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: Allocation is out of device memory on current platform.
Uxito-Ada commented 2 weeks ago

Hi @YongZhuIntel ,

With the script you provided, I can successfully start the vLLM server and then execute the inference request from vLLM-Serving's README.

Which version of ipex-llm is used in your environment? Please also provide the codegeex_offline_example.py content, as the request workload also influences the memory footprint.

YongZhuIntel commented 2 weeks ago

@Uxito-Ada I run vLLM on the Docker image intelanalytics/ipex-llm-serving-vllm-xpu-experiment:latest.

The vllm_online_benchmark.py: vllm_online_benchmark.py.txt

YongZhuIntel commented 2 weeks ago

INFO 08-27 09:33:39 gpu_executor.py:100] # GPU blocks: 12587, # CPU blocks: 6553

Error log: start_codegeex4-all-9b_serving_1card_int4_err.log

Uxito-Ada commented 2 weeks ago

Hi @YongZhuIntel ,

GPU memory consumption can be decreased by tuning server parameters. For example, after lowering gpu-memory-utilization from 0.95 to 0.8~0.9, I can successfully execute the workloads in vllm_online_benchmark.py with max_seq=2.
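
Applied to the serving script above, that only changes the gpu-memory-utilization flag; a sketch of the adjusted launch (0.85 is one value in the suggested 0.8~0.9 range, all other flags unchanged):

# Same launch command as in the serving script above, with only
# --gpu-memory-utilization lowered from 0.95 to 0.85.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1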