intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

failure to launch codegeex4-all-9b using vLLM #11910

Open YongZhuIntel opened 3 months ago

YongZhuIntel commented 3 months ago

We are trying to launch codegeex4-all-9b using vLLM, following the CodeGeeX4 GitHub README: https://github.com/THUDM/CodeGeeX4?tab=readme-ov-file#vllm

The scripts are as follows: codegeex_offline_example.py:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# CodeGeeX4-ALL-9B
# max_model_len, tp_size = 1048576, 4
# If OOM, please reduce max_model_len or increase tp_size
max_model_len, tp_size = 2048, 4
model_name = "/llm/models/codegeex4-all-9b"
prompt = [{"role": "user", "content": "Hello"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # If OOM, try using the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

codegeex_offline_example.sh

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
python codegeex_offline_example.py

When running codegeex_offline_example.sh in Docker, we got the following error:

  File "/llm/vllm/vllm/model_executor/layers/attention/backends/torch_sdpa.py", line 112, in for
ward
    output = PagedAttentionImpl.forward_decode(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm/vllm/vllm/model_executor/layers/attention/ops/paged_attn.py", line 66, in forward_d
ecode
    ops.paged_attention_v1(
RuntimeError: "paged_attention_xpu_v1_impl" not implemented for 'BFloat16'

Error log: codegeex_offline_example_error.log

gc-fu commented 3 months ago

Try adding torch_dtype="float16".

For instance:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    torch_dtype="float16", # adding this
    # If OOM, try using the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
YongZhuIntel commented 3 months ago

Unable to recognize torch_dtype

Traceback (most recent call last):
  File "/llm/zhuyong/vllm/codegeex_offline_example.py", line 13, in <module>
    llm = LLM(
          ^^^^
  File "/llm/vllm/vllm/entrypoints/llm.py", line 91, in __init__
    engine_args = EngineArgs(
                  ^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'torch_dtype'
gc-fu commented 3 months ago

Sorry, it is dtype="float16".
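
For reference, a minimal sketch of the corrected call, assuming the same model path and settings as in the original script:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",  # vLLM's engine arguments expect `dtype`, not `torch_dtype`
)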

Uxito-Ada commented 3 months ago

Hi @YongZhuIntel ,

I successfully ran codegeex4-all-9b with vLLM on a single A770 card as well as on two cards. Note that for a single card, max-model-len should be decreased to no more than 6048, which is the size of the KV cache store.
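
For illustration, a hypothetical single-card variant of the offline script above; tp_size=1, and the max_model_len value is only an example kept below the 6048 limit:

max_model_len, tp_size = 4096, 1  # keep max_model_len <= 6048 on a single card
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",
)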

YongZhuIntel commented 3 months ago

@Uxito-Ada I ran codegeex4-all-9b with vLLM on a single card in int4 format:

model="/llm/models/codegeex4-all-9b"
served_model_name="codegeex4-all-9b"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/oneapi/setvars.sh --force
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1

but got an OOM error when running "python vllm_online_benchmark.py codegeex4-all-9b 2":

    |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 769, in forward
    |     result = xe_linear.forward_new(x_2d, self.weight.data,
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: Allocation is out of device memory on current platform.
Uxito-Ada commented 3 months ago

Hi @YongZhuIntel ,

With the script you provided, I can successfully start the vLLM server and then execute the inference request from vLLM-Serving's README.

What version of ipex-llm is used in your environment? Please also provide the codegeex_offline_example.py content, as request workloads also influence the memory footprint.
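
For instance, the installed version can be checked inside the container with a standard pip command (assuming ipex-llm was installed via pip):

pip show ipex-llm | grep -i version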

YongZhuIntel commented 3 months ago

@Uxito-Ada I run vLLM on the Docker image: intelanalytics/ipex-llm-serving-vllm-xpu-experiment:latest

The vllm_online_benchmark.py: vllm_online_benchmark.py.txt

YongZhuIntel commented 3 months ago

INFO 08-27 09:33:39 gpu_executor.py:100] # GPU blocks: 12587, # CPU blocks: 6553

Error log: start_codegeex4-all-9b_serving_1card_int4_err.log

Uxito-Ada commented 3 months ago

Hi @YongZhuIntel ,

GPU memory consumption can be decreased by tuning server parameters. For example, after lowering gpu-memory-utilization from 0.95 to 0.8~0.9, I can successfully execute the workloads in vllm_online_benchmark.py with max_seq=2.
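
For example, the launch command above could be adjusted as follows; 0.85 is an illustrative value within the suggested 0.8~0.9 range, and all other flags are unchanged from the earlier script:

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1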