YongZhuIntel opened this issue 3 months ago
Try adding torch_dtype="float16".
For instance:
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    torch_dtype="float16",  # adding this
    # If OOM, try the following parameters:
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192,
)
vLLM is unable to recognize torch_dtype:
Traceback (most recent call last):
  File "/llm/zhuyong/vllm/codegeex_offline_example.py", line 13, in <module>
    llm = LLM(
    ^^^^
  File "/llm/vllm/vllm/entrypoints/llm.py", line 91, in __init__
    engine_args = EngineArgs(
                  ^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'torch_dtype'
Sorry, it is dtype="float16".
Hi @YongZhuIntel,
I successfully ran codegeex4-all-9b with vLLM on a single card or two cards of A770. Note that for a single card, max-model-len should be decreased to no more than 6048, which is the size of the KV cache store.
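For example, a single-card offline configuration along these lines should fit (a sketch only; variable names follow the earlier snippet, and 6048 is the cap observed on one A770):
llm = LLM(
    model=model_name,
    tensor_parallel_size=1,  # single card
    max_model_len=6048,      # must not exceed what the KV cache can hold on one card
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",         # note: dtype, not torch_dtype
)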
@Uxito-Ada I ran codegeex4-all-9b with vLLM on a single card in int4 format:
model="/llm/models/codegeex4-all-9b"
served_model_name="codegeex4-all-9b"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/oneapi/setvars.sh --force
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--tensor-parallel-size 1
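For reference, the server exposes vLLM's OpenAI-compatible API, so a single request like the following can be used as a quick check before benchmarking (a sketch; port and model name match the script above):
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "codegeex4-all-9b",
        "prompt": "# write a quick sort in python\n",
        "max_tokens": 128,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])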
but I got an OOM error when running "python vllm_online_benchmark.py codegeex4-all-9b 2":
| File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 769, in f
orward
| result = xe_linear.forward_new(x_2d, self.weight.data,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| RuntimeError: Allocation is out of device memory on current platform.
Hi @YongZhuIntel,
With the script you provided, I can successfully start the vLLM server and then execute the inference request from vLLM-Serving's README.
Which version of ipex-llm is used in your environment? Please also provide the content of codegeex_offline_example.py, as request workloads also influence the memory footprint.
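The installed version can be printed with the standard library, for example (a quick sketch, assuming the package was installed under the name ipex-llm):
import importlib.metadata
print(importlib.metadata.version("ipex-llm"))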
@Uxito-Ada I run vllm on the docker image: intelanalytics/ipex-llm-serving-vllm-xpu-experiment:latest
The vllm_online_benchmark.py: vllm_online_benchmark.py.txt
INFO 08-27 09:33:39 gpu_executor.py:100] # GPU blocks: 12587, # CPU blocks: 6553
Error log: start_codegeex4-all-9b_serving_1card_int4_err.log
Hi @YongZhuIntel,
GPU memory consumption can be decreased by tuning the server parameters. For example, after lowering gpu-memory-utilization from 0.95 to 0.8~0.9, I can successfully execute the workloads in vllm_online_benchmark.py with max_seq=2.
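The same knob is available in the offline LLM API as the gpu_memory_utilization argument, for example (a sketch with illustrative values; the remaining arguments follow the earlier snippet):
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",
    max_model_len=2048,
    gpu_memory_utilization=0.85,  # leave headroom instead of the 0.95 used for the server
)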
We are trying to launch codegeex4-all-9b using vLLM following the CodeGeeX4 github: https://github.com/THUDM/CodeGeeX4?tab=readme-ov-file#vllm
The scripts are as follows:
codegeex_offline_example.py
codegeex_offline_example.sh
When running codegeex_offline_example.sh in docker we got an error:
error log: codegeex_offline_example_error.log
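For context, an offline example of the kind described in that README looks roughly like the following (a sketch only; the model path, prompt, and sampling settings are illustrative, and the attached codegeex_offline_example.py may differ):
from vllm import LLM, SamplingParams

model_name = "/llm/models/codegeex4-all-9b"  # illustrative local path
tp_size = 1
max_model_len = 2048

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16",
)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["# write a quick sort in python\n"], sampling_params)
print(outputs[0].outputs[0].text)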