intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Running ChatGLM3-6B on A380 with BigDL, it hangs all the time #9814

Open dlod-openvino opened 8 months ago

dlod-openvino commented 8 months ago

OS: Win10 22H2 19045.3803, Python 3.9; environment installed according to https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html
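For reference, a minimal install sketch roughly following that guide (the environment name, Python pin, and wheel index are assumptions based on the linked page, which remains the authoritative source):

conda create -n llm python=3.9 libuv
conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu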

Test code:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode the generated tokens and print them
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

Run the Python script with:

call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_CACHE_PERSISTENT=1
python chatglm3_infer_gpu.py

The code hangs indefinitely (screenshot attached).

When I modify the code to run on the CPU, it works. Test code:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode the generated tokens and print them
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

(Screenshot of the successful CPU run attached.)

qiuxin2012 commented 8 months ago

The A380's 6 GB of memory is not enough to run chatglm3-6b at the moment.
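As a rough back-of-the-envelope illustration of why 6 GB is tight (a sketch only; the parameter count and model dimensions are assumptions taken from the public ChatGLM3-6B configuration, not measured here):

# Rough memory estimate for INT4 ChatGLM3-6B (assumed figures, not measurements)
params = 6.2e9                                  # approximate parameter count
int4_weights_gb = params * 0.5 / 1e9            # 4-bit weights: ~3.1 GB
vocab, hidden = 65024, 4096                     # ChatGLM3 vocabulary size / hidden size
fp16_embed_gb = 2 * vocab * hidden * 2 / 1e9    # input + output embeddings kept in fp16: ~1.1 GB
print(f"weights ~{int4_weights_gb:.1f} GB, embeddings ~{fp16_embed_gb:.1f} GB")
# On top of that come the KV cache, activations, and runtime buffers,
# which leaves very little headroom on a 6 GB card.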

qiuxin2012 commented 8 months ago

You can try adding the parameter cpu_embedding=True to AutoModel.from_pretrained and then try again on the A380. You may need to wait about 10-20 minutes.
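A minimal sketch of the suggested change, applied to the original script (only the from_pretrained call differs; cpu_embedding=True keeps the embedding layer in host memory instead of on the GPU):

from bigdl.llm.transformers import AutoModel

model_path = "d:/chatglm3-6b"

# Load with INT4 weights, but keep the embedding layer on the CPU so that
# the quantized weights and KV cache fit more comfortably in the A380's 6 GB.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  cpu_embedding=True,
                                  trust_remote_code=True)
model = model.to('xpu')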

dlod-openvino commented 8 months ago

The A380's 6 GB of memory is not enough to run chatglm3-6b at the moment.

As shown in the screenshot, after loading ChatGLM3-6B into the A380's memory it reports 4.4 GB in use, so is the A380's 6 GB really not enough? Many low-power dGPUs like the A380 have only 6 GB of memory, and supporting them matters for edge deployments, e.g. LLM + robot applications.

openvino-book commented 8 months ago

ChatGLM3-6B now runs successfully on the A380; screenshots of the output and the test platform are attached.

qiuxin2012 commented 8 months ago

We have run it successfully on our A380, too. Please make sure you have set SYCL_CACHE_PERSISTENT=1, otherwise kernel compilation will take about 7 minutes on every run. With this environment variable set, compilation only happens the first time, and subsequent runs will be very fast.
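For example, a sketch of making the setting stick across terminal sessions on Windows (using setx is an assumption; the remaining lines mirror the original run script):

:: Persist SYCL_CACHE_PERSISTENT for future terminals; the current session still needs `set`.
setx SYCL_CACHE_PERSISTENT 1
set SYCL_CACHE_PERSISTENT=1
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
python chatglm3_infer_gpu.py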

openvino-book commented 8 months ago

Quantize and deploy the ChatGLM3-6B model on an Intel discrete GPU in three steps