Open K-Alex13 opened 11 months ago
You mean when you chat with the model, the memory keeps increasing but doesn't decrease after the chat finishes?
Could you provide more details? e.g. which model you are using, and any specific code for us to reproduce this?
The detail I can provide is that I do not move the embedding to CPU and I use the Baichuan2 model. The main problem is that the memory is not released. Following is the model initialization code:
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True).bfloat16().eval()
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
Chat code (just using the original Baichuan code):
response = model.chat(tokenizer, content, stream=True)
I cannot reproduce your problem on a Windows 11 system. The memory used by the CPU stays quite stable as the chat stream goes on. Here are my steps:
HW & OS: 13th Gen Intel(R) Core(TM) i9-13900K; Intel(R) Arc(TM) A770 Graphics; Windows 11
Test env: bigdl-llm 2.5.0b20231222
Note: All the results were tested without CPU embedding (which may cause more CPU usage).
I verified the issue based on the code provided in the Baichuan2-13B-Chat repo:
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch
import intel_extension_for_pytorch as ipex
from colorama import Fore, Style

# raw string avoids accidental escape sequences in the Windows path
model_path = r"D:\llm-models\Baichuan2-13B-Chat"

# load the model in 4-bit and move it to the Intel GPU (XPU)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True).bfloat16().eval()
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
model = model.to('xpu')

messages = []
while True:
    prompt = input(Fore.GREEN + Style.BRIGHT + "\n用户:" + Style.NORMAL)
    if prompt.strip() == "exit":
        break
    print(Fore.CYAN + Style.BRIGHT + "\nBaichuan 2:" + Style.NORMAL, end='')
    messages.append({"role": "user", "content": prompt})
    position = 0
    try:
        # stream the reply and print only the newly generated part each step
        for response in model.chat(tokenizer, messages, stream=True):
            print(response[position:], end='', flush=True)
            position = len(response)
        # return cached XPU memory blocks to the allocator after each round
        torch.xpu.empty_cache()
    except KeyboardInterrupt:
        pass
    print()
    messages.append({"role": "assistant", "content": response})
I chatted ten rounds with the model, appending the history through the chat API, and didn't notice the allocated memory increasing. The memory increases quickly while the model is loading, but stays at a relatively stable level (from about 40 s onward) during the chatting stage.
Here's my code for memory capture and the Python script for the memory usage plot.
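The capture script itself isn't pasted above, but a minimal sketch of the idea (assuming psutil and matplotlib are available; the sampling duration, interval, and output file name are placeholders) could look like this:

import time
import psutil
import matplotlib.pyplot as plt

def sample_memory(duration_s=600, interval_s=1.0, out_png="cpu_mem.png"):
    """Sample system memory usage once per interval and plot it over time."""
    timestamps, used_gb = [], []
    start = time.time()
    while time.time() - start < duration_s:
        mem = psutil.virtual_memory()
        timestamps.append(time.time() - start)
        used_gb.append(mem.used / 1024 ** 3)
        time.sleep(interval_s)
    plt.plot(timestamps, used_gb)
    plt.xlabel("time (s)")
    plt.ylabel("used memory (GB)")
    plt.savefig(out_png)

if __name__ == "__main__":
    sample_memory()

Run it in a separate process alongside the chat loop so the sampling itself doesn't perturb the measurement.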
And GPU memory stays at a stable level too.
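If you want to check the XPU side yourself, intel_extension_for_pytorch exposes counters analogous to the CUDA ones; this is a sketch, so verify the exact function names against your IPEX version:

import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

def log_xpu_memory(tag=""):
    # allocated: memory currently held by live tensors
    # reserved: memory held by the caching allocator (including free cached blocks)
    allocated_gb = torch.xpu.memory_allocated() / 1024 ** 3
    reserved_gb = torch.xpu.memory_reserved() / 1024 ** 3
    print(f"[{tag}] XPU allocated: {allocated_gb:.2f} GB, reserved: {reserved_gb:.2f} GB")

Calling it before and after each chat round makes it easy to see whether memory really grows round over round or is just being cached by the allocator.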
Each time I interact with the model, the memory occupied by the model increases and is not released. As a result, when there are many conversations, it is very easy for the model to crash. How can I solve this problem?
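Not an official answer from the maintainers, but one thing worth trying, sketched from the reproduction code above (the history cap is an arbitrary placeholder): explicitly free cached XPU blocks, run Python garbage collection, and trim the conversation history after each round, since an ever-growing messages list also grows the prompt and the KV cache.

import gc
import torch

MAX_HISTORY_MESSAGES = 20  # placeholder cap; tune for your use case

def end_of_round_cleanup(messages):
    # drop the oldest turns so the prompt (and KV cache) stops growing without bound
    if len(messages) > MAX_HISTORY_MESSAGES:
        del messages[:len(messages) - MAX_HISTORY_MESSAGES]
    # release Python-side references, then return cached device blocks to the allocator
    gc.collect()
    torch.xpu.empty_cache()

Call it once per round, e.g. right after messages.append({"role": "assistant", "content": response}) in the loop above, and compare the memory curves with and without it.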