intel-analytics / ipex-llm-tutorial

Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using ipex-llm
https://github.com/intel-analytics/bigdl
Apache License 2.0

about the memory problem #64

Open K-Alex13 opened 9 months ago

K-Alex13 commented 9 months ago

(screenshot of memory usage attached) Each time I interact with the model, the memory it occupies increases and is not released. As a result, when there are many conversations, the model very easily crashes. How can I solve this problem?

hkvision commented 9 months ago

You mean when you chat with the model, the memory keeps increasing but doesn't decrease after the chat finishes?

Could you provide more details? e.g. which model you are using, and any specific code for us to reproduce this?

K-Alex13 commented 9 months ago

The details I can provide: I do not put the embedding on the CPU, and I use the Baichuan2 model. The main problem is that the memory is not released. The model initialization code is:

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True).bfloat16().eval()
model = model.to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

The chat code (just using the original Baichuan code):

response = model.chat(tokenizer, content, stream=True)
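One thing that can help keep XPU memory from accumulating across turns is to free the allocator's cache after consuming each streamed response, as the reproduction script further down also does. A minimal sketch, assuming the Baichuan2 `model.chat` streaming API and the `model`/`tokenizer` initialized above (`torch.xpu.empty_cache()` is assumed to behave like its CUDA counterpart):

```python
import torch

# Messages in the Baichuan2 chat format (illustrative content).
content = [{"role": "user", "content": "你好"}]

position = 0
for response in model.chat(tokenizer, content, stream=True):
    # Print only the newly generated part of the partial response.
    print(response[position:], end="", flush=True)
    position = len(response)

# After the turn finishes, release cached XPU memory held by the allocator.
torch.xpu.empty_cache()
```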

Ariadne330 commented 8 months ago

I cannot reproduce your problem on a Windows 11 system. The memory used by the CPU stays quite stable as the chat stream goes on. Here are my steps:

HW & OS: 13th Gen Intel(R) Core(TM) i9-13900K; Intel(R) Arc(TM) A770 Graphics; Windows 11
Test env: bigdl-llm 2.5.0b20231222
Note: All the results were tested without CPU embedding (which could otherwise cause more CPU usage); a sketch of enabling it is shown below.
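For reference, a minimal sketch of loading the model with the embedding kept on CPU instead. It assumes the `cpu_embedding` flag of bigdl-llm's `from_pretrained` is available in this version; it trades XPU memory for more CPU usage, and whether it changes the behaviour reported here is an assumption:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = r"D:\llm-models\Baichuan2-13B-Chat"

# Keep the embedding layer on CPU; the rest of the model is still loaded
# with 4-bit optimizations and moved to the XPU as before.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True,
                                             cpu_embedding=True).bfloat16().eval()
model = model.to('xpu')
```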

Test codes

I verified the issue based on the code provided in the Baichuan2-13B-Chat repo.

from bigdl.llm.transformers import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex

import os
import platform
import subprocess
from colorama import Fore, Style
from tempfile import NamedTemporaryFile

model_path = "D:\llm-models\Baichuan2-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             optimize_model=True).bfloat16().eval()

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)

model = model.to('xpu')

messages = []

while True:
    prompt = input(Fore.GREEN + Style.BRIGHT + "\n用户:" + Style.NORMAL)
    if prompt.strip() == "exit":
        break
    print(Fore.CYAN + Style.BRIGHT + "\nBaichuan 2:" + Style.NORMAL, end='')

    messages.append({"role": "user", "content": prompt})
    position = 0
    try:
        for response in model.chat(tokenizer, messages, stream=True):
            print(response[position:], end='', flush=True)
            position = len(response)
            torch.xpu.empty_cache()  # free cached XPU memory after each streamed chunk
    except KeyboardInterrupt:
        pass
    print()
    messages.append({"role": "assistant", "content": response})

Test results

I chatted ten rounds with the model, appending the history in the chat API, and didn't notice the allocated memory increasing. Memory increases quickly while the model is loading but stays at a relatively stable level (from about 40 s onward) during the chat stage. (memory usage plot attached)

Here's my PowerShell script for memory capture and the Python script for plotting memory usage.

PowerShell script for memory capture
```shell
while($true) {
    Get-Process | Measure-Object -Property WS -Sum | ForEach-Object { "Total Memory Usage: $($_.Sum / 1MB) MB" } | Out-File test.log -Append
    Start-Sleep -Milliseconds 10
}
```
Python script for plotting the results
```python
import matplotlib.pyplot as plt

data = []
with open('./test.log', 'r', encoding="utf-16") as file:
    for line in file.readlines()[:-2]:
        # Each log line looks like "Total Memory Usage: 12345.6 MB"; field 3 is the number.
        mem = line.split()[3]
        data.append(float(mem))

x = [i for i in range(len(data))]
plt.plot(x, data, linestyle='-')
plt.xlabel('Time')
plt.ylabel('Used/MB')
plt.title('Used Memory Over Time')
plt.grid(True)
plt.ylim(min(data) - 100, max(data) + 2000)
plt.savefig('memory_usage_plot_load.png')
```

And GPU memory is at a stable level too.

(GPU memory usage plot)
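To double-check this on your side, here's a minimal sketch for logging XPU memory per chat round. It assumes intel_extension_for_pytorch exposes the CUDA-style `torch.xpu.memory_allocated()` / `torch.xpu.memory_reserved()` counters:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

def log_xpu_memory(tag=""):
    # Both counters are reported in bytes; convert to MB for readability.
    allocated = torch.xpu.memory_allocated() / 1024 ** 2
    reserved = torch.xpu.memory_reserved() / 1024 ** 2
    print(f"[{tag}] XPU allocated: {allocated:.1f} MB, reserved: {reserved:.1f} MB")

# Example: call once after each chat round to see whether the numbers keep growing.
# log_xpu_memory(tag="round 1")
```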