intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Can't release memory via del model after model.generate() #11394

Open ganghe opened 2 weeks ago

ganghe commented 2 weeks ago

Hi team,

I want to release the related memory by deleting the model variable after model.generate(), but it does not work as I expect. The demo code is as follows:

```python
import torch
import time
import numpy as np

import intel_extension_for_pytorch as ipex

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "./baichuan2_model/baichuan-inc/Baichuan2-7B-Chat"

# Load model in 4 bit,
# which converts the relevant layers in the model into INT4 format.
# When running LLMs on Intel iGPUs for Windows users, we recommend setting
# `cpu_embedding=True` in the from_pretrained function.
# This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             use_cache=True)
model = model.half().to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "北京有哪些景点?"  # "What are the tourist attractions in Beijing?"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# The ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids, max_new_tokens=32)
torch.xpu.synchronize()
output = output.cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)

# Try to release the model's memory
input("please input enter to del model:")
model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
import gc
gc.collect()
input("please input enter to exit:")
```

I can see that the memory usage of the Python process is still there until the process exits. My environment is:

- Linux Ubuntu 22.04
- oneAPI 24.01
- ipex-llm 2.1.0b20240610
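For reference, the `cpu_embedding=True` option mentioned in the code comments above is passed to `from_pretrained`; a minimal sketch of that alternative load for iGPU users, with the rest of the demo unchanged:

```python
# Sketch: load with the embedding layer kept on the CPU (recommended above for
# Windows iGPU users); the other arguments are the same as in the demo script.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             cpu_embedding=True,
                                             trust_remote_code=True,
                                             use_cache=True)
model = model.half().to('xpu')
```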

ganghe commented 2 weeks ago

test.py.txt (the demo Python code)

qiuxin2012 commented 2 weeks ago

This is the expected behavior of the Python virtual machine: we can't force the Python VM to release its CPU memory back to the OS. After you del the model, the memory is free inside the VM. If you load a new model, the Python process won't request new memory.
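A minimal sketch to observe this (not from the issue; it assumes `psutil` is installed and reuses the model path from the demo): RSS typically stays high after `del model`, but a second load should mostly reuse the already-reserved memory rather than doubling it.

```python
import gc
import psutil
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "./baichuan2_model/baichuan-inc/Baichuan2-7B-Chat"  # same path as the demo
proc = psutil.Process()

def rss_mb():
    """Resident set size of this process, in MB."""
    return proc.memory_info().rss / 1024 ** 2

def load():
    m = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True,
                                             trust_remote_code=True, use_cache=True)
    return m.half().to('xpu')

print(f"before load:  {rss_mb():.0f} MB")
model = load()
print(f"after load:   {rss_mb():.0f} MB")

model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
gc.collect()
print(f"after del:    {rss_mb():.0f} MB")  # usually stays high: pages kept by the allocator

model = load()
print(f"after reload: {rss_mb():.0f} MB")  # should not grow by another full model size
```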

ganghe commented 1 week ago

Hi Qiuxin,

Based on my observations, if you repeat these steps (load model + model.generate + del model) multiple times in the same process, the process's virtual memory usage becomes huge and the system OOM killer eventually kills the process. Maybe you can try to reproduce this case and see whether this situation can be improved.

Thanks Gang
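A minimal reproduction loop of the kind described above might look like the sketch below; the model path, prompt, and iteration count are assumptions carried over from the demo script, not code from the reporter.

```python
import gc
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "./baichuan2_model/baichuan-inc/Baichuan2-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("北京有哪些景点?", return_tensors="pt")

for i in range(20):
    # load model
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True,
                                                 trust_remote_code=True, use_cache=True)
    model = model.half().to('xpu')

    # generate
    output = model.generate(input_ids.to('xpu'), max_new_tokens=32)
    torch.xpu.synchronize()

    # del model
    model.to('cpu')
    torch.xpu.empty_cache()
    del model, output
    gc.collect()
    print(f"iteration {i + 1} done")
```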

qiuxin2012 commented 6 days ago

I can't reproduce it after 20 iterations, on the current nightly 2.1.0b20240701 + oneAPI 2024.0 + intel-extension-for-pytorch 2.1.10+xpu.