microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Memory leak in the multimodal processor of Phi-3-vision #851

Open hiro28844 opened 2 weeks ago

hiro28844 commented 2 weeks ago

Describe the bug: When calling the Phi-3-vision multimodal processor, a memory leak appears to occur, causing memory usage to increase continuously.

To Reproduce: Run the following script:

from io import BytesIO
from tempfile import mkstemp

import requests
from PIL import Image

import onnxruntime_genai as og

print("Loading model...")
model = og.Model("/path/to/Phi-3-vision-128k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4")
processor = model.create_multimodal_processor()

response = requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", timeout=30)
_, image_path = mkstemp(suffix=".png")
Image.open(BytesIO(response.content)).convert("RGB").save(image_path)
image = og.Images.open(image_path)

while True:
    prompt = "<|user|>\n"
    prompt += "<|image_1|>\n"
    text = "Please describe the image in detail."
    prompt += f"{text}<|end|>\n<|assistant|>\n"
    print("Processing image and prompt...")
    r = processor(prompt, images=image)
    del r

Expected behavior: Memory usage remains constant no matter how many times the multimodal processor is called.
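To put numbers on "remains constant" versus "keeps growing", one option is to sample the process's peak resident set size after each call. The sketch below is not part of the original report: the helper names peak_rss_mib and sample_peak_rss are hypothetical, it relies on the standard-library resource module (Unix only), and it records peak rather than current RSS, so it can show growth but never a decrease.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def sample_peak_rss(fn, iterations: int) -> list[float]:
    """Call fn() repeatedly and record the peak RSS (MiB) after each call."""
    samples = []
    for _ in range(iterations):
        fn()
        samples.append(peak_rss_mib())
    return samples
```

With the script above, something like sample_peak_rss(lambda: processor(prompt, images=image), 100) could replace the while loop; comparing the first and last samples then shows how much the peak grew over 100 calls. A peak that keeps climbing after the first few warm-up iterations would be consistent with a leak, though it does not by itself say where the memory is held.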


natke commented 2 weeks ago

Hi @hiro28844, does it remain constant or grow? We do pre-allocate memory for the KV-cache to improve performance.

hiro28844 commented 2 weeks ago

Hi @natke,

"does it remain constant or grow?"

It continues to grow. I have attached a video recorded while running the script above.

https://github.com/user-attachments/assets/b03ca877-4933-4a03-ab33-aa861cab0fcc

i-dubits commented 4 days ago

I can confirm the memory leak in Phi-3-vision. It is probably related to the following issue: https://github.com/microsoft/onnxruntime-genai/issues/590. However, fixing the 'max_length' parameter as suggested by @PatriceVignola does not change anything for me, so the image-text processing path probably differs somewhat from the text-only one. I am using the CUDA version of onnxruntime-genai. GPU memory remains constant, but CPU memory increases on every iteration.