jeremyfowers opened this issue 3 weeks ago
Since you are using graph capture, can you try deleting the generator object after generation is completed?
I already tried putting `del generator` and `del params` at the bottom of my function. I still saw a memory leak.
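As a side note, `del` plus a forced collection does reliably free the Python-side object, which is one reason to suspect the leak is in native (ONNX Runtime / DirectML) allocations rather than Python garbage. The sketch below demonstrates this with a stand-in `Generator` class (a hypothetical placeholder, not the real onnxruntime-genai object), using a weak reference to confirm the object is actually reclaimed:

```python
import gc
import weakref

class Generator:
    """Stand-in for the real generator object; used only to show the del + gc pattern."""
    pass

def run_once():
    generator = Generator()
    ref = weakref.ref(generator)   # weak ref does not keep the object alive
    del generator                  # drop the last strong reference
    gc.collect()                   # force a collection, as one might after generation
    return ref() is None           # True means Python really freed the object

print(run_once())  # True
```

If this pattern returns `True` in the real application but memory still climbs, the leaked allocations are almost certainly being held outside the Python heap.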
There has been a recent ONNX Runtime fix and a recent ONNX Runtime GenAI fix for a memory leak issue with DirectML. These fixes will be in the upcoming ONNX Runtime GenAI v0.3.0 release, which is expected to be released this week, and may fix your issue. In the meantime, you can re-build both ONNX Runtime and ONNX Runtime GenAI using the latest commits on the main branches and see if your issue is resolved.
Thanks for the heads up!
@kunal-vaishnavi are there any updates on the 0.3.0 release?
@kunal-vaishnavi, very nice meeting you in person the other day!
Today I downloaded 0.3.0 and still saw the memory leak during my MMLU test. So, I decided to dig further and found something interesting.
The memory leak presented during MMLU, but not during performance benchmarking. I dug further and found the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.
Here is quick pseudocode that has no memory leak:

```python
# Generate one sentence of 100-200 random words, reused every iteration
prompt = random_sentence()
for _ in range(1000):
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)
```
And here is pseudocode that does show the memory leak:

```python
for _ in range(1000):
    # Generate a new sentence of 100-200 random words on every iteration
    prompt = random_sentence()
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)
```
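For anyone trying to reproduce this, here is one possible implementation of the `random_sentence()` helper from the pseudocode above. The word list and exact construction are my own illustration; any source of varying plain-text prompts should trigger the same behavior:

```python
import random

# Small vocabulary to draw "words" from; any word list works for reproduction
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
         "adipiscing", "elit", "tempor", "incididunt"]

def random_sentence(min_words=100, max_words=200):
    """Return a sentence of between min_words and max_words random words."""
    n = random.randint(min_words, max_words)
    return " ".join(random.choice(WORDS) for _ in range(n))

sentence = random_sentence()
print(len(sentence.split()))  # somewhere in the range 100-200
```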
The only difference between these two programs is that the plain-text prompt changes between loop iterations.
PS. I still get the memory leak even when I do not call `tokenizer.decode(response)` at all, which is why I omitted it from the examples.
I am experiencing a memory leak while running my application, which is to run an MMLU accuracy test on my Radeon 780M iGPU via DirectML.
Each inference adds tens to hundreds of megabytes to the total system memory and total graphics memory utilized, until memory eventually fills up after about 50 inferences and crashes the system.
My system
Software
The model is running on the Radeon 780M iGPU.
My Code
I define a `generate()` function like this, meant to return all the response tokens for the `input_ids` of a prompt. Then, I call

`generate(tokenizer(prompt), max_new_tokens=1)`

dozens of times while running the MMLU accuracy test. Each prompt adds a bit more memory utilization until the system crashes.

Screenshots
Here is a screenshot of system and iGPU memory utilization. It is climbing like a staircase due to the memory leak, when it should be flat.
![image](https://github.com/microsoft/onnxruntime-genai/assets/80718789/1a06eb23-6190-4b73-ad73-a9a4444e1a28)
For reference, here is the exact same MMLU accuracy test code running on a Huggingface Transformers implementation of Phi-3-Mini on CPU. Memory utilization is flat, as expected.
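To quantify the staircase pattern rather than eyeballing it, one option is to poll process memory once per inference and compare net growth. The sample numbers below are made up for illustration (the real sampling would use something like Task Manager or a process-memory API); the sketch only shows the comparison logic:

```python
def total_growth(samples):
    """Net memory growth (MB) from the first sample to the last."""
    return samples[-1] - samples[0]

# Illustrative per-inference memory samples in MB (not real measurements)
leaky = [1000, 1150, 1320, 1500, 1680]  # staircase: climbs every inference
flat = [1000, 1003, 999, 1002, 1001]    # healthy: hovers around a baseline

print(total_growth(leaky))  # 680
print(total_growth(flat))   # 1
```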
The Question
What do I do about this memory leak? Do I need to do some explicit garbage collection in my code to make my `generate()` function safe to run many times in a loop?