microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Memory leak during back-to-back inferences #590

Open jeremyfowers opened 3 weeks ago

jeremyfowers commented 3 weeks ago

I am experiencing a memory leak while running my application, which runs an MMLU accuracy test on my Radeon 780M iGPU via DirectML.

Each inference adds tens to hundreds of megabytes to the total system memory and total graphics memory in use, until memory fills up after about 50 inferences and the system crashes.

My system

Software

The model is running on the Radeon 780M iGPU.

My Code

I define a generate() function like this, which is meant to return all of the response tokens generated for the input_ids of a prompt.

import time

import onnxruntime_genai as og


def generate(
    input_folder,
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    temperature=0.7,
    pad_token_id=None,
):
    model = og.Model(input_folder)
    params = og.GeneratorParams(model)

    if pad_token_id:
        params.pad_token_id = pad_token_id

    max_length = len(input_ids) + max_new_tokens

    params.input_ids = input_ids
    params.set_search_options(
        do_sample=do_sample,
        top_k=top_k,
        top_p=top_p,
        temperature=temperature,
        max_length=max_length,
        min_length=max_length,
    )
    params.try_graph_capture_with_max_batch_size(1)

    generator = og.Generator(model, params)

    # Prefill: process the prompt and produce the first token
    prompt_start_time = time.perf_counter()
    generator.compute_logits()
    generator.generate_next_token()
    prompt_end_time = time.perf_counter()

    time_to_first_token = prompt_end_time - prompt_start_time

    if max_new_tokens > 1:
        # Decode: generate the remaining tokens one at a time
        token_gen_times = []
        while not generator.is_done():
            token_gen_start_time = time.perf_counter()
            generator.compute_logits()
            generator.generate_next_token()
            token_gen_end_time = time.perf_counter()

            token_gen_times.append(token_gen_end_time - token_gen_start_time)

        if token_gen_times:
            # List will be empty if we generated 1 or 0 tokens, and we don't
            # want a divide-by-zero error in those cases
            avg_token_gen_latency_s = sum(token_gen_times) / len(token_gen_times)
            tokens_per_second = 1 / avg_token_gen_latency_s

    return [generator.get_sequence(0)]

Then I call generate(input_folder, tokenizer.encode(prompt), max_new_tokens=1) dozens of times while running the MMLU accuracy test, roughly as sketched below. Each prompt adds a bit more memory utilization until the system crashes.
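For concreteness, the calling pattern is roughly the sketch below. mmlu_prompts, input_folder, and the tokenizer setup are stand-ins for my actual test harness, so treat this as illustrative rather than my exact code:

# Illustrative loop only; mmlu_prompts and input_folder are placeholders
model = og.Model(input_folder)
tokenizer = og.Tokenizer(model)

for prompt in mmlu_prompts:
    input_ids = tokenizer.encode(prompt)
    response_tokens = generate(input_folder, input_ids, max_new_tokens=1)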

Screenshots

Here is a screenshot of system and iGPU memory utilization. It is climbing like a staircase due to the memory leak, when it should be flat.

[screenshots: system memory and iGPU memory utilization climbing step by step]

For reference, here is the exact same MMLU accuracy test code running against a Hugging Face Transformers implementation of Phi-3-Mini on CPU. Memory utilization is flat, as expected.

[screenshot: flat memory utilization with the Hugging Face Transformers CPU baseline]

The Question

What do I do about this memory leak? Do I need to do some explicit garbage collection in my code to make my generate() function safe to run many times in a loop?

kunal-vaishnavi commented 3 weeks ago

Since you are using graph capture, can you try deleting the generator object after generation is completed?

https://github.com/microsoft/onnxruntime-genai/blob/8608d1317d9a97101adb3f4fea7889eb348445fb/examples/python/phi3-qa.py#L71-L72
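In other words, something along these lines at the end of your generate() function (a minimal sketch based on the linked phi3-qa.py snippet; the del generator is the important part, so its DirectML resources can be released before the next call):

# Sketch only: capture the output first, then drop the generator explicitly
sequence = generator.get_sequence(0)
del generator
return [sequence]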

jeremyfowers commented 3 weeks ago

I already tried putting del generator and del params at the bottom of my function. I still saw a memory leak.

kunal-vaishnavi commented 3 weeks ago

There has been a recent ONNX Runtime fix and a recent ONNX Runtime GenAI fix for a memory leak issue with DirectML. These fixes will be in the upcoming ONNX Runtime GenAI v0.3.0 release, which is expected this week, and may resolve your issue. In the meantime, you can re-build both ONNX Runtime and ONNX Runtime GenAI from the latest commits on their main branches and see if your issue is resolved.
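After re-installing the rebuilt wheels, it may also be worth confirming that Python is picking up the new builds rather than the old ones. A quick sanity check, assuming both packages expose __version__ (an assumption on my part; verify against your builds):

# Sanity check that the rebuilt packages are the ones actually imported
import onnxruntime
import onnxruntime_genai

print("onnxruntime:", onnxruntime.__version__)
print("onnxruntime-genai:", onnxruntime_genai.__version__)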

jeremyfowers commented 3 weeks ago

Thanks for the heads up!

jeremyfowers commented 2 weeks ago

@kunal-vaishnavi are there any updates on the 0.3.0 release?

kunal-vaishnavi commented 2 weeks ago

There have been some last-minute PRs that need to be included in the release, such as this one. The changes for the v0.3.0 release branch can be tracked here. Once they are merged, v0.3.0 should be released by the end of this week.

jeremyfowers commented 2 days ago

@kunal-vaishnavi, very nice meeting you in person the other day!

Today I downloaded 0.3.0 and still saw the memory leak during my MMLU test. So, I decided to dig further and found something interesting.

The memory leak presented during MMLU, but not during performance benchmarking. I dug further and found the only meaningful difference between my MMLU and benchmark code was that MMLU delivered a unique prompt on every iteration, whereas my benchmark reused the same prompt across iterations.

Here is quick pseudocode that has no memory leak:

prompt = random_sentence()  # generate a sentence of 100-200 random words
for _ in range(1000):
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)

And here is pseudocode that does show the memory leak:

for _ in range(1000):
    prompt = random_sentence()  # generate a sentence of 100-200 random words
    input_ids = tokenizer.encode(prompt)
    response = model.generate(input_ids)

The only difference between these two programs is that the plain-text prompt changes between loop iterations.

P.S. I still get the memory leak even when I do not call tokenizer.decode(response) at all, which is why I omitted it from the examples.
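For completeness, here is a fuller sketch of the reproducer I have been using. random_sentence(), the placeholder model path, and the psutil-based logging are my own scaffolding rather than anything from onnxruntime-genai, and psutil only tracks process RAM, not iGPU memory, so it captures just part of the growth:

# Reproducer sketch; generate() is the function defined earlier in this issue.
# Paths and helpers here are placeholders.
import random

import psutil
import onnxruntime_genai as og

WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]

def random_sentence():
    # A sentence of 100-200 random words, so every prompt is unique
    return " ".join(random.choices(WORDS, k=random.randint(100, 200)))

def rss_mb():
    # Resident set size of this process, in megabytes (RAM only, not VRAM)
    return psutil.Process().memory_info().rss / (1024 * 1024)

input_folder = "path/to/phi-3-mini-directml"  # placeholder model folder
model = og.Model(input_folder)
tokenizer = og.Tokenizer(model)

for i in range(1000):
    prompt = random_sentence()  # unique prompt each iteration -> memory climbs
    # Reusing one fixed prompt here instead keeps memory flat in my testing
    input_ids = tokenizer.encode(prompt)
    generate(input_folder, input_ids, max_new_tokens=1)

    if i % 10 == 0:
        print(f"iteration {i}: {rss_mb():.0f} MB resident")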