huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

[Severe] Memory leak issue under WebGPU Whisper transcribe pipeline #860

Closed MatteoFasulo closed 2 weeks ago

MatteoFasulo commented 2 months ago

System Info

Using transformers.js v3 in latest Chrome release on Windows 10.

GPU: Nvidia GTX 1080 (8GB)

Environment/Platform

Description

Transcribing with a Whisper model on WebGPU does not dispose of the tensors after the pipeline finishes. I verified this with nvidia-smi while transcribing a .wav file to text: memory consumption keeps growing until the GPU either runs out of memory (on smaller GPUs) or loses the device mid-computation (producing a console error saying 'device is lost').

Reproduction

  1. Use the transcription pipeline to transcribe a .wav file of at least 1 minute (a minimal sketch is shown below).
  2. Check GPU memory consumption during the computation (it should keep increasing while the computation runs).
  3. Verify whether the tensors are correctly disposed of after the computation finishes (currently the data on the GPU is not disposed of, which produces the leak).
  4. This is easy to verify with longer audio sequences, since they widen the gap between resting GPU memory and memory used during the computation.
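
For reference, a minimal sketch of the transcription call described above, assuming the transformers.js v3 pipeline API with the WebGPU backend (the model id, file name, and options are illustrative):

```js
import { pipeline } from '@huggingface/transformers';

// Load Whisper on the WebGPU backend.
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-base', {
    device: 'webgpu',
});

// Transcribe a >= 1 minute .wav file; watch GPU memory in nvidia-smi while this runs.
const output = await transcriber('audio.wav', {
    chunk_length_s: 30,
    return_timestamps: true,
});
console.log(output.text);
```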

Ideas (from Ratchet)

I spotted a great inference architecture at https://github.com/huggingface/ratchet/blob/master/ARCHITECTURE.md, in which memory consumption for encoder-decoder models like Whisper is reduced by supporting both static and dynamic graphs: the encoder is kept completely static, while the decoder runs under a dynamic graph because of KV caching.

flatsiedatsie commented 2 months ago

I'm seeing an error that only pops up after a while. Perhaps it's related?

[Screenshot: 2024-07-29 at 20:39]

MatteoFasulo commented 2 months ago

> I'm seeing an error that only pops up after a while. Perhaps it's related?
>
> [Screenshot: 2024-07-29 at 20:39]

Could be related; in my case it is strictly tied to very long audio sequences (up to 10 minutes).

flatsiedatsie commented 2 months ago

I think I'm seeing it too.

[Screenshot: 2024-08-14 at 14:01]

macOS, Brave browser, Transformers.js v3.

MatteoFasulo commented 2 months ago

> I think I'm seeing it too.
>
> [Screenshot: 2024-08-14 at 14:01]
>
> macOS, Brave browser, Transformers.js v3.

Yes. If you want to double-check, run nvidia-smi from your terminal and watch the GPU memory consumption.

xenova commented 2 months ago

Thanks to everyone for testing! Perhaps @guschmue and the ORT team can do some additional profiling to see what's going wrong.

flatsiedatsie commented 2 months ago

@MatteoFasulo I'm using a Macbook, so I don't have that option.

Also, my pipeline uses three models at the same time, which could perhaps play a role: my Whisper worker runs segmentation and verification models alongside whisper-timestamped. Though those run via WASM, so... hmm.

```js
// Imports assumed at the top of the worker (these classes are exported by transformers.js v3).
import {
    pipeline,
    AutoProcessor,
    AutoModel,
    AutoModelForAudioFrameClassification,
} from '@huggingface/transformers';

class PipelineSingleton {
    static asr_model_id = 'onnx-community/whisper-base_timestamped';
    static instance = null;
    static asr_instance = null;

    static segmentation_model_id = 'onnx-community/pyannote-segmentation-3.0';
    static segmentation_instance = null;
    static segmentation_processor = null;

    static verification_model_id = 'Xenova/wavlm-base-plus-sv';
    //static verification_model_id = 'onnx-community/wespeaker-voxceleb-resnet34-LM';
    static verification_instance = null;
    static verification_processor = null;

    static async getInstance(progress_callback = null, model_name = 'onnx-community/whisper-base_timestamped', preferences = {}) {
        //console.log("Whisper_worker: Pipeline: getInstance:  model_name, preferences: ", model_name, preferences);
        this.asr_model_id = model_name;

        PER_DEVICE_CONFIG[self.device] = { ...PER_DEVICE_CONFIG[self.device], preferences };

        this.asr_instance ??= pipeline('automatic-speech-recognition', this.asr_model_id, {
            ...PER_DEVICE_CONFIG[self.device],
            progress_callback,
        });

        this.segmentation_processor ??= AutoProcessor.from_pretrained(this.segmentation_model_id, {
            ...preferences,
            progress_callback,
        });
        this.segmentation_instance ??= AutoModelForAudioFrameClassification.from_pretrained(this.segmentation_model_id, {
            // NOTE: WebGPU is not currently supported for this model
            // See https://github.com/microsoft/onnxruntime/issues/21386
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_processor ??= AutoProcessor.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_instance ??= AutoModel.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        return Promise.all([
            this.asr_instance,
            this.segmentation_processor,
            this.segmentation_instance,
            this.verification_processor,
            this.verification_instance,
        ]);
    }
}
```
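
For context, a hedged sketch of how such a singleton might be consumed inside the worker's message handler (the audio and progress_callback variables and the message shape are illustrative; PER_DEVICE_CONFIG and self.device are assumed to be set up elsewhere in the worker):

```js
// Inside the worker's onmessage handler.
const [transcriber, segmentation_processor, segmentation_model, verification_processor, verification_model] =
    await PipelineSingleton.getInstance(progress_callback);

const result = await transcriber(audio, { return_timestamps: 'word' });
self.postMessage({ status: 'complete', result });
```
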
flatsiedatsie commented 2 months ago

I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.

[Image: whisper_memory_cleaning]

But I wasn't sure if the issue was Whisper, the segmentation model, or the voice fingerprinting model. Still, perhaps it's related.

MatteoFasulo commented 2 months ago

> I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.
>
> [Image: whisper_memory_cleaning]
>
> But I wasn't sure if the issue was Whisper, the segmentation model, or the voice fingerprinting model. Still, perhaps it's related.

Right now I do not have all the data, but in my case dispose was not working as intended. I tried calling dispose at the end of each task, but GPU memory consumption stayed the same, as if dispose had not been applied.
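
For reference, a minimal sketch of what "dispose at the end of each task" might look like, assuming the transformers.js pipeline exposes a dispose() method that releases its underlying ONNX Runtime sessions (names and options are illustrative, not the exact code used):

```js
// Hedged sketch: run one task, then try to release the pipeline's GPU resources.
const output = await transcriber(audioData, { return_timestamps: true });
// ...use output.text / output.chunks...

await transcriber.dispose(); // intended to release the model's underlying ONNX Runtime sessions
```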

MatteoFasulo commented 1 month ago

@xenova I tested the default whisper-tiny as well as whisper-small.en_timestamped using the ORT Web Perf tool by @guschmue.

However, the tool does not provide any information about memory footprint or GPU memory consumption.

Do you have any suggestions for checking whether the issue depends only on the Whisper models, or whether something else is failing to clear used memory before new tasks start?

deanihansen commented 1 month ago

@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

gyagp commented 1 month ago

Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

deanihansen commented 1 month ago

@gyagp I've reproduced this locally as follows:

  1. Pull the example at https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper
  2. Update the example's references to transformers (steps 3-4; a sketch is shown after this list):
  3. Replace the package.json dependency "@xenova/transformers" with the latest "@huggingface/transformers": "^3.0.0-alpha.14"
  4. Replace the import reference to @xenova in worker.js with @huggingface
  5. Run it in your favorite browser for a while (the longer, the better)
  6. Open the tab and hit "Load Model"
  7. Wait a good while with the mic running
  8. Observe e.g. Chrome's GPU memory usage increase indefinitely.
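
A sketch of the changes in steps 3-4 (the version string is taken from above; everything else is illustrative):

```js
// package.json: replace the dependency
//   "@xenova/transformers": "..."
// with
//   "@huggingface/transformers": "^3.0.0-alpha.14"

// worker.js: update the import to match
import { pipeline } from '@huggingface/transformers'; // was: import { pipeline } from '@xenova/transformers';
```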

The screenshot below is from Activity Monitor on macOS after 2 minutes of running the model in Chrome.

[Image: Activity Monitor screenshot]

MatteoFasulo commented 1 month ago

> Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

Hi,
The steps shared by @deanihansen are quite similar to what I was doing and experiencing. The GPU memory usage keeps increasing indefinitely, even after a task has completed its computation.

Indeed, you can follow those steps to reproduce the error and investigate the potential cause.

As a side note, in my specific case I also attempted to call the .dispose() method to release the tensors. According to the ONNX Runtime documentation, this should remove the data by deleting its internal reference if it's on the CPU, or by releasing the data if it's on the GPU.

After calling this function, the tensor is no longer valid, and its location is set to 'none'.

Additionally, I noticed there is a .release() method on InferenceSession in ONNX Runtime, which is supposed to release the inference session and its underlying resources.
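
A minimal sketch of those two calls with onnxruntime-web (the model path, input name, and shape are illustrative):

```js
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: 'gpu-buffer', // keep outputs on the GPU
});

// Illustrative feed; real input names and shapes depend on the model.
const feeds = {
    input_features: new ort.Tensor('float32', new Float32Array(1 * 80 * 3000), [1, 80, 3000]),
};
const results = await session.run(feeds);

// Release GPU-located output tensors once they are no longer needed;
// after dispose(), the tensor's location becomes 'none'.
for (const tensor of Object.values(results)) {
    tensor.dispose();
}

// Release the inference session and its underlying resources when done.
await session.release();
```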

xenova commented 1 month ago

> @xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

100% agree! I might have an idea what the problem is and how to fix this. Fingers crossed 🤞

flatsiedatsie commented 1 month ago

A new record :-)

[Screenshot: 2024-09-06 at 20:00]

deanihansen commented 1 month ago

Any thoughts on your end @gyagp?

gyagp commented 1 month ago

@deanihansen, thanks for the instructions; I can already reproduce the issue on my side. I'm currently a bit stuck on an ONNX Runtime build issue on Windows. I will treat this as high priority. Please stay tuned.

gyagp commented 1 month ago

@xenova It seems you keep the KV cache in gpu-buffer for better performance, but don't call tensor.dispose() to release those tensors?

xenova commented 1 month ago

> @xenova It seems you keep the KV cache in gpu-buffer for better performance, but don't call tensor.dispose() to release those tensors?

I should be calling it: https://github.com/xenova/transformers.js/blob/86d6da468021dfd53d6abf35a67a8f77e7ced7c0/src/models.js#L1556-L1580

chandeldivyam commented 1 month ago

@MatteoFasulo Were you able to make it work with any solution or workaround?

MatteoFasulo commented 1 month ago

> @MatteoFasulo Were you able to make it work with any solution or workaround?

Unfortunately, no, I haven't been able to make it work with any solution or workaround yet.

chandeldivyam commented 1 month ago

@MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the overall time, since the model needs to be reloaded. We would also have to do proper batching, using some VAD, before sending audio to it.

This is a brute-force approach; what could go wrong here? (P.S.: this would still be much faster than using the CPU.)
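
A hedged sketch of that brute-force approach for the web-worker case (the worker file name and message shape are illustrative):

```js
// Spawn the transcription worker.
let whisperWorker = new Worker('whisper_worker.js', { type: 'module' });

function transcribeBatch(audio) {
    return new Promise((resolve) => {
        whisperWorker.onmessage = (event) => resolve(event.data);
        whisperWorker.postMessage({ type: 'transcribe', audio });
    });
}

// After a batch completes, terminate the worker so the browser reclaims its GPU/WASM
// memory, then respawn it (the model then has to be re-initialized, which costs time).
function recycleWorker() {
    whisperWorker.terminate();
    whisperWorker = new Worker('whisper_worker.js', { type: 'module' });
}
```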

MatteoFasulo commented 1 month ago

> @MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the overall time, since the model needs to be reloaded. We would also have to do proper batching, using some VAD, before sending audio to it.
>
> This is a brute-force approach; what could go wrong here? (P.S.: this would still be much faster than using the CPU.)

That sounds reasonable, but I would suggest focusing on fixing the issue within the framework itself rather than creating external workarounds (though what you're doing now is fine as a temporary solution). I still believe there might be an error during computation that isn't properly disposing of tensors, leading to gradual memory allocation increases over time.

gyagp commented 1 month ago

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it's the user's responsibility to release the buffers after use. More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management. I think it's not hard for @xenova to fix this, as he already has the related code.

MatteoFasulo commented 1 month ago

> Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it's the user's responsibility to release the buffers after use. More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management. I think it's not hard for @xenova to fix this, as he already has the related code.

Ok, thanks for the clarification. I'll wait for that to be fixed by @xenova :)

xenova commented 1 month ago

> Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it's the user's responsibility to release the buffers after use.
>
> More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management.
>
> I think it's not hard for @xenova to fix this, as he already has the related code.

Can you suggest where in the code a call to dispose() is being missed? See https://github.com/xenova/transformers.js/issues/860#issuecomment-2337696822; this should already be handled during generation.

xenova commented 1 month ago

Okay, I believe I have figured it out. Basically, I was only freeing the decoder PKVs after generation, and not the encoder PKVs (since they are re-used); those should be freed once they are no longer needed (after the last token is generated).

I will push the update soon for testing.

gyagp commented 1 month ago

Sorry @xenova, I misunderstood your comment above and thought you already knew the root cause. I should have been more explicit: what I meant is replaceTensors() in models.js, which should also dispose of the tensors (KV cache) from previous runs. Hopefully we can have a unified solution for all the related code in Transformers.js.

xenova commented 1 month ago

@gyagp Here's my attempted fix: https://github.com/xenova/transformers.js/commit/969d10e1c610a611c9b9479f66e7419d04691615

but I don't think it fixes it entirely (cc @flatsiedatsie, maybe you can test?). I also disabled keeping the encoder outputs on the GPU, since I think this is where the leak comes from.

Install via source:

```bash
npm install xenova/transformers.js#v3
```

gyagp commented 1 month ago

Current ORT WebGPU implementation is not optimal, and we can't reuse the input GPU buffer for the output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of unused buffers explicitly to avoid a memory leak. A typical usage pattern is (see the sketch below):

- We set preferredOutputLocation to "gpu-buffer" when creating the session, to keep the outputs (present) on the GPU.
- 1st run: inputs are on the CPU, and we feed the initial data.
- 2nd run: assign the inputs (past_key_values) from the corresponding outputs of the 1st run. Note that inputs are on the GPU from now on.
- From the 3rd run on: dispose of the inputs from the previous run, and assign this run's inputs from the previous run's outputs, as in the 2nd run.
- Final run: dispose of the previous inputs and of this run's outputs.
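
A hedged sketch of that pattern with onnxruntime-web (initialCpuFeeds and nextFeeds are hypothetical helpers; input/output names and the stopping condition are illustrative):

```js
// Create the decoder session with its outputs kept on the GPU.
const session = await ort.InferenceSession.create('decoder.onnx', {
    executionProviders: ['webgpu'],
    preferredOutputLocation: 'gpu-buffer',
});

let previousFeeds = null;
let feeds = initialCpuFeeds(); // 1st run: inputs live on the CPU

for (let step = 0; step < maxSteps; ++step) {
    const outputs = await session.run(feeds);

    // Dispose of the previous run's past_key_values inputs (GPU tensors from run 2 on);
    // they are no longer referenced once this run has consumed its own inputs.
    if (previousFeeds !== null) {
        for (const [name, tensor] of Object.entries(previousFeeds)) {
            if (name.startsWith('past_key_values') && tensor.location === 'gpu-buffer') {
                tensor.dispose();
            }
        }
    }

    previousFeeds = feeds;
    // Next run's past_key_values inputs come from this run's present.* outputs (GPU tensors).
    feeds = nextFeeds(outputs); // hypothetical helper mapping present.* -> past_key_values.*
}

// After the final run: also dispose of the last inputs and the final outputs.
```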

We need to further optimize this to reuse the gpu buffer, but developers may need a way to tell if it's a reused buffer or a new buffer.

@guschmue @fs-eire, please correct me if my understanding is wrong.

flatsiedatsie commented 4 weeks ago

It seems to have worked!

I'm transcribing a large video file, and memory use remains constant.

[Screenshot: 2024-09-19 at 18:29]

fs-eire commented 4 weeks ago

> Current ORT WebGPU implementation is not optimal, and we can't reuse the input GPU buffer for the output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of unused buffers explicitly to avoid a memory leak. A typical usage pattern is: set preferredOutputLocation to "gpu-buffer" when creating the session to keep the outputs (present) on the GPU; 1st run: inputs are on the CPU, and we feed the initial data; 2nd run: assign the inputs (past_key_values) from the corresponding outputs of the 1st run (inputs are on the GPU from now on); from the 3rd run on: dispose of the inputs from the previous run, and assign this run's inputs from the previous run's outputs; final run: dispose of the previous inputs and of this run's outputs.
>
> We need to further optimize this to reuse the GPU buffer, but developers may need a way to tell if it's a reused buffer or a new buffer.
>
> @guschmue @fs-eire, please correct me if my understanding is wrong.

Currently, the rules for a tensor's lifecycle are straightforward (from ORT's perspective):

  1. Anything created by the caller (e.g. Tensor.fromGpuBuffer()) should be released by the caller.
  2. If a GPU tensor is created by ORT (as an output), then the user needs to call dispose() on it when finished using it.

ort-web also supports specifying a pre-allocated tensor as an output (not via preferredOutputLocation). If you know the type and shape of an output in advance, you can create a GPU tensor and call session.run with fetches so that onnxruntime-web will try to use that tensor. It may work to specify a user-created tensor that uses the same underlying GPU buffer for both input and output.
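
A hedged sketch of that pre-allocated-output idea (the output name, dims, and gpuBuffer are illustrative; gpuBuffer is assumed to be a GPUBuffer of the right size created on the same device the session uses):

```js
// Wrap a pre-allocated WebGPU buffer in an ORT tensor.
const preallocated = ort.Tensor.fromGpuBuffer(gpuBuffer, {
    dataType: 'float32',
    dims: [1, 8, 1500, 64], // illustrative shape for one present.* output
});

// Pass it as a fetch so onnxruntime-web tries to write that output into the supplied tensor.
const fetches = { 'present.0.encoder.key': preallocated };
const results = await session.run(feeds, fetches);
```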

I think ORT explicitly disallows its allocator from reusing an input's buffer as an output's buffer (when preferredOutputLocation === 'gpu-buffer'). There are a few reasons:

gyagp commented 4 weeks ago

@fs-eire Thanks for the clarification, which is very helpful! I agree that reusing the input buffer as the output buffer would add a lot of complexity. However, the KV cache is fundamental for transformer-based models, so reusing the buffer would bring a good performance gain. That could be another topic to discuss further, but it's not related to this issue. So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

flatsiedatsie commented 4 weeks ago

I just noticed this while running Moondream 2:

[Screenshot: 2024-09-20 at 09:01]

It seems to have stopped after generating a single word.

It seems to happen when two Transformers.js workers are loaded simultaneously.

I'll quickly check if this is related to the commit or was always an issue that I just hadn't spotted.

// Nope, with the old version the issue does not occur.

Did another test: having Whisper and the Translation worker running simultaneously is not an issue.

// Just noticed that it has happened to the Whisper worker too:

[Screenshot: 2024-09-20 at 10:11]

xenova commented 3 weeks ago

> So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

@gyagp I agree; however, as described above, this is what seems to be causing the memory leak.

gyagp commented 3 weeks ago

I think that in getPastKeyValues(), the KV caches for the encoder are not disposed of as expected. See the code below:

```js
// Optimization introduced by optimum to reuse past key values. So, we just replace the constant
// outputs with the previous past key values.
// https://github.com/huggingface/optimum/blob/0bf2c05fb7e1182b52d21b703cfc95fd9e4ea3dc/optimum/onnxruntime/base.py#L677-L704
pkvs[newName] = pastKeyValues[newName];
```

xenova commented 3 weeks ago

This is because after the first run, the decoder produces an empty tensor for encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose them until the end).

gyagp commented 3 weeks ago

The code above and the code below (a bug in ORT?) look a bit strange to me. I need to dig into the code a bit more next week.

```js
// (otherwise, this causes a memory leak or throws an error "Error: previous buffer is not registered")
if (key.includes('encoder')) continue;
```

flatsiedatsie commented 3 weeks ago

@gyagp Did you by any chance manage to find anything?

gyagp commented 2 weeks ago

> This is because after the first run, the decoder produces an empty tensor for encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose them until the end).

I debugged the code today, and I think the current code has no memory leak. I also tried the latest ORT code, and "Error: previous buffer is not registered" should have been fixed by https://github.com/microsoft/onnxruntime/pull/22254 (https://www.npmjs.com/package/onnxruntime-web/v/1.20.0-dev.20240928-1bda91fc57 should already include this fix). So "if (key.includes('encoder')) continue;" in models.js is no longer needed, and we can also keep the encoder KVs on the GPU, which is more performant.

flatsiedatsie commented 2 weeks ago

Interesting.

I tried a long transcription yesterday; memory use was high, but didn't seem to slowly grow. I checked because I was worried that switching to alpha 17 - away from the version with the test fix I had been relying on - would cause trouble.

xenova commented 2 weeks ago

This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

MatteoFasulo commented 2 weeks ago

> This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

Well done, thank you @xenova!