huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Zombies in memory - something is blocking (re)loading of Whisper after a page is closed and re-opened #958

Closed flatsiedatsie closed 1 week ago

flatsiedatsie commented 2 weeks ago

Question

I've been trying to debug this issue all afternoon, but haven't gotten any further. The code runs on desktop, but not on Android Chrome.

This is with V3 Alpha 19.

[Screenshots: console errors, 2024-10-02]

flatsiedatsie commented 2 weeks ago

The demo on HuggingFace does work on the phone... hmm. I'll have to dive deeper.

flatsiedatsie commented 1 week ago

I did a rewrite to more closely follow the recent examples in the hopes that that would fix the issue. But after all that, I still get the same type of error. It still only occurs on mobile (Android Chrome), everything runs great on desktop. I've even implemented the option to go turbo while I was at it.

But since the demo did run on the phone, it must be my code/situation.

Things I've tried:

Some avenues I could still explore, simply out of desperation:

[Screenshots: console errors, 2024-10-05]

The errors imply a memory issue. But at this point Whisper is the only AI running.

My next step is to create a minimal example that calls the worker, to see if I can rule out interference from another library.

flatsiedatsie commented 1 week ago

Some small questions:

In some example code I noticed

export default {
    DEFAULT_LANGUAGE: "english",
    (etc)
};

Yet in the rest of the code the language is always set with a two-letter code?

I've also seen the language code being set, yet I sometimes get errors saying that for English-only models the language should not be explicitly set.
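
My current working theory, expressed as a sketch (the helper is mine; it assumes English-only checkpoints always carry the '.en' suffix, e.g. whisper-small.en):

function build_asr_options(model_id) {
    const options = {
        chunk_length_s: 20,
        stride_length_s: 3,
        return_timestamps: 'word',
    };
    // English-only models reject an explicit language; only set it for multilingual ones
    if (!model_id.includes('.en')) {
        options.language = 'en'; // two-letter ISO code
        options.task = 'transcribe';
    }
    return options;
}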

flatsiedatsie commented 1 week ago

I created the minimal test case, and the error still exists. I also noticed something: Transformers.js seems to try to claim memory before the model has fully loaded:

[Screenshot: console output, 2024-10-05]

flatsiedatsie commented 1 week ago

Hmm, I did a full cache clean on the phone again and made some modifications. The result on mobile now:

[Screenshot: mobile result, 2024-10-05]

flatsiedatsie commented 1 week ago

I think I may have figured it out!

I was trying to get it to use less memory by enabling quantization. But that was in fact the problem.

flatsiedatsie commented 1 week ago

Turns out it wasn't that simple :-(

I could get it to run in the minimal variant, but it won't run properly as part of the larger whole.

Also, it seems as if something remains in memory that blocks the creation of a new instance, even after the page has been closed. If I manually force-kill Brave's GPU process and the render process, then things get reset and I can create a new instance. Otherwise the getInstance promise never resolves.

flatsiedatsie commented 1 week ago

Hardcoding the settings seems to have a positive effect 0_0

    if (self.device == 'webgpu') {
        this.instance = pipeline(this.task, this.model_id, {
            "dtype": {
                "encoder_model": "fp32",
                "decoder_model_merged": "q4"
            },
            "device": "webgpu",
            progress_callback
        });
    }
    else {
        this.instance = pipeline(this.task, this.model_id, {
            "dtype": "q8",
            "device": "wasm",
            progress_callback
        });
    }
flatsiedatsie commented 1 week ago

Strange, even WASM fails.

[Screenshot: WASM error, 2024-10-05]

flatsiedatsie commented 1 week ago

I tried reverting to Alpha 15, but the crash still occurs, which once again points to my code...

flatsiedatsie commented 1 week ago

I discovered a pattern:

So it really seems that 'something' remains alive after I close the tab.

I then tried to add this to the main code:

window.onbeforeunload = function() {
    console.log("BEFORE UNLOAD");
    if(window.whisper_worker != null){
        console.log("BEFORE UNLOAD:terminating whisper worker");
        window.whisper_worker.terminate();
    }
    return '';
};

To see if I could quickly kill the Whisper Worker when a tab is closed. But that did not seem to 'unblock' things when I created a new tab.
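
One caveat I've since read about (my assumption, not verified on every browser): beforeunload often doesn't fire on mobile, where tabs can be frozen or discarded without a full unload. pagehide is supposed to fire more reliably, so a variation could be:

window.addEventListener('pagehide', () => {
    // 'pagehide' fires more reliably than 'beforeunload' on mobile browsers
    if (window.whisper_worker != null) {
        console.log("PAGEHIDE: terminating whisper worker");
        window.whisper_worker.terminate();
        window.whisper_worker = null;
    }
});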

flatsiedatsie commented 1 week ago
import { 
    pipeline, 
    env, 
    AutoTokenizer,
    AutoProcessor, 
    AutoModel, 
    AutoModelForAudioFrameClassification,
    WhisperTextStreamer,
    WhisperForConditionalGeneration,
    full,
} from './tjs/transformers.min.js';

const MAX_NEW_TOKENS = 64;

env.allowLocalModels = false;
env.allowRemoteModels = true;
env.useBrowserCache = true;

self.device = 'webgpu';
env.backends.onnx.wasm.proxy = false;

class PipelineFactory {
    static task = null; //"automatic-speech-recognition";
    static model = null; //'onnx-community/whisper-small.en_timestamped';
    static instance = null;

    static model_id = 'onnx-community/whisper-small.en_timestamped';

    constructor(tokenizer, model, quantized) {
        //console.log("in pipelineFactory constructor.  tokenizer, model, quantized: ", tokenizer, model, quantized);
        //console.log("pipelineFactory: in constructor");
        this.tokenizer = tokenizer;
        this.model = model;
    }

    static instance_exists(){
        console.log("returning if instance exists");
        return this.instance != null;
    }

    static set_to_null(var_to_null=null) {
        if(typeof var_to_null == 'string' && typeof this[var_to_null] != 'undefined'){
            this[var_to_null] = null;
            console.log("ASR PipelineFactory: set_to_null: ", var_to_null);
        }
    }

    static async getInstance(progress_callback = null, model_id = 'onnx-community/whisper-small.en_timestamped') {
        console.log("ASR: getInstance: model_id: ", model_id);

        this.model = model_id;
        this.model_id = model_id;

        console.log("\n\npipelineFactory: getInstance");
        console.log("- this.task: ", this.task);
        console.log("- this.model_id: ", this.model_id);
        console.log("- this.model: ", this.model);
        console.log("- self.device: ", self.device);

        if (this.instance === null) {
            console.log("PipelineFactory: this.instance was null, creating pipeline promise");

            if (self.device == 'webgpu') {
                this.instance = pipeline(this.task, this.model_id, {
                    "dtype": {
                        "encoder_model": "fp32",
                        "decoder_model_merged": "q4" // "fp32"
                    },
                    "device": "webgpu",
                    progress_callback
                });
            }
            else {
                this.instance = pipeline(this.task, this.model_id, {
                    "dtype": "q8",
                    "device": "wasm",
                    progress_callback
                });
            }
        }
        else {
            console.log("ASR pipeline getInstance: this.instance already existed");
        }

        //console.log("PipelineFactory: returning this.instance: ", this.instance);
        return this.instance;
    }
}

class AutomaticSpeechRecognitionPipelineFactory extends PipelineFactory {
    static task = "automatic-speech-recognition";
    static model = null;
    static quantized = null;
}

const transcribo = async (message,preload=false) => {
    console.log("whisper_worker: in new transcribo function.  message,preload: ", message, preload);

    // Storage for transcribed chunks, filled by the streamer callbacks below
    const chunks = [];
    let output = null;
    let tps;

    try{

        if(typeof message.model != 'string'){
            console.error("transcribe: message.model was not a string!");
            return null;
        }
        console.log("transcribo: message.model: ", message.model);
        self.current_asr_model_id = message.model;

        if(typeof message.options == 'undefined'){
            console.error("transcribe: message.options was undefined!");
            return null;
        }

        let asr_options = JSON.parse(JSON.stringify(message.options));

        console.log("transcribe: initial asr_options: ", asr_options);

        /*
        let asr_options = {
            // Greedy
            top_k: 0,
            do_sample: false,

            // Sliding window
            chunk_length_s:20,
            stride_length_s:3,

            // Language and task
            //language:'en',
            //language:'english',
            //task: "transcribe",

            // Return timestamps
            return_timestamps: 'word',
            force_full_sequences: false,

            // Callback functions
            //streamer, // after each generation step
        }
        */

        const p = AutomaticSpeechRecognitionPipelineFactory;

        if (p.model !== message.model){

            // Invalidate model if different
            console.warn("whisper_worker: need to load a new ASR model: ", message.model);
            p.model = message.model;

            if (p.instance !== null) {
                console.log("whisper_worker: disposing of old ASR instance first");
                (await p.getInstance()).dispose();
                p.instance = null;
            }
        }

        // Load transcribot model
        const transcribot = await p.getInstance((data) => {
            //console.log("whisper_worker: transcribot: got data: ", data);
            self.postMessage(data);
        }, message.model);

        console.warn("\n\nHURRAY, GOT BEYOND TRANSCRIBOT CREATION\n\n");

        //console.log("transcribot loaded?: ", transcribot);
        //console.log("transcribot model: ", transcribot.tokenizer);
        //console.log("transcribot model: ", transcribot.model);
        //console.log("transcribot processor: ", transcribot.processor);

        if(preload){
            /*
            if(self.device == 'webgpu' && typeof transcribot.model == 'object' && transcribot.model != null && typeof transcribot.model.generate === 'function'){
                console.log("transcribot: preloading: attempting to warm-up the transcribot model (transcribot.model.generate is a function)");
                self.postMessage({
                    status: 'asr_warming_up',
                    data: 'Compiling shaders and warming up model...'
                });

                // Run model with dummy input to compile shaders. Only needed if running via WebGPU
                await transcribot.model.generate({
                    input_features: full([1, 80, 3000], 0.0),
                    max_new_tokens: 1,
                });
            }
            */
            console.warn("transcribe: ending early because this was a preload run");
            return true;
        }

        if(typeof message.task == 'undefined' || message.task == null || typeof message.task.recorded_audio == 'undefined'){
            console.error("transcribo: NO AUDIO!");
            return null;
        }

        const time_precision =
            transcribot.processor.feature_extractor.config.chunk_length /
            transcribot.model.config.max_source_positions;

        console.log("transcribo: time_precision: ", time_precision);

        // TODO: Storage for fully-processed and merged chunks
        // let decoded_chunks = [];

        let chunk_count = 0;
        let start_time;
        let num_tokens = 0;

        console.log("creating streamer next. transcribot.tokenizer: ", transcribot.tokenizer);

        const streamer = new WhisperTextStreamer(transcribot.tokenizer, {
            time_precision,
            on_chunk_start: (x) => {
                const offset = (asr_options['chunk_length_s'] - asr_options['stride_length_s']) * chunk_count;
                chunks.push({
                    text: "",
                    timestamp: [offset + x, null],
                    finalised: false,
                    offset,
                });
            },
            token_callback_function: (x) => {
                start_time ??= performance.now();
                if (num_tokens++ > 0) {
                    tps = (num_tokens / (performance.now() - start_time)) * 1000;
                }
            },
            callback_function: (x) => {
                if (chunks.length === 0) return;
                // Append text to the last chunk
                chunks.at(-1).text += x;

                self.postMessage({
                    status: "asr_update",
                    data: {
                        text: "", // No need to send full text yet
                        chunks,
                        tps,
                    },
                });
            },
            on_chunk_end: (x) => {
                const current = chunks.at(-1);
                current.timestamp[1] = x + current.offset;
                current.finalised = true;
            },
            on_finalize: () => {
                start_time = null;
                num_tokens = 0;
                ++chunk_count;
            },
        });
        asr_options['streamer'] = streamer;

        console.log("asr_options: ", JSON.stringify(asr_options,null,4));

        self.postMessage({ status: 'pipeline_ready' });

        console.error("\n\n\nOK\n\n\n\nWHISPER: AUDIO LENGTH: ", message.task.recorded_audio.length);
        //console.error("WHISPER AUDIO: ", message.task.recorded_audio);

        // Actually run transcription
        output = await transcribot(message.task.recorded_audio, asr_options).catch((error) => {
            console.error("caught error in transcribot: ", error);
            self.postMessage({
                status: "error",
                data: error,
            });
            return null;
        });

        console.log("whisper_worker: RAW ASR output: ", output);

    }
    catch(err){
        console.error("caught error in transcribe: ", err);
    }

    return {
        tps,
        ...output,
        chunks,
    };
};
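
For completeness, the onmessage wiring isn't shown above. A minimal sketch of it (the message shape is my own convention):

self.onmessage = async (event) => {
    const message = event.data;
    // Preload runs only download and instantiate the model
    const result = await transcribo(message, message.preload === true);
    self.postMessage({
        status: "complete",
        data: result,
    });
};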
flatsiedatsie commented 1 week ago

Hmm, on desktop, the minimal test version does run every time without issue. But only if I don't also have another tab open with a frozen Whisper Worker.

So a frozen Whisper Worker in tab A also blocks loading whisper in a new Whisper Worker in tab B.

This would imply that something in my project's code is creating the condition that freezes up the loading process in such a hardcore way that even other tabs are affected.

// Hmm, but everything works fine the first time, when the browser is fresh (or after manually killing the related processes... or killing the worker and then restarting it again...). So that would imply something gets set on the first load of Whisper that interferes with itself after a page reload.

But that issue doesn't happen in the minimal test, so something in my code/situation is causing Whisper's worker to not properly die/unload/get cleaned after the page is closed.

// The minimal version immediately crashes on mobile.

[Screenshot: mobile crash, 2024-10-06]

flatsiedatsie commented 1 week ago

By Jove, I think I may have cracked the case!

flatsiedatsie commented 1 week ago

I've solved one of the issues.

It turns out that I was sending data to the Whisper Worker right after it was created. But the worker wasn't actually 'loaded in' at that point yet.

So I've added this to the end of the worker script:

self.postMessage({
    status: "exists"
});

Only once the main script has received the "exists" message will it send the worker the data. So it was an issue in my code (as was most likely) that slipped in during the rewrite.

The delay has solved things on the desktop side. It now works perfectly again.
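
Roughly, the main-thread side now looks like this (a sketch; variable names are mine):

const whisper_worker = new Worker('./whisper_worker.js', { type: 'module' });

let worker_ready = false;
const queued_messages = [];

whisper_worker.addEventListener('message', (event) => {
    if (event.data.status === 'exists') {
        // The worker script has finished evaluating, so it's now safe to post data
        worker_ready = true;
        for (const msg of queued_messages) {
            whisper_worker.postMessage(msg);
        }
        queued_messages.length = 0;
    }
    // ... handle the other statuses (asr_update, error, etc.)
});

function send_to_whisper_worker(msg) {
    if (worker_ready) {
        whisper_worker.postMessage(msg);
    } else {
        queued_messages.push(msg);
    }
}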

On the mobile side, however, this fix unfortunately hasn't resolved the issue. There I still see:

[Screenshot: mobile error, 2024-10-07]

For now I'm assuming that's an indication of a 'real' out-of-memory issue. And now I'm back to trying variations of enabling quantization and/or changing q4 to fp32 for the encoder.

It was working fine on mobile for a long time, so I know it's possible.

flatsiedatsie commented 1 week ago

and

[Screenshot: error, 2024-10-07]

flatsiedatsie commented 1 week ago

Tested on another Android device, a Samsung tablet with just 2GB of RAM.

Also tested on an iPhone SE 2020, with 3GB of RAM.

flatsiedatsie commented 1 week ago

Since it seems to be a combination of an issue in my code and, seemingly, an issue with mobile Chrome, I'm going to close this issue.

Phew!

flatsiedatsie commented 6 hours ago

I've noticed that in general it's a good idea to create a 'preload' function, where the worker first downloads the model files and, once that is done, sends a "preloaded" message back to the main thread; only then does the main thread send it the actual tasks.

I don't know why this is. It could be that it adds a small delay? But I've now added this to the TTS worker too, and it seems much happier.

Still, sporadically a worker will still freeze. I've resorted to adding code that checks if it has taken more than 15 seconds for the worker to create an instance. If the main thread doesn't get a success mesage within that timeframe, the worker will be terminated. This.. isn't pretty.