huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

[Question] whisper vs. ort-wasm-simd-threaded.wasm #161

Open jozefchutka opened 1 year ago

jozefchutka commented 1 year ago

While looking into https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.js I can see a reference to ort-wasm-simd-threaded.wasm, however that file never seems to be loaded for whisper/automatic-speech-recognition ( https://huggingface.co/spaces/Xenova/whisper-web ); it always uses ort-wasm-simd.wasm. I wonder if there is a way to enable or enforce threaded wasm and thereby improve transcription speed?

xenova commented 1 year ago

I believe this is due to how default HF Spaces are hosted (which blocks usage of SharedArrayBuffer). Here's another thread discussing this: https://github.com/microsoft/onnxruntime/issues/9681#issuecomment-968380663. I would be interested in seeing what performance benefits we could get, though. cc'ing @fs-eire @josephrocca for some help too.

jozefchutka commented 1 year ago

Good spot! Simply adding the following response headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

...will indeed invoke ort-wasm-simd-threaded.wasm. However, I do not see multiple workers spawned (as I would expect), nor any performance improvements.
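
For anyone reproducing this locally: any static server that sets these two headers works. For example, a minimal Node sketch (the file name, port, and MIME table here are just assumptions, not part of the Space setup):

serve.js

import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

// Minimal static file server that adds the COOP/COEP headers to every response.
const types = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".wasm": "application/wasm",
  ".pcm": "application/octet-stream",
};

createServer(async (req, res) => {
  try {
    const path = "." + (req.url === "/" ? "/index.html" : req.url);
    const body = await readFile(path);
    res.writeHead(200, {
      "Content-Type": types[extname(path)] || "application/octet-stream",
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    });
    res.end(body);
  } catch {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);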

josephrocca commented 1 year ago

@jozefchutka So, to confirm, it loaded the wasm-simd-threaded.wasm file but didn't spawn any threads? Can you check in the console whether self.crossOriginIsolated is true?

If it is true, then can you check the value of env.backends.onnx.wasm.numThreads?

Regarding Hugging Face spaces, I've opened an issue here for COEP/COOP header support: https://github.com/huggingface/huggingface_hub/issues/1525

And in the meantime you can use the service worker hack on Spaces, mentioned here: https://github.com/orgs/community/discussions/13309#discussioncomment-3844940
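
That hack boils down to a service worker that re-serves every response with the two headers added. Roughly (a sketch of the approach in that discussion, not the exact code; the file name is arbitrary):

sw.js

// Take control of the page as early as possible, then add the isolation
// headers to every response the page fetches.
self.addEventListener("install", () => self.skipWaiting());
self.addEventListener("activate", (event) => event.waitUntil(self.clients.claim()));

self.addEventListener("fetch", (event) => {
  event.respondWith((async () => {
    const response = await fetch(event.request);
    const headers = new Headers(response.headers);
    headers.set("Cross-Origin-Opener-Policy", "same-origin");
    headers.set("Cross-Origin-Embedder-Policy", "require-corp");
    return new Response(response.body, {
      status: response.status,
      statusText: response.statusText,
      headers,
    });
  })());
});

The page registers it with navigator.serviceWorker.register("sw.js") and reloads itself once the worker is active.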

fs-eire commented 1 year ago

an easy way to check is to open devtools on that page and check whether typeof SharedArrayBuffer is undefined. (OK, checking self.crossOriginIsolated is the better way; good learning for me :) )
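
For example, in the devtools console:

// quick cross-origin-isolation checks, run in the console of the page under test
console.log("crossOriginIsolated:", self.crossOriginIsolated);
console.log("SharedArrayBuffer available:", typeof SharedArrayBuffer !== "undefined");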

if multi-thread features are available and <ORT_ENV_OBJ>.wasm.numThreads is not set, ort-web will spawn [CPU-core-count / 2] threads (up to 4).

if <ORT_ENV_OBJ>.wasm.numThreads is set, that number is used to spawn workers (the main thread counts towards it). Setting it to 1 force-disables the multi-thread feature.
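
So, roughly (a sketch of the default described above, not the actual ort-web source):

// default thread count when numThreads is not set: half the logical cores, capped at 4
const cores = navigator.hardwareConcurrency || 1;
const defaultNumThreads = Math.min(4, Math.ceil(cores / 2));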

jozefchutka commented 1 year ago

@josephrocca, @fs-eire the following is printed:

self.crossOriginIsolated -> true
env.backends.onnx.wasm.numThreads -> undefined

I have also tried to explicitly set numThreads to 4 but same result.

Something interesting to mention:

  • I use dev tools / sources / threads to observe the running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until!...
  • ...until the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), however I wonder... why do they appear so late? And why is there no performance difference?

fs-eire commented 1 year ago

env.backends.onnx.wasm.numThreads -> undefined

for onnxruntime-web, it's env.wasm.numThreads:

import { env } from 'onnxruntime-web';
env.wasm.numThreads = 4;

for transformers.js, I believe it is exposed in a different way

xenova commented 1 year ago

I believe @jozefchutka was doing it correctly:

env.backends.onnx.wasm.numThreads

I.e., env.backends.onnx is the onnx env variable
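
So the same setting can be reached either way; for example, with the transformers.js import used elsewhere in this thread:

// Equivalent to ort.env.wasm.numThreads in onnxruntime-web:
import { env } from "@xenova/transformers";
env.backends.onnx.wasm.numThreads = 4;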

jozefchutka commented 1 year ago

The env.backends.onnx.wasm object exists on the env object... so I think it's the right one. Please let me know if I can assist/debug any further.

xenova commented 1 year ago

I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

fs-eire commented 1 year ago

am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.

The code is here: https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/backend-wasm.ts#L29-L32

This code runs only once, when the first inference session is created.

xenova commented 1 year ago

Right, so we should be seeing a performance improvement by simply having loaded ort-wasm-simd-threaded.wasm?

fs-eire commented 1 year ago

Yes. If you see that it is loaded but no worker threads are spawned, that is likely a bug.

xenova commented 1 year ago

Yes, this is something mentioned above by @jozefchutka:

Something interesting to mention:

  • I use dev tools / sources / threads to observe the running threads, where I see my index.html and my worker.js (which imports transformers.js); nothing else is reported until!...
  • ...until the very last moment, when the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (because they do not appear when I set numThreads=0), however I wonder... why do they appear so late? And why is there no performance difference?

transformers.js does not do anything extra when it comes to threading, so I do believe this is an issue with onnxruntime-web. Please let me know if there's anything I can do to help debug/test

josephrocca commented 1 year ago

@xenova Unless I misunderstand, you or @jozefchutka might need to provide a minimal example here. I don't see the problem of worker threads appearing too late (i.e. after inference) in this fairly minimal demo, for example:

https://josephrocca.github.io/openai-clip-js/onnx-image-demo.html

That's using the latest ORT Web version, and has self.crossOriginIsolated == true [0]. I see ort-wasm-simd-threaded.wasm load in the network tab, and worker threads immediately appear - before inference happens.

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".


[0] Just a heads-up: For some reason I had to manually refresh the page the first time I loaded it just now - the service worker that adds the COOP/COEP headers didn't refresh automatically like it's supposed to.

fs-eire commented 1 year ago

If you use the ort.env.wasm.proxy flag, the proxy worker will be spawned immediately. This is a different worker from the workers created for multi-threaded computation.
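
In other words, the two settings are independent (the values here are just examples):

import * as ort from "onnxruntime-web";
ort.env.wasm.proxy = true;    // run the whole session inside one dedicated proxy worker
ort.env.wasm.numThreads = 4;  // size of the worker pool used for the computation itself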

xenova commented 1 year ago

Should we see performance improvements even if the batch size is 1? Could you maybe explain how work is divided among threads @fs-eire?

Regarding a demo, @jozefchutka would you mind sharing the code you were referring to above? My testing was done inside my whisper-web application, which is quite large and has a lot of bloat around it.

jozefchutka commented 1 year ago

Here is a demo:

worker.js

import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";

env.allowLocalModels = false;
//env.backends.onnx.wasm.numThreads = 4;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: true});

for(let {text, timestamp} of result.chunks)
    console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);

demo.html

<script>
new Worker("worker.js", {type:"module"});
</script>

a command to generate the .pcm file:

ffmpeg -i tos.mp4 -filter_complex "[0:1]aformat=channel_layouts=mono,aresample=16000[aout]" -map "[aout]" -c:a pcm_f32le -f data tos.pcm

Changing the value of env.backends.onnx.wasm.numThreads makes no difference in the transcription performance of the tested 1-minute-long PCM.

xenova commented 1 year ago

Yes. if you see it is loaded but no worker threads spawn, that is likely to be a bug.

@fs-eire Any updates on this maybe? 😅 Is there perhaps an issue with spawning workers from a worker itself? Here's a 60-second audio file for testing, if you need it: ted_60.wav

xenova commented 1 year ago

@jozefchutka Can you maybe test with @josephrocca's previous suggestion? env.backends.onnx.wasm.proxy=true

Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".

jozefchutka commented 1 year ago

@xenova, I observe no difference in performance, nor any extra threads/workers running, when testing with env.backends.onnx.wasm.proxy=true

xenova commented 1 year ago

@jozefchutka Did you try not using a worker.js file, and just keeping all transformers.js logic in the UI thread (but still using proxy=true)?

jozefchutka commented 1 year ago

This is a version without my worker, test.html:

<script type="module">
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0/dist/transformers.min.js";

env.allowLocalModels = false;
env.backends.onnx.wasm.proxy = true;

const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

const t0 = performance.now();
const result = await pipe(buffer, {
    chunk_length_s: 30,
    stride_length_s: 5,
    return_timestamps: true});

for(let {text, timestamp} of result.chunks)
    console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);

console.log(performance.now() - t0);
</script>

With this script, I can see 4 workers opened; however, await pipeline() never resolves and the script basically hangs on that line. Can you please have a look?

xenova commented 1 year ago

await pipeline() is never resolved

Are you sure it's not just downloading the model? Can you check your network tab?

I'll test this though.

xenova commented 1 year ago

I've done a bit of benchmarking and there does not seem to be any speedup when using threads. The demo at https://xenova-whisper-testing.hf.space/ consistently takes 3.8 seconds. I do see the threads spawn, though.

Also, using the proxy just freezes everything after spawning 6 threads.

@jozefchutka am I missing something? Is this also what you see? @fs-eire I am still using onnxruntime-web v1.14.0 - is this something which was fixed in a later release?

jozefchutka commented 1 year ago

@xenova that's the same as what I have observed

guschmue commented 1 year ago

I just tried this with a simple app and it works fine for me. Let me try with transformers.js next. As long as you see ort-wasm-simd-threaded.wasm loading, it should work. For testing, you can add --enable-features=SharedArrayBuffer to the Chrome command line to rule out any COEP/COOP issue.

xenova commented 1 year ago

I just tried this with a simple app and works fine for me.

Do you see speedups too? 👀

As long as you see ort-wasm-simd-threaded.wasm loading, it should work.

@guschmue It does seem to load this file when running this demo, but there are no performance improvements (all runs take 3.7 seconds).


I am still using v1.14.0, so if something changed since then, I can update and check