jozefchutka opened this issue 1 year ago:
While looking into https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.js I can see a reference to ort-wasm-simd-threaded.wasm, however that one never seems to be loaded for whisper/automatic-speech-recognition ( https://huggingface.co/spaces/Xenova/whisper-web ); it always uses ort-wasm-simd.wasm. I wonder if there is a way to enable or enforce the threaded wasm and so improve transcription speed?
I believe this is due to how default HF spaces are hosted (which blocks usage of SharedArrayBuffer). Here's another thread discussing this: https://github.com/microsoft/onnxruntime/issues/9681#issuecomment-968380663. I would be interested in seeing what performance benefits we could get though. cc'ing @fs-eire @josephrocca for some help too.
good spot! Simply adding:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
...will indeed cause ort-wasm-simd-threaded.wasm to be loaded. However, I do not see multiple workers spawned (as I would expect), nor any performance improvements.
@jozefchutka So, to confirm, it loaded the ort-wasm-simd-threaded.wasm file but didn't spawn any threads? Can you check in the console whether self.crossOriginIsolated is true? If it is true, then can you check the value of env.backends.onnx.wasm.numThreads?
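For example, something along these lines from inside the worker that imports transformers.js (a sketch; adjust the import to however you load transformers.js):
import { env } from "@xenova/transformers";

// Log the values relevant to multi-threading:
console.log("crossOriginIsolated:", self.crossOriginIsolated);   // true only when the COOP/COEP headers are set
console.log("SharedArrayBuffer:", typeof SharedArrayBuffer);     // "undefined" means threads cannot be used
console.log("numThreads:", env.backends.onnx.wasm.numThreads);   // undefined means ort-web picks its default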
Regarding Hugging Face spaces, I've opened an issue here for COEP/COOP header support: https://github.com/huggingface/huggingface_hub/issues/1525
And in the meantime you can use the service worker hack on Spaces, mentioned here: https://github.com/orgs/community/discussions/13309#discussioncomment-3844940
an easy way to check is to open devtools on that page and check if typeof SharedArrayBuffer is undefined.
( OK, checking self.crossOriginIsolated is the better way, good learning for me :) )
if the multi-thread feature is available, ort-web will spawn [CPU-core-count / 2] (up to 4) threads when <ORT_ENV_OBJ>.wasm.numThreads is not set.
if <ORT_ENV_OBJ>.wasm.numThreads is set, that number is used to spawn workers (the main thread counts towards it). Setting it to 1 forcibly disables the multi-thread feature.
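In other words, the default is roughly this (a sketch of the behaviour described above, not the literal ort-web source):
const cores = (typeof navigator !== "undefined" && navigator.hardwareConcurrency) || 1;
const defaultNumThreads = Math.min(4, Math.ceil(cores / 2));   // [CPU-core-count / 2], capped at 4
console.log(`ort-web would use about ${defaultNumThreads} threads when numThreads is unset`);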
@josephrocca , @fs-eire the following is printed:
self.crossOriginIsolated -> true
env.backends.onnx.wasm.numThreads -> undefined
I have also tried to explicitly set numThreads to 4, but with the same result.
Something interesting to mention:
- I use dev tools / sources / threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js), nothing else reported until!...
- ...until the very last moment where the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (b/c they do not appear when I set numThreads=0), however I wonder... why do they appear so late? and why is there no performance difference?
for onnxruntime-web, it's env.wasm.numThreads:
import { env } from 'onnxruntime-web';
env.wasm.numThreads = 4;
for transformers.js I believe it is exposed in a different way
I believe @jozefchutka was doing it correctly: env.backends.onnx.wasm.numThreads
I.e., env.backends.onnx is the onnx env object, and the env.backends.onnx.wasm object exists on that env object... so I think it's the right one...
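For reference, this is how I set it from transformers.js (a sketch, on the assumption that env.backends.onnx is the onnxruntime-web env object re-exported by transformers.js):
import { env } from "@xenova/transformers";

// Assumed to be the same object as env.wasm on the onnxruntime-web build bundled by transformers.js;
// set it before the first pipeline()/inference session is created.
env.backends.onnx.wasm.numThreads = 4;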
please let me know if I can assist/debug any further
I've been able to do some more testing on this and I am not seeing any performance improvements either... 🤔 ort-wasm-simd-threaded.wasm is being loaded, but doesn't seem to be working correctly. @fs-eire am I correct in saying that if SharedArrayBuffer is detected, it will default to the number of threads available? So, even in this case, setting env.backends.onnx.wasm.numThreads (which is correct) would not be necessary, and we should be seeing performance improvements either way.
The code is here: https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/backend-wasm.ts#L29-L32
This code runs only once, when the first inference session is created.
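That means numThreads has to be assigned before the first session is created, for example (a sketch in plain onnxruntime-web; the model path is hypothetical):
import * as ort from "onnxruntime-web";

ort.env.wasm.numThreads = 4;   // must be assigned before the first session initializes the wasm backend
const session = await ort.InferenceSession.create("model.onnx");   // hypothetical model path; numThreads is read during this first init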
Right, so we should be seeing a performance improvement by simply having loaded ort-wasm-simd-threaded.wasm?
Yes. if you see it is loaded but no worker threads spawn, that is likely to be a bug.
Yes, this is something mentioned above by @jozefchutka:
Something interesting to mention:
- I use dev tools / sources / threads to observe running threads, where I see my index.html and my worker.js (which imports transformers.js), nothing else reported until!...
- ...until the very last moment where the transformers.js transcription finishes, and then I can see 3 more threads appearing. I believe these have something to do with transformers running in threads (b/c they do not appear when I set numThreads=0), however I wonder... why do they appear so late? and why is there no performance difference?
transformers.js does not do anything extra when it comes to threading, so I do believe this is an issue with onnxruntime-web. Please let me know if there's anything I can do to help debug/test
@xenova Unless I misunderstand, you or @jozefchutka might need to provide a minimal example here. I don't see the problem of worker threads appearing too late (i.e. after inference) in this fairly minimal demo, for example:
https://josephrocca.github.io/openai-clip-js/onnx-image-demo.html
That's using the latest ORT Web version, and has self.crossOriginIsolated==true [0]. I see ort-wasm-simd-threaded.wasm load in the network tab, and worker threads immediately appear - before inference happens.
Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".
[0] Just a heads-up: For some reason I had to manually refresh the page the first time I loaded it just now - the service worker that adds the COOP/COEP headers didn't refresh automatically like it's supposed to.
if you use the ort.env.wasm.proxy flag, the proxy worker will be spawned immediately. This is a different worker from the workers created for multi-threaded computation.
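Roughly like this (a sketch; the model path is hypothetical):
import * as ort from "onnxruntime-web";

ort.env.wasm.proxy = true;      // spawns one proxy worker that hosts the wasm runtime off the main thread
ort.env.wasm.numThreads = 4;    // compute worker threads used inside the runtime (requires crossOriginIsolated)
const session = await ort.InferenceSession.create("model.onnx");   // hypothetical model path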
Should we see performance improvements even if the batch size is 1? Could you maybe explain how work is divided among threads @fs-eire?
Regarding a demo, @jozefchutka would you mind sharing the code you were referring to above? My testing was done inside my whisper-web application, which is quite large and had a lot of bloat around it.
Here is a demo:
worker.js
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.4.1/dist/transformers.min.js";
env.allowLocalModels = false;
//env.backends.onnx.wasm.numThreads = 4;
const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);
const t0 = performance.now();
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});
for (let { text, timestamp } of result.chunks)
  console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);
console.log(performance.now() - t0);
demo.html
<script>
new Worker("worker.js", {type:"module"});
</script>
a script to generate .pcm file:
ffmpeg -i tos.mp4 -filter_complex [0:1]aformat=channel_layouts=mono,aresample=16000[aout] -map [aout] -c:a pcm_f32le -f data tos.pcm
Changing the value of env.backends.onnx.wasm.numThreads makes no difference in the transcription performance of the tested 1-minute-long pcm.
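For completeness, I serve these files locally with a small static server that adds the COOP/COEP headers mentioned above (a minimal Node.js sketch; any server that sends those two headers works):
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

// Content types for the demo files (module workers require a JavaScript MIME type)
const TYPES = { ".html": "text/html", ".js": "text/javascript", ".pcm": "application/octet-stream" };

createServer(async (req, res) => {
  const path = req.url === "/" ? "/demo.html" : req.url.split("?")[0];
  try {
    const body = await readFile("." + path);
    res.writeHead(200, {
      "Content-Type": TYPES[extname(path)] ?? "application/octet-stream",
      // The two headers that make self.crossOriginIsolated true:
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    });
    res.end(body);
  } catch {
    res.writeHead(404);
    res.end();
  }
}).listen(8080, () => console.log("serving on http://localhost:8080/"));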
Yes. if you see it is loaded but no worker threads spawn, that is likely to be a bug.
@fs-eire Any updates on this maybe? 😅 Is there perhaps an issue with spawning workers from a worker itself? Here's a 60-second audio file for testing, if you need it: ted_60.wav
@jozefchutka Can you maybe test with @josephrocca's previous suggestion? env.backends.onnx.wasm.proxy=true
Edit: Oooh, unless this is something that specifically occurs when ORT Web is loaded from within a web worker? I haven't tested that yet, since I've just been using the ort.env.wasm.proxy flag to get the model off the main thread "automatically".
@xenova , I observe no difference in performance or extra threads/workers running when tested with env.backends.onnx.wasm.proxy=true
@jozefchutka Did you try not using a worker.js file, and just keeping all the transformers.js logic in the UI thread (but still using proxy=true)?
This is a version without my worker, test.html:
<script type="module">
import { env, pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0/dist/transformers.min.js";
env.allowLocalModels = false;
env.backends.onnx.wasm.proxy = true;
const file = "tos.pcm";
const model = "Xenova/whisper-base.en";
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);
const t0 = performance.now();
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});
for (let { text, timestamp } of result.chunks)
  console.log(`${timestamp[0]} -> ${timestamp[1]} ${text}`);
console.log(performance.now() - t0);
</script>
With this script, I can see 4 workers opened, however await pipeline() is never resolved and the script basically hangs on that line. Can you please have a look?
await pipeline() is never resolved
Are you sure it's not just downloading the model? Can you check your network tab?
I'll test this though.
I've done a bit of benchmarking and there does not seem to be any speedup when using threads. URL: https://xenova-whisper-testing.hf.space/ - it consistently takes 3.8 seconds. I do see the threads spawn, though.
Also, using the proxy just freezes everything after spawning 6 threads.
@jozefchutka am I missing something? Is this also what you see? @fs-eire I am still using onnxruntime-web v1.14.0 - is this something that was fixed in a later release?
@xenova that's the same as what I have observed
I just tried this with a simple app and it works fine for me. Let me try with transformers.js next. As long as you see ort-wasm-simd-threaded.wasm loading, it should work. For testing, you can add --enable-features=SharedArrayBuffer to the Chrome command line to rule out any COEP/COOP issue.
I just tried this with a simple app and it works fine for me.
Do you see speedups too? 👀
As long as you see ort-wasm-simd-threaded.wasm loading, it should work.
@guschmue It does seem to load this file when running this demo, but there are no performance improvements (all runs around 3.7 seconds)
I am still using v1.14.0, so if something changed since then, I can update and check