gianlourbano opened 2 months ago
For the WebGPU EP, the problem is related to the Unsqueeze op. According to the ONNX spec (https://onnx.ai/onnx/operators/onnx__Unsqueeze.html), the axes input of Unsqueeze is a list of integers, but in your model it's just a scalar "1".
So is the problem related to the dynamo export of torch?
Technically, axes should always be a 1-D tensor. However, in practice the CPU implementation has loosened this restriction.
Perhaps WebGPU should have the same behavior as the CPU.
@gyagp With the latest 1.20.0-dev.20240917-afd642a194, which should include both fixes, I still cannot run the model in WebGPU; the runtime just aborts after displaying the WebGPU experimental warning.
I also hit some issues with the latest code, and I will take a further look. BTW, I manually modified the model to work around the Unsqueeze issue before, and it seems that model can run. I uploaded it to https://huggingface.co/webai-community/models/tree/main (click "Download file" next to demucs.onnx).
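For anyone hitting the same spec mismatch, here is a minimal sketch of one way such a workaround could look (not necessarily the exact edit applied to the uploaded model; file names are placeholders, and it assumes the axes come from graph initializers rather than Constant nodes):

```python
# Hedged sketch of a possible Unsqueeze workaround: rewrite any scalar `axes`
# initializer feeding an Unsqueeze node as a 1-D int64 tensor, so it matches the
# ONNX spec. Assumes opset >= 13 (axes passed as an input) and that the axes are
# plain initializers, not Constant nodes.
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("demucs.onnx")  # placeholder path

# Collect the names of tensors used as the `axes` input of Unsqueeze nodes.
axes_names = {node.input[1] for node in model.graph.node
              if node.op_type == "Unsqueeze" and len(node.input) > 1}

for init in model.graph.initializer:
    if init.name in axes_names and len(init.dims) == 0:  # scalar axes
        value = numpy_helper.to_array(init)
        fixed = numpy_helper.from_array(
            np.atleast_1d(value).astype(np.int64), name=init.name)
        init.CopyFrom(fixed)

onnx.checker.check_model(model)
onnx.save(model, "demucs_fixed.onnx")
```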
Your model successfully runs with the latest @dev, with timings (60 s of audio in 10 s chunks):
wasm: step 0: 12656 ms, step 1: 12864 ms, step 2: 13211 ms, step 3: 13164 ms, step 4: 13643 ms, step 5: 13687 ms
wgpu: step 0: 10226 ms, step 1: 9612 ms, step 2: 9628 ms, step 3: 9647 ms, step 4: 9600 ms, step 5: 9562 ms
onnx python cpu: step 0: 4.9 s, step 1: 4.9 s, step 2: 4.6 s, step 3: 4.9 s, step 4: 4.8 s, step 5: 4.6 s
On a Ryzen 4600H.
I have also tried on a MacBook M1 Pro, with an average wgpu step of ~2.8 s.
@gyagp After implementing pre- and post-processing for demuxing a whole track, I have noticed that the wgpu outputs are very different from the wasm ones. In wasm, the model works as expected, while on the GPU the stems are all mixed up, apart from the bass one: I suspect the lower frequencies are preserved, while something strange happens to the higher ones. Maybe an error in some kernel?
If you want, I can upload the stems of a 10 s chunk from wasm/wgpu inference somewhere, so you can see the difference. I'm certain the problem is not in the pre/post-processing, as the raw model outputs differ between the two backends.
Also, any update on the MatMul problem?
Sorry to hear that you got different results from wasm and WebGPU. If you can upload your case somewhere, I can take a look next week. What's the MatMul problem?
I'll update the sample repo in this issue so that it computes on the same random arrays on both wasm and wgpu, to demonstrate that the outputs differ depending on the backend used.
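One way such shared test data might be generated (a Python sketch; the input shape and file names are assumptions rather than the real demucs signature) is to fix the random seed, dump the raw input so both browser backends load identical bytes, and keep a CPU reference output to compare against:

```python
# Hypothetical helper for generating shared test data: the same fixed-seed random
# input is saved to disk so the wasm session, the WebGPU session, and a Python
# CPU reference all consume identical values. Shape and file names are
# assumptions, not the actual demucs input signature.
import numpy as np
import onnxruntime as ort

rng = np.random.default_rng(42)
x = rng.standard_normal((1, 2, 441000), dtype=np.float32)  # assumed: 10 s of stereo audio
x.tofile("input_f32.bin")  # raw float32 bytes, easy to load into a Float32Array

sess = ort.InferenceSession("demucs.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: x})
np.save("reference_output.npy", outputs[0])  # compare the browser outputs against this
```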
The MatMul problem is the one you mentioned here, i.e. the performance of the wgpu model is not that great.
Ah, sorry that it's a bit buried by other tasks. I will ask someone from my team to look into it next week.
Thank you very much @qjia7! On my MacBook Pro M1 the step is now 1.9 s, down from 2.8 s. I'm still seeing wrong outputs for the model in wgpu, while on wasm it works fine. If you want, I can upload some stems to the sample repo so you can see the difference.
@gianlourbano I will look at the wrong-outputs issue. And the optimization work isn't over yet; there are still several places that need to be optimized.
@gianlourbano I debugged this model. The incorrect result is because the MatMul shader key is not unique, which results in the wrong compute pipeline being loaded. PR #22536 may fix the issue. You can give it a try once this PR is merged.
@qjia7 I have tried with the PR for the MatMul cache and all the other optimizations, and I can confirm that the results are now correct in wgpu, with a step of just 1.8 s on my MacBook M1 Pro. Thank you very much! Are there any other optimizations planned?
@gianlourbano How long is the audio in the example you ran on the M1 Pro that takes 1.8 seconds per step?
It's always a chunk of 10s
@gianlourbano https://github.com/gianlourbano/demucs-onnx?tab=readme-ov-file I tried to run your project but it was not successful. Could you please publish the latest invocation example? Thank you very much.
Thanks for your patience. Two more optimizations are still planned: 1) ConvTranspose and 2) Transpose. After that, I think this model will be in a good state.
@qjia7 Looking forward to it very much!
Hi @qjia7, I have tried the ConvTranspose PR and I'm down to 1.4 s. I have noticed a problem: on Chrome 130, which I updated to yesterday, my whole pipeline is much slower (both wav2vec2 and demucs, the latter taking almost 3.6 s per step), while on Chrome 129 everything runs smoothly.
Can you have a try with Chrome Canary (You may download it from https://www.chromium.org/getting-involved/dev-channel/)?
On Canary it's still slower, but not as much: 2.4 s. With the Unsafe WebGPU flag on, it goes as slow as 4 s.
Edit: I have noticed it's the same with Chrome 130.
Just want to double check: are you on an M1 Pro? Can you share your app so that we can reproduce the issue? Or you could help bisect the issue by following the instructions at https://www.chromium.org/developers/bisect-builds-py.
Yes, I'm on a Mac M1 Pro. I'm using the sample repo, with a locally modified version of onnxruntime-web that includes the PRs that are yet to be merged. I will try to bisect the build.
BTW, it's not required to add the Unsafe WebGPU flag, but it's interesting to know it slows down your app so much.
Thanks for your effort!
I always turn it on because on Linux it's required for WebGPU to work. It has always worked until now :)
Some team members reminded me this morning that a big WebGPU-related change in Chrome M130 is that it enables the Tint IR on macOS. Can you add the option "--disable-dawn-features=use_tint_ir" to your Chrome launch (to disable the IR) and see if the performance recovers?
@gianlourbano @gyagp I can reproduce the regression on an M1 Pro. The bisect result is as follows: you are probably looking for a change made after 1369201 (known good), but no later than 1369208 (first known bad). CHANGELOG URL: https://chromium.googlesource.com/chromium/src/+log/83be517491dda7a23d316065c395d0f0ad584bc1..fb88b76548ad31707ee4fd433fc63419edf48a29
Roll Dawn from b4f991e7eb6e to 0b31a6ca843a (40 revisions)
https://dawn.googlesource.com/dawn.git/+log/b4f991e7eb6e..0b31a6ca843a
2024-10-16 kainino@chromium.org Add more flake expectations
2024-10-16 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 76025caa1a05 to 4a2f9b1ca432 (6 revisions)
2024-10-16 kainino@chromium.org Add flake expectations for pixel 4
2024-10-16 jimblackler@google.com [kotlin] export StringView ToKotlin for callback params in methods.cpp
2024-10-16 lokokung@google.com [dawn][emscripten] Updates callbacks to use StringView.
2024-10-16 lokokung@google.com [dawn][emscripten] Update implementation to handle StringView for inputs.
2024-10-16 jrprice@google.com [ir] Disallow access with no indices
2024-10-16 jrprice@google.com [spirv-reader] Avoid creating access with no indices
2024-10-16 jrprice@google.com DirectVariableAccess: Avoid creating access with no indices
2024-10-15 jrprice@google.com [hlsl] Fix f16 vector element stores in storage
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from 8f346c5caf5a to 4c2208c976c8 (15 revisions)
2024-10-15 jrprice@google.com [glsl] Add fuzzer for IR generator
2024-10-15 cwallez@chromium.org [dawn][webgpu.h] Remove deprecated const char* entrypoints
2024-10-15 dneto@google.com [msl ir] Convince the Metal compiler loops are never infinite
2024-10-15 beaufort.francois@gmail.com Add float32-blendable feature
2024-10-15 chrome-branch-day@chops-service-accounts.iam.gserviceaccount.com Activate dawn M131
2024-10-15 shaobo.yan@intel.com Dawn native/vulkan: PipelineLayoutVk holds multiple VkPipelineLayouts
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll SwiftShader from 7a9a492a38b7 to 74b783dffb9b (1 revision)
2024-10-15 ynovikov@chromium.org Suppress flaky WebGPU CTS compat test on Android ARM
2024-10-15 dneto@google.com Convince the metal compiler that loops are never infinite
2024-10-15 cwallez@chromium.org [dawn][generator] Sort the Python dependencies for the .d files
2024-10-15 cwallez@chromium.org [dawn][webgpu.h] Remove webgpu_cpp.h's use of memcpy.
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 367e9e74a865 to 76025caa1a05 (3 revisions)
2024-10-15 jimblackler@google.com [kotlin] Remove need for @get:JvmName annotation in enum classes
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from a9a924e1ca9b to 367e9e74a865 (3 revisions)
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll DirectX Shader Compiler from 080aeb7199e6 to 26d7dd984b2b (1 revision)
2024-10-15 lokokung@google.com [dawn][emscripten] Implements getCompilationInfo future entry point.
2024-10-14 cwallez@chromium.org StringViewUtils.cpp: Add include for std::strlen.
2024-10-14 nickchavez@google.com Fixes the Quickstart With CMake guide to use webgpu_cpp_print.h
2024-10-14 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from e0070499f409 to 8f346c5caf5a (1 revision)
2024-10-14 cwallez@chromium.org [dawn][webgpu.h] Use StringView in callback arguments.
2024-10-14 jimblackler@google.com Convert C output params to Kotlin return type for void methods.
2024-10-14 jimblackler@google.com Update test following API change.
2024-10-14 cwallez@chromium.org [dawn][graphite] Add a check for MTLFunction being nil.
2024-10-14 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 78a694a1b82a to a9a924e1ca9b (1 revision)
2024-10-13 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from e7f0d107f258 to 78a694a1b82a (1 revision)
2024-10-13 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from ab901eb0f984 to e0070499f409 (1 revision)
2024-10-12 chrome-automated-expectation@chops-service-accounts.iam.gserviceaccount.com Remove stale WebGPU CTS expectations
2024-10-12 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from a8d9d8138307 to e7f0d107f258 (1 revision)
2024-10-12 amaiorano@google.com HLSL-IR: Fix texture Sample and SampleLevel return type on depth textures
We need to file an issue with chromium/dawn about this regression. I also tried "--disable-dawn-features=use_tint_ir"; it does not seem related to the regression.
FYI @Jiawei-Shao @Kangz: we see a ~3x regression for the demucs model. See https://github.com/microsoft/onnxruntime/issues/22031#issuecomment-2464526306
It seems like a macOS-specific issue; Windows looks good. In this case, besides Dawn, 81951149 "Roll Chrome Mac Arm PGO Profile" by chromium-autoroll is also suspect.
@gianlourbano As we have already bisected and found the suspicious changes, your effort is no longer needed. We will dig into the root cause further and work with upstream to fix the issue. Thanks for reporting the regression, and stay tuned.
@gianlourbano, @qjia7 helped submit the issue to Chromium at https://issues.chromium.org/issues/379009123, and you can follow the status there. Google is taking care of this issue, and so far the analysis points to a regression introduced with the Tint IR.
@gyagp @qjia7 Thank you very much for your help!
@gyagp @qjia7 Hello, I am very interested in running demucs through onnxruntime-web in the browser, but I have no knowledge of machine learning. I tried to read tutorials on how to export PyTorch models as ONNX models, but I didn't quite understand them. How should I construct the input parameters for exporting? Is there a plan for the official team to upload the already-converted ONNX model to https://github.com/onnx/models? I think this would make it more convenient for everyone to use. Thank you very much.
Hello @asasas234, this is a modified version of the demucs model that fits my needs. Note that converting the original model from PyTorch to ONNX took quite a bit of effort, because not all operators were supported by the new dynamo export at the time. Removing these operators made the model convert successfully, but they then had to be implemented elsewhere (e.g., the STFT was moved from the beginning of the model to wasm). I suggest trying the conversion again, given that the related packages (onnxscript and the dynamo exporter in torch) are continuously updated and support for new operators may land at any time. A rough sketch of what such an export might look like is below.
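This is not my exact script: the pretrained-model loading, the dummy input shape, and the output path are assumptions, and in practice the forward pass had to be edited first to drop the ops the dynamo exporter did not yet support.

```python
# Hedged sketch of a dynamo-based export, assuming the `demucs` package is
# installed. In practice the model's forward pass had to be modified first to
# remove ops the dynamo exporter did not support at the time (e.g. the STFT).
import torch
from demucs.pretrained import get_model  # assumption: loading via the demucs package

bag = get_model("htdemucs")      # pretrained bag of models (assumed variant)
model = bag.models[0].eval()     # pick one sub-model; adjust to your needs

# Dummy input: batch of 1, stereo, 10 s at 44.1 kHz (shape is an assumption).
dummy_mix = torch.randn(1, 2, 44100 * 10)

exported = torch.onnx.dynamo_export(model, dummy_mix)
exported.save("demucs.onnx")
```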
@gianlourbano Thank you. I tried to run your demucs-onnx project before, but the conversion didn't succeed. I'm not very familiar with this technology, but it seems that ONNX supports running demucs, so I want to seek official support.
@asasas234 demucs-onnx is actually just a sample repo that runs the model with random data; all of the operators removed from the model and the pre/post-processing are missing, so it can't actually be used to demux real audio. I cannot share my implementation as it's private. Note also that you need to install the latest @dev of onnxruntime-web to run the model.
Describe the issue
I converted the model from PyTorch to ONNX as described here, with some issues. The model works in ONNX Python, but in wasm/webgpu the runtime dies without an error. The optimized version of the model runs in wasm, but not in webgpu. I don't know whether this problem is related to the model conversion or to the runtime. I have tested with both @latest and @dev.
To reproduce
Here's a link to a sample repo, instructions in README.
Urgency
Urgent, as this project is related to my thesis
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.2, 1.20.0-dev.20240907-ad9afbb042
Execution Provider
'wasm'/'cpu' (WebAssembly CPU), 'webgpu' (WebGPU)