microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

[Web] Demucs model won't run in both WASM and WGPU #22031

Open gianlourbano opened 2 months ago

gianlourbano commented 2 months ago

Describe the issue

I converted the model from PyTorch to ONNX as described here, with some issues. The model works in ONNX Runtime Python, but in wasm/webgpu the runtime dies without any error. The optimized version of the model runs in wasm, but not in webgpu. I don't know whether this problem is related to the model conversion or to the runtime. I have tested with both @latest and @dev.
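For reference, this is roughly how I create the sessions in both cases (a minimal sketch; the model URL, the input name `mix`, and the `[1, 2, 441000]` shape are assumptions from my setup, not anything mandated by onnxruntime-web):

```ts
import * as ort from 'onnxruntime-web/webgpu'; // bundle that includes the webgpu EP

// Hypothetical model location; adjust to your export.
const MODEL_URL = './demucs.onnx';

async function runOnce(ep: 'wasm' | 'webgpu') {
  const session = await ort.InferenceSession.create(MODEL_URL, {
    executionProviders: [ep],
  });
  // 10 s of stereo audio at 44.1 kHz, silence as a stand-in.
  const audio = new Float32Array(2 * 441000);
  // 'mix' is an assumed input name; check session.inputNames.
  const feeds = { mix: new ort.Tensor('float32', audio, [1, 2, 441000]) };
  return session.run(feeds);
}
```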

To reproduce

Here's a link to a sample repo; instructions are in the README.

Urgency

Urgent, as this project is related to my thesis.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2, 1.20.0-dev.20240907-ad9afbb042

Execution Provider

'wasm'/'cpu' (WebAssembly CPU), 'webgpu' (WebGPU)

gyagp commented 2 months ago

For the WebGPU EP, the problem is related to the op Unsqueeze. According to the ONNX spec (https://onnx.ai/onnx/operators/onnx__Unsqueeze.html), the axes input of Unsqueeze is a list of integers, but in your model it's just a scalar "1".

gianlourbano commented 2 months ago

So the problem is related to the dynamo export of torch?

fs-eire commented 2 months ago

Technically, axes should always be a 1-D tensor. However, in practice the CPU code has loosened this restriction:

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62

Perhaps WebGPU should match the CPU behavior.

#22054

gianlourbano commented 2 months ago

@gyagp With the latest 1.20.0-dev.20240917-afd642a194, which should include both fixes, I still cannot run the model in webgpu; the runtime just aborts after displaying the WebGPU experimental warning.

gyagp commented 2 months ago

I also hit some issues with the latest code, and I will take a further look. BTW, I manually modified the model to work around the Unsqueeze issue before, and it seems that model can run. I uploaded it to https://huggingface.co/webai-community/models/tree/main (click "download file" next to demucs.onnx).

gianlourbano commented 2 months ago

Your model successfully runs with the latest @dev. Timings (60 s of audio in 10 s chunks, on a Ryzen 4600H):

| step | wasm (ms) | webgpu (ms) | onnx python cpu (s) |
|------|-----------|-------------|---------------------|
| 0    | 12656     | 10226       | 4.9                 |
| 1    | 12864     | 9612        | 4.9                 |
| 2    | 13211     | 9628        | 4.6                 |
| 3    | 13164     | 9647        | 4.9                 |
| 4    | 13643     | 9600        | 4.8                 |
| 5    | 13687     | 9562        | 4.6                 |

gianlourbano commented 2 months ago

I have also tried on a MacBook M1 Pro, with an average webgpu step of ~2.8 s.

gianlourbano commented 1 month ago

@gyagp After implementing pre- and post-processing for demixing a whole track, I have noticed that the webgpu outputs are way different from the wasm ones. In wasm the model works as expected, while on GPU the stems are all mixed up, apart from the bass one: I suspect the lower frequencies are preserved, while something strange happens with the higher ones. Maybe an error in some kernel?

If you want, I can upload the stems of a 10 s chunk for wasm/webgpu inference somewhere, to show the difference. I'm certain the problem is not in the pre/post-processing, as the outputs of the model itself differ between the two backends.

Also, any update on the MatMul problem?

gyagp commented 1 month ago

Sorry to hear that you got different results from wasm and WebGPU. If you can upload your case somewhere, I can take a look next week. What's the MatMul problem?

gianlourbano commented 1 month ago

I'll update the sample repo in this issue so that it computes on the same random arrays on both wasm and webgpu, to demonstrate that the outputs differ depending on the backend used.
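Concretely, the repro will do something like the following (a sketch; the input name `mix` and the shapes are assumptions, and the first output is read via `session.outputNames[0]`):

```ts
import * as ort from 'onnxruntime-web/webgpu';

// Run the same random input through both backends and report the divergence.
async function compareBackends(modelUrl: string) {
  const input = new Float32Array(1 * 2 * 441000).map(() => Math.random() * 2 - 1);
  const outputs: Float32Array[] = [];
  for (const ep of ['wasm', 'webgpu'] as const) {
    const session = await ort.InferenceSession.create(modelUrl, {
      executionProviders: [ep],
    });
    // 'mix' is an assumed input name; check session.inputNames/outputNames.
    const res = await session.run({
      mix: new ort.Tensor('float32', input, [1, 2, 441000]),
    });
    outputs.push(res[session.outputNames[0]].data as Float32Array);
  }
  let maxDiff = 0;
  for (let i = 0; i < outputs[0].length; i++) {
    maxDiff = Math.max(maxDiff, Math.abs(outputs[0][i] - outputs[1][i]));
  }
  console.log('max abs diff between wasm and webgpu:', maxDiff);
}
```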

The MatMul problem is the one you mentioned here, i.e. the performance of the webgpu model is not that great.

gyagp commented 1 month ago

Ah, sorry, it got a bit buried under other tasks. I will ask someone from my team to look into it next week.

gianlourbano commented 3 weeks ago

Thank you very much @qjia7! On my MacBook M1 Pro the step is now 1.9 s, down from 2.8 s. I'm still seeing wrong outputs from the model in webgpu, while on wasm it works fine. If you want, I can upload some stems to the sample repo so you can see the difference.

qjia7 commented 3 weeks ago

@gianlourbano I will look at the wrong-outputs issue. And the optimization work isn't over yet; there are still several places that need to be optimized.

qjia7 commented 3 weeks ago

@gianlourbano I debugged this model. The incorrect result is caused by the MatMul shader key not being unique, which results in the wrong compute pipeline being loaded. PR #22536 may fix the issue. You can give it a try once the PR is merged.

gianlourbano commented 2 weeks ago

@qjia7 I have tried the PR for the MatMul cache along with all the other optimizations, and I can confirm that the results are now correct in webgpu, with a step of just 1.8 s on my MacBook M1 Pro. Thank you very much! Are there any other optimizations planned?

asasas234 commented 2 weeks ago

@gianlourbano How long is the audio in the example you ran on the M1 Pro that takes 1.8 seconds per step?

gianlourbano commented 2 weeks ago

It's always a 10 s chunk.
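For reference, the chunking is nothing fancier than slicing the waveform into fixed windows; a sketch, assuming 44.1 kHz stereo and ignoring the overlap handling my real pipeline uses:

```ts
// Split a stereo waveform into 10 s chunks (44.1 kHz assumed, no overlap).
const SAMPLE_RATE = 44100;
const CHUNK_SECONDS = 10;

function* chunks(left: Float32Array, right: Float32Array) {
  const size = SAMPLE_RATE * CHUNK_SECONDS;
  for (let start = 0; start < left.length; start += size) {
    yield {
      left: left.subarray(start, start + size),
      right: right.subarray(start, start + size),
    };
  }
}
```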

asasas234 commented 2 weeks ago

@gianlourbano https://github.com/gianlourbano/demucs-onnx?tab=readme-ov-file I tried to run your project but was not successful. Could you please publish an up-to-date invocation example? Thank you very much.

qjia7 commented 2 weeks ago

> @qjia7 I have tried the PR for the MatMul cache along with all the other optimizations, and I can confirm that the results are now correct in webgpu, with a step of just 1.8 s on my MacBook M1 Pro. Thank you very much! Are there any other optimizations planned?

Thanks for your patience. Two more optimizations are still planned: 1) ConvTranspose and 2) Transpose. After those, I expect this model to be in a good state.

asasas234 commented 2 weeks ago

@qjia7 Looking forward to it very much!

gianlourbano commented 2 weeks ago

Hi @qjia7, I have tried the ConvTranspose PR and I'm down to 1.4 s. I have noticed a problem: on Chrome 130, which I updated to yesterday, my whole pipeline is way slower (both wav2vec2 and demucs, the latter taking almost 3.6 s per step), while on Chrome 129 everything runs smoothly.

gyagp commented 2 weeks ago

> Hi @qjia7, I have tried the ConvTranspose PR and I'm down to 1.4 s. I have noticed a problem: on Chrome 130, which I updated to yesterday, my whole pipeline is way slower (both wav2vec2 and demucs, the latter taking almost 3.6 s per step), while on Chrome 129 everything runs smoothly.

Can you try Chrome Canary (you can download it from https://www.chromium.org/getting-involved/dev-channel/)?

gianlourbano commented 2 weeks ago

On Canary it's still slower, but not as much: 2.4 s. With the Unsafe WebGPU flag on, it goes as slow as 4 s.

Edit: I have noticed the same thing with Chrome 130.

gyagp commented 2 weeks ago

Just to double-check, are you on an M1 Pro? Can you share your app so that we can reproduce the issue? Or you could help bisect the issue by following the instructions at https://www.chromium.org/developers/bisect-builds-py.

gianlourbano commented 2 weeks ago

Yes, I'm on a Mac M1 Pro. I'm using the sample repo, with a locally modified version of onnxruntime-web that includes the PRs that are yet to be merged. I will try to bisect the build.

gyagp commented 2 weeks ago

BTW, the Unsafe WebGPU flag is not required. But it's interesting to know it slows down your app so much.

gyagp commented 2 weeks ago

> Yes, I'm on a Mac M1 Pro. I'm using the sample repo, with a locally modified version of onnxruntime-web that includes the PRs that are yet to be merged. I will try to bisect the build.

Thanks for your effort!

gianlourbano commented 2 weeks ago

I always turn it on because on Linux it's required for WebGPU to work. It has always worked until now :)

gyagp commented 1 week ago

Some team members reminded me this morning that a big WebGPU-related change in Chrome M130 is that it enables Tint IR on macOS. Can you add the option "--disable-dawn-features=use_tint_ir" to your Chrome launch (to disable IR) and see if the performance recovers?

qjia7 commented 1 week ago

@gianlourbano @gyagp I can reproduce the regression on an M1 Pro. The bisect result is as follows: you are probably looking for a change made after 1369201 (known good), but no later than 1369208 (first known bad). CHANGELOG URL: https://chromium.googlesource.com/chromium/src/+log/83be517491dda7a23d316065c395d0f0ad584bc1..fb88b76548ad31707ee4fd433fc63419edf48a29

Roll Dawn from b4f991e7eb6e to 0b31a6ca843a (40 revisions)

https://dawn.googlesource.com/dawn.git/+log/b4f991e7eb6e..0b31a6ca843a

2024-10-16 kainino@chromium.org Add more flake expectations
2024-10-16 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 76025caa1a05 to 4a2f9b1ca432 (6 revisions)
2024-10-16 kainino@chromium.org Add flake expectations for pixel 4
2024-10-16 jimblackler@google.com [kotlin] export StringView ToKotlin for callback params in methods.cpp
2024-10-16 lokokung@google.com [dawn][emscripten] Updates callbacks to use StringView.
2024-10-16 lokokung@google.com [dawn][emscripten] Update implementation to handle StringView for inputs.
2024-10-16 jrprice@google.com [ir] Disallow access with no indices
2024-10-16 jrprice@google.com [spirv-reader] Avoid creating access with no indices
2024-10-16 jrprice@google.com DirectVariableAccess: Avoid creating access with no indices
2024-10-15 jrprice@google.com [hlsl] Fix f16 vector element stores in storage
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from 8f346c5caf5a to 4c2208c976c8 (15 revisions)
2024-10-15 jrprice@google.com [glsl] Add fuzzer for IR generator
2024-10-15 cwallez@chromium.org [dawn][webgpu.h] Remove deprecated const char* entrypoints
2024-10-15 dneto@google.com [msl ir] Convince the Metal compiler loops are never infinite
2024-10-15 beaufort.francois@gmail.com Add float32-blendable feature
2024-10-15 chrome-branch-day@chops-service-accounts.iam.gserviceaccount.com Activate dawn M131
2024-10-15 shaobo.yan@intel.com Dawn native/vulkan: PipelineLayoutVk holds multiple VkPipelineLayouts
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll SwiftShader from 7a9a492a38b7 to 74b783dffb9b (1 revision)
2024-10-15 ynovikov@chromium.org Suppress flaky WebGPU CTS compat test on Android ARM
2024-10-15 dneto@google.com Convince the metal compiler that loops are never infinite
2024-10-15 cwallez@chromium.org [dawn][generator] Sort the Python dependencies for the .d files
2024-10-15 cwallez@chromium.org [dawn][webgpu.h] Remove webgpu_cpp.h's use of memcpy.
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 367e9e74a865 to 76025caa1a05 (3 revisions)
2024-10-15 jimblackler@google.com [kotlin] Remove need for @get:JvmName annotation in enum classes
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from a9a924e1ca9b to 367e9e74a865 (3 revisions)
2024-10-15 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll DirectX Shader Compiler from 080aeb7199e6 to 26d7dd984b2b (1 revision)
2024-10-15 lokokung@google.com [dawn][emscripten] Implements getCompilationInfo future entry point.
2024-10-14 cwallez@chromium.org StringViewUtils.cpp: Add include for std::strlen.
2024-10-14 nickchavez@google.com Fixes the Quickstart With CMake guide to use webgpu_cpp_print.h
2024-10-14 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from e0070499f409 to 8f346c5caf5a (1 revision)
2024-10-14 cwallez@chromium.org [dawn][webgpu.h] Use StringView in callback arguments.
2024-10-14 jimblackler@google.com Convert C output params to Kotlin return type for void methods.
2024-10-14 jimblackler@google.com Update test following API change.
2024-10-14 cwallez@chromium.org [dawn][graphite] Add a check for MTLFunction being nil.
2024-10-14 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from 78a694a1b82a to a9a924e1ca9b (1 revision)
2024-10-13 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from e7f0d107f258 to 78a694a1b82a (1 revision)
2024-10-13 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll vulkan-deps from ab901eb0f984 to e0070499f409 (1 revision)
2024-10-12 chrome-automated-expectation@chops-service-accounts.iam.gserviceaccount.com Remove stale WebGPU CTS expectations
2024-10-12 dawn-autoroll@skia-public.iam.gserviceaccount.com Roll ANGLE from a8d9d8138307 to e7f0d107f258 (1 revision)
2024-10-12 amaiorano@google.com HLSL-IR: Fix texture Sample and SampleLevel return type on depth textures

We need to file an issue against Chromium/Dawn about this regression. I also tried "--disable-dawn-features=use_tint_ir"; it doesn't seem related to the regression. FYI @Jiawei-Shao @Kangz, see the ~3x regression for the demucs model at https://github.com/microsoft/onnxruntime/issues/22031#issuecomment-2464526306

qjia7 commented 1 week ago

It seems like a macOS-specific issue; Windows looks good. In that case, besides Dawn, 81951149 (Roll Chrome Mac Arm PGO Profile by chromium-autoroll) is also suspect.

gyagp commented 1 week ago

@gianlourbano As we have already bisected and identified the suspicious changes, your effort is no longer needed. We will work out the root cause and work with upstream to fix the issue. Thanks for reporting the regression, and stay tuned.

gyagp commented 1 week ago

@gianlourbano, @qjia7 helped to submit the issue to Chromium at https://issues.chromium.org/issues/379009123, and you can follow the status there. Google is taking care of this issue, and so far the analysis points to a regression introduced with Tint IR.

gianlourbano commented 1 week ago

@gyagp @qjia7 Thank you very much for your help!

asasas234 commented 6 days ago

@gyagp @qjia7 Hello, I am very interested in running demucs through onnxruntime-web in the browser, but I have no knowledge of machine learning. I tried to read tutorials on how to export PyTorch models as ONNX models, but I didn't quite understand them. How should I construct the input parameters for the export? Does the official team plan to upload the already-converted ONNX model to https://github.com/onnx/models? I think this would make it more convenient for everyone to use. Thank you very much.

gianlourbano commented 6 days ago

Hello @asasas234, this is a modified version of the demucs model that fits my needs. Note that converting the original model from PyTorch to ONNX took quite a bit of work, because not all operators were supported by the new dynamo export at the time. Removing those operators made the model convert successfully, but they had to be implemented elsewhere (e.g., the STFT was moved from the beginning of the model into wasm). I suggest trying the conversion again, given that the related packages (onnxscript and torch's dynamo export) are continuously updated and support for new operators may land at any time.
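To illustrate the split (a sketch, not my actual code: the `stft` stub and the `spectrogram` input name are placeholders for whatever your modified graph expects):

```ts
import * as ort from 'onnxruntime-web/webgpu';

// Placeholder STFT: swap in a real JS/wasm implementation. It returns zeros
// with a plausible [batch, frames, bins] shape just to keep the sketch
// self-contained and runnable.
function stft(samples: Float32Array): { data: Float32Array; dims: number[] } {
  const frameSize = 4096;
  const hop = 1024;
  const frames = Math.max(1, Math.floor((samples.length - frameSize) / hop) + 1);
  return { data: new Float32Array(frames * frameSize), dims: [1, frames, frameSize] };
}

// 'spectrogram' is an assumed input name for the modified model.
async function separateChunk(session: ort.InferenceSession, samples: Float32Array) {
  const spec = stft(samples); // the preprocessing that was moved out of the graph
  return session.run({ spectrogram: new ort.Tensor('float32', spec.data, spec.dims) });
}
```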

asasas234 commented 6 days ago

@gianlourbano Thank you, I tried to run your project demucs-onnx before, but the conversion didn't succeed. I'm not very familiar with this technology, but it seems that ONNX supports running demucs, so I want to seek official support.

gianlourbano commented 6 days ago

@asasas234 demucs-onnx is actually just a sample repo that runs the model on random data; all of the operators removed from the model and the pre/post-processing are missing, so it can't actually be used to demix real audio. I cannot share my implementation as it's private. Note also that you need to install the latest @dev of onnxruntime-web to run the model.