microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime-web bug: 'return f4' is as fast as normal, but 'return f1,f2,f3,f4' is very slow in webgl mode #12609

Open · yufengyao-lingoace opened this issue 2 years ago

yufengyao-lingoace commented 2 years ago

(1) Slow when returning f1, f2, f3, f4:

    def forward(self, x):
        f1 = self.b1(x)
        f2 = self.b2(f1)
        f3 = self.b3(f2)
        f4 = self.b4(f3)
        return f1, f2, f3, f4

When I return f1, f2, f3, f4, the onnx model runs very slowly in Chrome (ort.min.js using webgl).

(2) Fast when returning only f4:

    def forward(self, x):
        f1 = self.b1(x)
        f2 = self.b2(f1)
        f3 = self.b3(f2)
        f4 = self.b4(f3)
        return f4

When I only return f4, the onnx model runs very fast in Chrome (ort.min.js using webgl). Normally the two variants should take about the same time. I also find that

    def forward(self, x):
        f1 = self.b1(x)
        f2 = self.b2(f1)
        return f2

is faster than

    def forward(self, x):
        f1 = self.b1(x)
        return f1

So can anyone tell me why?

yufengyao-lingoace commented 2 years ago

Can anyone help me?

shalvamist commented 2 years ago

Hey yufengyao,

Thanks for the details. Can you please share a bit more info?

  1. The code looks like it's Python - how are you deploying it to the browser?
  2. Do you mind sharing the model so we can reproduce the issue and debug it?
  3. Can you share the environment setup and the onnxruntime version used to produce this issue?
yufengyao-lingoace commented 2 years ago

Hello, thanks for your reply. It is easy to reproduce the issue. I create the model with pytorch, export it to an onnx file, and then use onnxruntime-web to run the onnx model. The pytorch code is below. Since Hardsigmoid and Hardswish are not supported by onnxruntime-web, I rewrote these two functions. The model runs normally on cpu or cuda, but very slowly in Chrome using either wasm or webgl. My guess is that when I return f1,f2,f3,f4, it actually runs the graph for f1 to get f1, runs f1 and f2 to get f2, then f1,f2,f3 to get f3, and finally f1,f2,f3,f4 to get f4. So f1 is calculated four times, f2 three times, and f3 two times. There is also another problem:

The first model:

    def forward(self, x):  # first model
        f1 = self.b1(x)
        f2 = self.b2(f1)
        return f2

is faster than the second model:

    def forward(self, x):  # second model
        f1 = self.b1(x)
        return f1

The first model does more computation than the second, yet it runs faster, so I think there is something wrong with ort.js. It does not work well with this model.
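One way to sanity-check that hypothesis is to compare the two exported graphs directly: shared subgraphs are stored once in an ONNX graph, so if the extra outputs really duplicated the backbone, the multi-output export would contain roughly four times as many nodes. A minimal sketch, assuming the two variants were exported as test_multi.onnx and test_single.onnx (hypothetical file names):

    import onnx

    # compare graph sizes of the two exports; duplicated computation would
    # show up as a much larger node count in the multi-output model
    multi = onnx.load("test_multi.onnx")    # forward returns f1, f2, f3, f4
    single = onnx.load("test_single.onnx")  # forward returns only f4

    print("multi-output nodes :", len(multi.graph.node))
    print("single-output nodes:", len(single.graph.node))
    print("multi-output graph outputs:", [o.name for o in multi.graph.output])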

Below is my code:

    import time
    from turtle import forward
    import cv2
    import onnx
    import torch
    import torchvision
    import onnxruntime
    import numpy as np
    import torch.nn as nn
    from onnxsim import simplify
    import torch.nn.functional as F
    from torchvision.transforms.functional import normalize


    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = torchvision.models.mobilenet_v3_large()
            del self.backbone.avgpool
            del self.backbone.classifier
            self.b1 = nn.Sequential(self.backbone.features[0], self.backbone.features[1])
            self.b2 = nn.Sequential(self.backbone.features[2], self.backbone.features[3])
            self.b3 = nn.Sequential(self.backbone.features[4], self.backbone.features[5],
                                    self.backbone.features[6])
            self.b4 = nn.Sequential(self.backbone.features[7], self.backbone.features[8],
                                    self.backbone.features[9], self.backbone.features[10],
                                    self.backbone.features[11], self.backbone.features[12],
                                    self.backbone.features[13], self.backbone.features[14],
                                    self.backbone.features[15], self.backbone.features[16])

        def forward(self, x):
            f1 = self.b1(x)
            f2 = self.b2(f1)  # 1, 24, 60, 80
            f3 = self.b3(f2)  # 1, 40, 30, 40
            f4 = self.b4(f3)
            return f1, f2, f3, f4


    class MyHardswish(torch.nn.Module):
        def forward(self, x):
            return x * F.hardtanh(x + 3, 0., 6.) / 6.


    class MyHardsigmoid(torch.nn.Module):
        def forward(self, x):
            return F.relu6(x + 3., inplace=True) / 6.


    def _set_module(model, submodule_key, module):
        tokens = submodule_key.split('.')
        sub_tokens = tokens[:-1]
        cur_mod = model
        for s in sub_tokens:
            cur_mod = getattr(cur_mod, s)
        setattr(cur_mod, tokens[-1], module)


    model = MyModel()
    model.eval()

    input = torch.randn(1, 3, 240, 360)
    output = model(input)

    for k, m in model.named_modules():
        if isinstance(m, torch.nn.Hardswish):
            _set_module(model, k, MyHardswish())
        if isinstance(m, torch.nn.Hardsigmoid):
            _set_module(model, k, MyHardsigmoid())

    torch.onnx.export(
        model,
        (input),
        f='test.onnx',
        opset_version=12,
        input_names=['input'],
        output_names=['output'],
    )
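As a quick sanity check before moving to the browser (a sketch, assuming the test.onnx produced by the script above), the exported file can be run once with the Python onnxruntime API to confirm the input name and the output shapes:

    import numpy as np
    import onnxruntime

    # run the exported model once on a random input and list its outputs
    sess = onnxruntime.InferenceSession("test.onnx")
    x = np.random.randn(1, 3, 240, 360).astype(np.float32)
    outputs = sess.run(None, {"input": x})

    for meta, out in zip(sess.get_outputs(), outputs):
        print(meta.name, out.shape)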

shalvamist commented 2 years ago

Hey,

Thanks for the update and details. It looks like I have enough info to start investigating. I will try to reproduce the issue on my end and keep you posted with the results.

shalvamist commented 2 years ago

Hey,

Sorry for the delay in responding - I think I will need more info, and maybe an updated script from your end, to take this issue forward.

I have been working on the script you shared, trying to get it to run and generate the model you are using, but I am having issues running it. I had to change a few things to make it run, but now there seems to be an issue with the expected input/output dimensions.

Here is my current version of the script:

    import time
    from turtle import forward
    import cv2
    import onnx
    import torch
    import torchvision
    import onnxruntime
    import numpy as np
    import torch.nn as nn
    from onnxsim import simplify
    import torch.nn.functional as F
    from torchvision.transforms.functional import normalize


    class MyModel(nn.Module):
        def __init__(self):
            super(MyModel, self).__init__()
            self.backbone = torchvision.models.mobilenet_v3_large()

            del self.backbone.avgpool
            # del self.backbone.classifier
            self.b1 = nn.Sequential(self.backbone.features[0], self.backbone.features[1])
            self.b2 = nn.Sequential(self.backbone.features[2], self.backbone.features[3])
            self.b3 = nn.Sequential(self.backbone.features[4], self.backbone.features[5],
                                    self.backbone.features[6])
            self.b4 = nn.Sequential(self.backbone.features[7], self.backbone.features[8],
                                    self.backbone.features[9], self.backbone.features[10],
                                    self.backbone.features[11], self.backbone.features[12],
                                    self.backbone.features[13], self.backbone.features[14],
                                    self.backbone.features[15], self.backbone.features[16])

        def forward(self, x):
            f1 = self.backbone(x)
            f2 = self.b2(f1)  # 1, 24, 60, 80
            f3 = self.b3(f2)  # 1, 40, 30, 40
            f4 = self.b4(f3)
            return f1


    class MyHardswish(torch.nn.Module):
        def forward(self, x):
            return x * F.hardtanh(x + 3, 0., 6.) / 6.


    class MyHardsigmoid(torch.nn.Module):
        def forward(self, x):
            return F.relu6(x + 3., inplace=True) / 6.


    def _set_module(model, submodule_key, module):
        tokens = submodule_key.split('.')
        sub_tokens = tokens[:-1]
        cur_mod = model
        for s in sub_tokens:
            cur_mod = getattr(cur_mod, s)
            setattr(cur_mod, tokens[-1], module)


    model = MyModel()
    model.eval()

    input = torch.randn(1, 3, 240, 360)
    output = model(input)

    for k, m in model.named_modules():
        if isinstance(m, torch.nn.Hardswish):
            _set_module(model, k, MyHardswish())
        if isinstance(m, torch.nn.Hardsigmoid):
            _set_module(model, k, MyHardsigmoid())

    torch.onnx.export(
        model,
        (input),
        f='test.onnx',
        opset_version=12,
        input_names=['input'],
        output_names=['output'],
    )

The error I get is:

    RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 1000]

Can you please share the script and environment you are using? It might be a torch version difference as well. If you can also share the HTML file you use to load the model in the browser, that would be best, so I have the means to follow your steps and reproduce the issue.

Thanks

yufengyao-lingoace commented 2 years ago

Thanks for your reply. Below is my JavaScript code; the main function is detect():

<!DOCTYPE html>

test
yufengyao-lingoace commented 2 years ago

Hi, I tried your code and found a difference (the problem is not related to the Python or torch environment):

The right code is:

    def forward(self, x):
        f1 = self.b1(x)

but yours is:

    def forward(self, x):
        f1 = self.backbone(x)

Just change self.backbone to self.b1 and then you can export the onnx file. The JavaScript code above is mine; you can try it. Thanks very much!

shalvamist commented 2 years ago

Hey,

Thanks for the response - I was able to generate the two models and test them with random inputs. Here is the code -

<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
   // use an async context to call onnxruntime functions.
   async function main() {
        try{
            document.write(`init model`);
            session = await ort.InferenceSession.create('./test_fast.onnx'); 
            console.log(session);

            // prepare inputs. a tensor need its corresponding TypedArray as data
            var scaled_size = [360, 240];
            let fps=0;
            for (let i = 0; i < 50; i++) {

                let new_Data = new Float32Array(3 * scaled_size[0] * scaled_size[1],() => Math.floor(Math.random() * 256));
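                // note: the arrow function passed above is ignored by the Float32Array(length)
                // constructor, so new_Data stays zero-filled; Float32Array.from({length: n}, fn)
                // would be needed to actually fill it with random values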

                let input = new ort.Tensor('float32', new_Data, [1, 3, scaled_size[1], scaled_size[0]]);

                let feeds = {
                    "input": input,
                }
                // prepare feeds. use model input names as keys.

                // feed inputs and run
                var old_date = new Date();
                const results = await session.run(feeds); 
                var new_date = new Date();

                fps = fps+(new_date - old_date);

            }

            document.write(`FPS: ${(1000 / (fps/50)).toFixed(4)}`);

        } catch (e) {

            console.log(e);
            document.write(`failed to inference ONNX model: ${e}.`);
        }
    }
    main();
</script>

The two versions of the models got about the same FPS over a 50-inference run - ~17 FPS. I couldn't see any major difference between them. Can you please try this on your end?

yufengyao-lingoace commented 2 years ago

Hi, thank you for your test. I tried my code again: in wasm mode I get 17 FPS for both models, the same as you. But in webgl mode, the model that returns f1, f2, f3, f4 gets only 8 FPS, while the model that returns f4 gets 30 FPS. So this problem only shows up in webgl mode; returning f4 is about four times faster than returning f1, f2, f3, f4. Please try this code again, thanks very much!

    options = { executionProviders: ['webgl'] };
    session = await ort.InferenceSession.create('./weights/test-fast.onnx', options);

My test computer is a MacBook Pro (13-inch, 2020, four Thunderbolt 3 ports) with Intel Iris Plus Graphics 1536 MB.

shalvamist commented 2 years ago

Hey,

Thanks for pointing this out - indeed, I see a big difference between the two models under webgl: on the "fast" model I get ~11-12 FPS, and on the "slow" model I am getting ~6 FPS.

I will investigate a bit more and update you with my findings.

shalvamist commented 2 years ago

Hi,

Sorry for the long wait on this issue. We had a few discussions on the topic, and it looks like the performance degradation is due to the GPU <-> CPU data transfer. This also matches the analysis we performed: the more outputs a model defines, the more performance degradation we see. There are a few ways we could mitigate this performance loss, but it would require some effort and most likely an output-stacking feature to be implemented.

As of now we don't have this feature planned on our roadmap, but we have taken note, and once resources free up (or demand for the feature increases) we'll revisit the implementation plan.
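For illustration only, here is a rough sketch of what stacking could look like on the model side rather than in the runtime (a hypothetical StackedOutput wrapper around the MyModel posted earlier, not a planned ONNX Runtime feature): resize the intermediate feature maps to a common spatial size and concatenate them along the channel axis, so the session has a single output and only one GPU-to-CPU readback.

    import torch
    import torch.nn.functional as F

    class StackedOutput(torch.nn.Module):
        # wraps a model that returns f1, f2, f3, f4 and emits one stacked tensor instead
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, x):
            f1, f2, f3, f4 = self.model(x)
            h, w = f4.shape[2], f4.shape[3]
            # bring f1-f3 down to f4's spatial size, then concatenate along channels
            resized = [F.interpolate(f, size=(h, w), mode='nearest') for f in (f1, f2, f3)]
            return torch.cat(resized + [f4], dim=1)

The client would then slice the single tensor back into its parts by channel count; whether losing the original resolution of f1-f3 is acceptable depends on how the features are consumed downstream.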

Hope this answer is suitable for you as of now. Please let me know if you have any comments.

Thanks

yufengyao-lingoace commented 2 years ago

Thanks very much. Now that the cause has been found, I will try an output-stacking approach in my model. Thanks again, with best wishes.

liuyingbin123 commented 1 year ago

Same problem here: I have 5 outputs in my model, and wasm is much faster than webgl.