microsoft / onnxruntime


[Web] generative decoders are slower than they should be #18754

Open guschmue opened 11 months ago

guschmue commented 11 months ago

Describe the issue

Running generative decoders (e.g. t5-small, whisper) via webgpu is slower than wasm, even though there are plenty of GPU cycles available (the GPU is only ~15% busy). Kernel times look good and cross-device copy looks good. Even with io-bindings it is still slower than wasm.
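
For context, the io-binding path being tested looks roughly like the sketch below (assuming onnxruntime-web's WebGPU EP and its `preferredOutputLocation` session option; the model path and tensor names are placeholders, not the exact code from ort-t5.html):

```typescript
// Hedged sketch of the decoder loop under test (not the exact code from ort-t5.html).
// Assumes onnxruntime-web's WebGPU EP; model path and tensor names are placeholders.
import * as ort from 'onnxruntime-web';

async function runDecoder(): Promise<void> {
  const session = await ort.InferenceSession.create('t5-small-decoder.onnx', {
    executionProviders: ['webgpu'],
    // io-binding: keep outputs on the GPU so the KV cache can be fed back into
    // the next decoding step without a GPU -> CPU -> GPU round trip
    preferredOutputLocation: 'gpu-buffer',
  });

  const feeds: Record<string, ort.Tensor> = {
    // encoder hidden states, decoder input ids, initial KV cache, ...
  };

  for (let step = 0; step < 32; step++) {
    const outputs = await session.run(feeds);
    // GPU-resident "present" outputs become the next step's "past" inputs
    feeds['past_key_values.0.decoder.key'] = outputs['present.0.decoder.key'];
    // ... same for the remaining KV tensors, plus the next input token
  }
}
```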

To reproduce

https://github.com/guschmue/ort-web-perf/blob/master/ort-t5.html

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

main

Execution Provider

'webgpu' (WebGPU)

lxfater commented 10 months ago

What causes this problem?

qjia7 commented 10 months ago

I think it is likely that GPU buffers are not being reused efficiently. For each decoder run, lots of buffers are allocated dynamically instead of reusing existing buffers. I see that https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/gpu-data-manager.ts#L267 is called many times for each inference. The current GPU buffer reuse strategy is not friendly to dynamic models: the input shapes change on every run, so the required buffer sizes change, and the buffers from the previous inference can't be reused because the reuse strategy requires an exactly matching buffer size. We may need to change the reuse strategy to reduce dynamic allocations and see whether performance improves.

Another issue is that I still see data downloaded from GPU to CPU several times during each inference, even when I choose webgpu + io binding. We need to make sure there are no unnecessary data read-backs during inference.
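
To illustrate the reuse problem (a simplified sketch, not the actual gpu-data-manager.ts logic): with exact-size matching, a free buffer from a previous step is only reclaimed when the requested byte size is identical, so a decoder whose sequence length grows every step allocates fresh buffers on every run. Bucketing requested sizes (for example, rounding up to the next power of two) would let those slightly different sizes hit the same free list:

```typescript
// Simplified sketch of a WebGPU buffer free list (requires @webgpu/types).
// Compares exact-size matching with power-of-two bucketing.

class BufferPool {
  // free buffers keyed by their allocated size in bytes
  private freeList = new Map<number, GPUBuffer[]>();

  constructor(private device: GPUDevice, private bucketed: boolean) {}

  // With bucketing, round the requested size up to the next power of two so that
  // slightly different sizes (e.g. a growing decoder sequence length) share a bucket.
  private normalize(size: number): number {
    return this.bucketed ? Math.pow(2, Math.ceil(Math.log2(Math.max(size, 16)))) : size;
  }

  acquire(size: number): GPUBuffer {
    const key = this.normalize(size);
    const list = this.freeList.get(key);
    if (list && list.length > 0) {
      return list.pop()!; // reuse: no new allocation
    }
    // exact-size matching falls through to here on every shape change
    return this.device.createBuffer({
      size: key,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
    });
  }

  release(buffer: GPUBuffer): void {
    const key = buffer.size; // size was already normalized at creation time
    const list = this.freeList.get(key) ?? [];
    list.push(buffer);
    this.freeList.set(key, list);
  }
}
```

With bucketing, run N can reuse run N-1's buffers even though the shapes grew slightly, at the cost of some over-allocation per buffer.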