Open guschmue opened 11 months ago
What causes this problem?
I think it is likely that GPU buffers are not being reused efficiently. For each decoder run, lots of buffers are allocated dynamically instead of reusing existing ones. I see https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/webgpu/gpu-data-manager.ts#L267 being called many times per inference. The current GPU buffer reuse strategy is not friendly to dynamic models: the input shapes change on every run, so the required buffer sizes change and the previous inference's buffers can't be reused, because the reuse strategy requires an exactly matching buffer size. We may need to change the reuse strategy to reduce dynamic allocation and see whether perf improves.

Another issue is that I still see data downloaded from GPU to CPU several times during each inference, even with webgpu + io binding. We need to make sure there are no unnecessary read-backs during inference.
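To illustrate the kind of strategy change I mean: instead of requiring an exact size match, requests could be rounded up to size buckets so a buffer freed after one decoder step can serve the slightly different sizes of the next step. A minimal sketch (this is not the actual gpu-data-manager implementation; class and method names are illustrative, and it assumes @webgpu/types for the GPU* types):

```typescript
// Hypothetical bucketed reuse pool: round each request up to the next power of
// two so freed buffers can satisfy near-miss sizes instead of only exact matches.
class BucketedBufferPool {
  // free lists keyed by "bucketSize|usageFlags"
  private freeBuffers = new Map<string, GPUBuffer[]>();

  constructor(private device: GPUDevice) {}

  private bucket(size: number): number {
    let b = 16;
    while (b < size) b *= 2;
    return b;
  }

  acquire(size: number, usage: GPUBufferUsageFlags): GPUBuffer {
    const b = this.bucket(size);
    const key = `${b}|${usage}`;
    const list = this.freeBuffers.get(key);
    if (list && list.length > 0) {
      return list.pop()!; // reuse a previously freed buffer of the same bucket
    }
    return this.device.createBuffer({ size: b, usage });
  }

  release(buffer: GPUBuffer): void {
    // buffers were created at bucket sizes, so size + usage identify the free list
    const key = `${buffer.size}|${buffer.usage}`;
    const list = this.freeBuffers.get(key);
    if (list) list.push(buffer);
    else this.freeBuffers.set(key, [buffer]);
  }
}
```

The trade-off is some wasted memory per buffer (up to ~2x with power-of-two buckets) in exchange for far fewer createBuffer calls on dynamic-shape decoders.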
Describe the issue
Running generative decoders via webgpu (e.g. t5-small, whisper) is slower than wasm, even though there are plenty of GPU cycles available (the GPU is ~15% busy). We know kernel times look good and cross-device copies look good. Even with io-bindings it is still slower than wasm.
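For context, the io-binding setup being tested follows roughly this pattern with onnxruntime-web (a simplified sketch, not the exact ort-t5.html code; the model path, input/output names, and dims are placeholders):

```typescript
import * as ort from 'onnxruntime-web/webgpu';

// Keep decoder outputs (e.g. KV cache) on the GPU so they can be fed back as
// inputs on the next step without a CPU round trip.
async function runDecoder() {
  const session = await ort.InferenceSession.create('decoder.onnx', {
    executionProviders: ['webgpu'],
    // Ask ORT to leave outputs in GPU buffers instead of reading them back.
    preferredOutputLocation: 'gpu-buffer',
  });

  const feeds: Record<string, ort.Tensor> = {
    input_ids: new ort.Tensor('int64', BigInt64Array.from([0n]), [1, 1]),
  };

  for (let step = 0; step < 16; step++) {
    const results = await session.run(feeds);

    // Re-wrap a GPU-resident output as an input for the next step (placeholder names).
    const pastKey = results['present_key_0'];
    feeds['past_key_0'] = ort.Tensor.fromGpuBuffer(pastKey.gpuBuffer, {
      dataType: 'float32',
      dims: [...pastKey.dims],
    });

    // Only the logits should need a download to pick the next token.
    const logits = await results['logits'].getData();
    // ... argmax/sampling over logits, update input_ids, etc.
  }
}
```

With this setup, the only expected GPU-to-CPU transfer per step is the logits read-back, which is why the extra downloads observed during inference look suspicious.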
To reproduce
https://github.com/guschmue/ort-web-perf/blob/master/ort-t5.html
Urgency
No response
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
main
Execution Provider
'webgpu' (WebGPU)