nistarlwc opened 9 months ago
To be more precise, is the first call to the method predict (so the second call to session.run) still much slower than the other calls? Are you using the GPU for the inference? (The first call uses the CPU in the constructor.) That may be the cause: onnxruntime optimizes inference with CPU on the first call but has to start over on the second call (using CUDA).
@xadupre Thank you for your reply.

The first call to session.run is slower than the second call, and the second call is slower than the third. After the 4th or 5th call, the run time stabilizes at the same value. If I wait a few seconds, the next call is slower again.

Only CUDA is used for prediction; you can see it in the code: sess_providers = ['CUDAExecutionProvider']
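For reference, a minimal sketch of that setup (the model path and input shape are assumptions, not the actual project code):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical model path and input shape; the real script is not shown here.
sess_providers = ['CUDAExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=sess_providers)

input_name = session.get_inputs()[0].name
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: image})
```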
For the first iteration, your data is copied from CPU to GPU. Maybe that's not the case for the others. CUDA is usually faster after a few iterations (warm-up). Benchmarks on CUDA usually expose a warmup parameter to take that effect into account.
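A sketch of what such a warm-up looks like in the Python API (model path, input shape, and iteration counts are assumptions):

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed fixed shape

# Warm-up: untimed runs so host-to-device copies and CUDA allocations
# happen before measurement, as CUDA benchmarks usually do.
for _ in range(5):
    session.run(None, {input_name: x})

# Timed runs after warm-up.
for i in range(5):
    t0 = time.perf_counter()
    session.run(None, {input_name: x})
    print(f"run {i}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```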
Are the image dimensions fixed or bound to vary? Please see this issue if your image dimensions are dynamic and bound to vary, to optimize for that use case. Also see this related documentation.
In general, the first inference run is expected to be a lot slower than the second run, because the first run is where most CUDA memory allocations happen (this is costly); those allocations are cached in the memory pool for subsequent runs. Ensure that the warm-up run (first run) you do uses the same image shape as the subsequent runs if the image size is fixed. If you do this, the second run shouldn't be a lot slower than the third run (assuming image dimensions are fixed between the second and third calls). If you have ensured all of the above, how slow is the second inference call relative to the third call?
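If it helps, one way to make those allocations happen exactly once is to pre-bind fixed-shape GPU buffers with IOBinding; a sketch, with hypothetical input/output shapes:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
in_name = session.get_inputs()[0].name
out_name = session.get_outputs()[0].name

# Pre-allocate fixed-shape GPU buffers once; the shapes here are assumptions.
x_gpu = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 224, 224), dtype=np.float32), "cuda", 0)
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type(
    (1, 1000), np.float32, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input(in_name, x_gpu)
binding.bind_ortvalue_output(out_name, y_gpu)

session.run_with_iobinding(binding)  # warm-up with the same shape as real inputs

def predict(image):
    x_gpu.update_inplace(image)          # refresh the GPU buffer in place
    session.run_with_iobinding(binding)  # reuses the cached allocations
    return y_gpu.numpy()
```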
'If I wait a few seconds, the next call is slower again.' - Are you saying that if a delay is introduced between runs, then inference runs are slower? If so, please see this issue.
@xadupre I think the problem is warm-up too, but how can I solve it? In this project, the run time is very important.
@hariharans29 Thank you for your reply. The image dimensions are fixed. I tried to set the GPU power, but the run time is not improved.
The test results (300 images per iteration):

First iteration:
run time: 110.5, 79.6, 54.3, 6.9, 6.9, 6.9, ...

Wait 2 s, then run the second iteration:
run time: 57.8, 56.8, 58.8, 6.9, 6.9, 6.9, ...
@xadupre @hariharans29 Help!!! The problem is very serious. Sometimes the first ~10 predictions are very slow.
I tried the same test with TensorFlow and don't see the problem there.
I think the difference is a static graph versus a dynamic graph.
But how can I use onnxruntime with a static graph?
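(For reference: if "static graph" means capturing the whole model once and replaying it, recent onnxruntime-gpu releases expose a CUDA-graph option on the CUDA execution provider. A hedged sketch; it requires fixed shapes and IOBinding with stable GPU addresses, as in the binding example above, and the model path and shapes are assumptions:)

```python
import numpy as np
import onnxruntime as ort

# enable_cuda_graph asks the CUDA EP to capture the model as a CUDA graph
# on an early run and replay it on later runs.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)

x_gpu = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 224, 224), dtype=np.float32), "cuda", 0)
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type(
    (1, 1000), np.float32, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input(session.get_inputs()[0].name, x_gpu)
binding.bind_ortvalue_output(session.get_outputs()[0].name, y_gpu)

session.run_with_iobinding(binding)  # capture run
x_gpu.update_inplace(np.random.rand(1, 3, 224, 224).astype(np.float32))
session.run_with_iobinding(binding)  # replayed from the captured graph
```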
Is it possible to share the full script you use to run your benchmark?
So you run onnxruntime in a multithreaded environment. Based on your code, you have one instance of onnxruntime potentially called from multiple threads. onnxruntime is designed to use all the cores by default. Python should avoid multiple calls to onnxruntime at the same time (GIL), but maybe onnxruntime changes the way it manages memory if it detects multiple threads coming in. Maybe @hariharans29 knows more about that.
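One quick way to rule out thread effects is to serialize access to the shared session; a diagnostic sketch (not the project's actual server code):

```python
import threading
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
lock = threading.Lock()

def predict(image: np.ndarray):
    # Guarantee the shared session never sees concurrent calls,
    # even if the HTTP server handles requests on multiple threads.
    with lock:
        return session.run(None, {input_name: image})
```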
@xadupre @hariharans29, although the HTTP server uses a multi-threading model, when onnxruntime is called, the images are predicted one by one.
Describe the issue
I build a class to create the model and run inference. In initialization, I create random data and run it once. But when I run other data, the first inference is very slow. Why? And if I wait a few seconds and then run the next data, it is slow again.
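A minimal sketch of the pattern described above (model path and input shape are assumptions, not the actual script):

```python
import numpy as np
import onnxruntime as ort

class Predictor:
    def __init__(self, path="model.onnx"):
        self.session = ort.InferenceSession(
            path, providers=["CUDAExecutionProvider"])
        self.input_name = self.session.get_inputs()[0].name
        # Warm-up in the constructor: one run on random data.
        dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
        self.session.run(None, {self.input_name: dummy})

    def predict(self, image):
        return self.session.run(None, {self.input_name: image})
```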
To reproduce
Urgency
No response
Platform
Windows
OS Version
win10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime-gpu==1.15
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
Model File
No response
Is this a quantized model?
No