microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Why is the first inference so slow, even after running once during initialization? #19177

Open nistarlwc opened 9 months ago

nistarlwc commented 9 months ago

Describe the issue

I built a class that creates the model and runs inference. In the constructor, it creates a dummy input and runs the session once. But when I then run real data, the first inference is still very slow. Why? Also, if I wait a few seconds before running the next input, that run is slow again.

To reproduce

import numpy as np
import onnxruntime as rt

# BATCH_SIZE, SEG_SIZE_H and SEG_SIZE_W are defined elsewhere in the project.

class SemanticSegment(object):
    def __init__(self, model_path):
        sess_providers = ['CUDAExecutionProvider']
        sess_options = rt.SessionOptions()
        self.session = rt.InferenceSession(model_path, sess_options, providers=sess_providers)
        self.input_name = self.session.get_inputs()[0].name
        # Warm-up run with a zero-filled dummy input.
        zero_image = np.zeros([BATCH_SIZE, SEG_SIZE_H, SEG_SIZE_W, 3], dtype=np.uint8)
        _ = self.session.run(None, {self.input_name: zero_image})

    def predict(self, image):
        input_tensor = np.expand_dims(image, axis=0)
        prediction = self.session.run(None, {self.input_name: input_tensor})[0]
        return prediction 

Urgency

No response

Platform

Windows

OS Version

win10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.15

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

Model File

No response

Is this a quantized model?

No

xadupre commented 9 months ago

To be more precise: the first call to the predict method (so the second call to session.run) is still much slower than the later calls? Are you using the GPU for inference? The first call, in the constructor, may be the cause: onnxruntime optimizes inference with CPU on the first call but has to start again on the second call (using CUDA).

nistarlwc commented 9 months ago

@xadupre Thank you for your reply.

The first call to session.run is slower than the second call, and the second call is slower than the third. After the 4th or 5th call, the run time stabilizes. But if I wait a few seconds, the next call is slow again.

I only use CUDA for prediction, as in the code: sess_providers = ['CUDAExecutionProvider']


xadupre commented 9 months ago

For the first iteration, your data is copied from CPU to GPU. Maybe that's not the case for the others. CUDA is usually faster after a few iterations (warm-up). Benchmarks on CUDA usually expose a warm-up parameter to take that effect into account.
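As an illustration of that last point, here is a minimal timing harness that discards warm-up runs before measuring (the function names and defaults are mine, not from the issue):

```python
import time

def benchmark(run_fn, n_warmup=5, n_runs=50):
    """Time run_fn, discarding the first n_warmup calls.

    run_fn: zero-argument callable that performs one inference.
    Returns the mean latency in milliseconds over the measured runs.
    """
    # Warm-up runs: trigger lazy CUDA allocations, caching, etc.
    for _ in range(n_warmup):
        run_fn()
    # Measure only after latency has stabilised.
    start = time.perf_counter()
    for _ in range(n_runs):
        run_fn()
    return (time.perf_counter() - start) / n_runs * 1000.0
```

Used, for example, as `benchmark(lambda: model.predict(image))`.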

hariharans29 commented 9 months ago

Are the image dimensions fixed, or do they vary? If your image dimensions are dynamic and bound to vary, please see this issue to optimize for that use case. Also see this related documentation.

In general, the first inference run is expected to be a lot slower than the second run, because the first run is where most CUDA memory allocations happen (this is costly) and get cached in the memory pool for subsequent runs. Ensure that the warm-up run (first run) uses the same image shape as the subsequent runs if the image size is fixed. If you do this, the second run shouldn't be a lot slower than the third run (assuming image dimensions are fixed between the second and third calls). If you have ensured all of the above, how slow is the second inference call relative to the third call?

"If wait some seconds, then the next call will be slower." - Are you saying that if a delay is introduced between runs, then inference runs are slower? If so, please see this issue.
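If the per-run host-to-device copies turn out to be part of the cost, onnxruntime's IOBinding API makes data placement explicit instead of leaving it to each run() call. A sketch (the helper name is mine, and it assumes a session already created with the CUDA provider):

```python
import numpy as np

def run_with_binding(session, input_name, image):
    """Run one inference through IOBinding so input/output placement
    is managed explicitly rather than on every plain run() call."""
    binding = session.io_binding()
    # Bind the CPU-resident input; onnxruntime copies it to the device.
    binding.bind_cpu_input(input_name, np.ascontiguousarray(image))
    # Let onnxruntime allocate each output on the provider's device.
    for out in session.get_outputs():
        binding.bind_output(out.name)
    session.run_with_iobinding(binding)
    return binding.copy_outputs_to_cpu()[0]
```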

nistarlwc commented 9 months ago

@xadupre I also think the problem is warm-up, but how can it be solved? In this project, run time is very important.
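Given the numbers reported later in this thread (latency stabilizes after 4-5 runs), one workaround is to do several warm-up runs in the constructor rather than one, with exactly the production input shape. A sketch (the function and the default of 5 runs are assumptions, not part of the original code):

```python
import numpy as np

def warm_up(session, input_name, shape, n_warmup=5, dtype=np.uint8):
    """Run several dummy inferences so later calls hit warmed-up state.

    The dummy input must match the production shape exactly: a warm-up
    with a different shape does not pre-allocate the right buffers.
    """
    dummy = np.zeros(shape, dtype=dtype)
    for _ in range(n_warmup):
        session.run(None, {input_name: dummy})
```

Called from the constructor as `warm_up(self.session, self.input_name, (BATCH_SIZE, SEG_SIZE_H, SEG_SIZE_W, 3))`.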

nistarlwc commented 9 months ago

@hariharans29 Thank you for your reply. The image dimensions are fixed. I tried setting the GPU power/clock settings (screenshot attached), but the run time did not improve.

nistarlwc commented 9 months ago

The test results:

Using 300 images per iteration.

First iteration:
run time: 110.5
run time: 79.6
run time: 54.3
run time: 6.9
run time: 6.9
run time: 6.9
......

Wait 2 s, then run the second iteration:
run time: 57.8
run time: 56.8
run time: 58.8
run time: 6.9
run time: 6.9
run time: 6.9
......

nistarlwc commented 9 months ago

@xadupre @hariharans29 Help!!! The problem is very serious. Sometimes the first ~10 predictions are very slow.

I tested with TensorFlow and it does not have this problem. I think the difference is static graphs vs. dynamic graphs.
But how can onnxruntime be used with a static graph?
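onnxruntime has no TensorFlow-style static/dynamic graph switch, but the CUDA execution provider exposes a CUDA graph capture option that records the kernel launch sequence once and replays it on later runs. A configuration sketch; note this requires fixed input shapes, typically needs IOBinding so buffer addresses stay stable, and availability depends on the onnxruntime-gpu version (the 1.15 build used here may behave differently):

```python
import onnxruntime as rt

providers = [
    ("CUDAExecutionProvider", {
        # Capture the kernel launch sequence into a CUDA graph and
        # replay it on subsequent runs; requires fixed input shapes.
        "enable_cuda_graph": "1",
    }),
]
session = rt.InferenceSession("model.onnx", providers=providers)
```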

xadupre commented 9 months ago

Is it possible to share the full script you use to run your benchmark?

nistarlwc commented 9 months ago

@xadupre https://github.com/nistarlwc/test-onnx-fastapi

xadupre commented 9 months ago

So you run onnxruntime in a multithreaded environment. Based on your code, you have one instance of onnxruntime potentially called from multiple threads. onnxruntime is designed to use all cores by default. Python should prevent multiple simultaneous calls into onnxruntime (the GIL), but maybe onnxruntime changes the way it manages memory if it detects multiple threads coming in. Maybe @hariharans29 knows more about that.
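One cheap way to test this hypothesis is to serialize all inference calls behind a lock, so at most one thread is ever inside run(). A sketch (the wrapper class is mine; session.run is documented as thread-safe, so this only changes scheduling, not correctness):

```python
import threading

class SerializedModel:
    """Wrap a session so only one thread runs inference at a time."""

    def __init__(self, session, input_name):
        self.session = session
        self.input_name = input_name
        self._lock = threading.Lock()

    def predict(self, input_tensor):
        # One inference at a time: rules out interactions between
        # concurrent run() calls when debugging latency spikes.
        with self._lock:
            return self.session.run(None, {self.input_name: input_tensor})[0]
```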

nistarlwc commented 9 months ago

@xadupre @hariharans29 Although the HTTP server uses a multi-threading model, when onnxruntime is called the images are predicted one by one.