microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

[Performance] Dynamic Shape performance #13198

Open SWHL opened 1 year ago

SWHL commented 1 year ago

Describe the issue

When the same detection model is run with a different input height and width on every call (dynamic shapes), inference is noticeably slower than running it with a fixed input shape, on both CPU and GPU. The script below reproduces the comparison.

To reproduce

import time

import numpy as np
import onnxruntime as ort
from tqdm import tqdm

class TestOrtInfer:
    def __init__(self, onnx_path, batch_size=1, total_samples=1000):
        self.onnx_path = onnx_path
        self.total_samples = total_samples
        self.batch_size = batch_size
        # Placeholder input; infer() regenerates self.x with the shape used for each run.
        self.x = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)

    def init_session(self, use_gpu=False):
        self.use_gpu = use_gpu
        if self.use_gpu:
            exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            exproviders = ["CPUExecutionProvider"]

        self.ort_session = ort.InferenceSession(self.onnx_path,
                                                providers=exproviders)
        self.input_name = self.ort_session.get_inputs()[0].name
        self.output_name = self.ort_session.get_outputs()[0].name

    def infer(self, is_dynamic=False):
        latency = []
        print('Number of runs:', self.total_samples)
        for i in tqdm(range(self.total_samples)):
            if is_dynamic:
                # Random height and width in [128, 1024), rounded to the nearest multiple of 32.
                w = np.random.randint(128, 1024)
                w = int(round(w / 32) * 32)

                h = np.random.randint(128, 1024)
                h = int(round(h / 32) * 32)
            else:
                # Fixed (static) input shape for every run.
                h, w = 576, 576

            self.x = np.random.randn(self.batch_size, 3, h, w).astype(np.float32)

            t0 = time.time()
            self.ort_session.run(None, {self.input_name: self.x})
            latency.append(time.time() - t0)

        avg_time = sum(latency) * 1000 / len(latency)  # mean latency per run, in milliseconds
        device = 'GPU' if self.use_gpu else 'CPU'
        print(f"Average onnxruntime {device} " \
              f"Inference time = {avg_time:.2f} ms")

onnx_path = 'OCRv3_det_infer.onnx'
tester = TestOrtInfer(onnx_path, batch_size=1, total_samples=100)

# CPU Inference
tester.init_session(use_gpu=False)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)

# GPU Inference
tester.init_session(use_gpu=True)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)

Urgency

No response

Platform

Linux

OS Version

Ubuntu

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

Model File

OCRv3_det_infer.zip

Is this a quantized model?

No

ytaous commented 1 year ago

You can try running the same shape 10 times and discarding the time of the first run; your numbers should then be comparable to the static ones. If you keep changing the shape on every run, a lot of cached data is invalidated and rebuilt.
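For illustration, here is a minimal sketch of that measurement, reusing OCRv3_det_infer.onnx and the providers from the reproduction script above (the specific shapes are arbitrary). It times the first run on each new shape separately from the following runs on the same shape; the gap between the two numbers is the per-shape setup cost that is paid again every time the shape changes.

import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("OCRv3_det_infer.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

for h, w in [(576, 576), (608, 800), (320, 416)]:
    x = np.random.randn(1, 3, h, w).astype(np.float32)

    # First run on a new shape: includes per-shape setup work (e.g. buffer re-allocation).
    t0 = time.time()
    sess.run(None, {input_name: x})
    first_ms = (time.time() - t0) * 1000

    # Subsequent runs on the same shape reuse that work.
    times = []
    for _ in range(10):
        t0 = time.time()
        sess.run(None, {input_name: x})
        times.append(time.time() - t0)
    steady_ms = sum(times) * 1000 / len(times)

    print(f"{h}x{w}: first run {first_ms:.2f} ms, steady-state average {steady_ms:.2f} ms")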

ruhyadi commented 1 year ago

I have the same problem when dealing with the recognition model of PaddleOCR, because of its dynamic input shape [-1, 3, 48, -1]. My suggestion is to warm up the model before doing inference; here is a snippet of the model warmup:

    ...
    def model_warmup(self, batch_size: int = 1, min_size: int = 300, max_size: int = 1500):
        """
        The recognition model has input shape [-1, 3, 48, -1].
        ONNX Runtime with CUDA does not perform well on arbitrary input sizes,
        so we warm up the model over the range of widths first.
        """
        log.info("Warming up model...")
        for i in tqdm(range(min_size, max_size), desc="Warming up model"):
            dummy_input = np.random.randn(batch_size, 3, 48, i).astype(np.float32)
            self.recog_session.run([self.recog_output_name], {self.recog_input_name: dummy_input})
        log.info("Model warmup completed")
    ...
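A note on that snippet: warming every width from 300 to 1500 means about 1200 extra runs. If the preprocessing only produces widths that are multiples of a fixed stride (the reproduction script above rounds to multiples of 32), warming just those widths covers the same range with far fewer runs. Below is a standalone sketch of that idea; the function name, stride, and size range are illustrative, and the session and tensor names are passed in rather than taken from the class above.

import numpy as np
from tqdm import tqdm

def warmup_strided(session, input_name, output_name,
                   batch_size=1, min_size=128, max_size=1024, stride=32):
    # One dummy inference per width the pipeline can actually produce,
    # so real inputs later arrive with shapes the session has already seen.
    for w in tqdm(range(min_size, max_size + 1, stride), desc="Warming up model"):
        dummy = np.random.randn(batch_size, 3, 48, w).astype(np.float32)
        session.run([output_name], {input_name: dummy})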
CDboyOne commented 1 month ago

You can try to run the same shape 10 times and discard the time from the first run. Your number should be comparable to the static ones. If you keep changing the shape for each run, a lot of cached data will be invalidated and rebuilt.

@ytaous How many shape caches will ORT preserve for each model? For example, if I have one model and run inference 10 times with different input shapes, will ORT keep only the cache for the most recent shape, the last N shapes, or is this decided by some other algorithm?