microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

[Performance] Dynamic Shape performance #13198

Open SWHL opened 1 year ago

SWHL commented 1 year ago

Describe the issue

When the same detection model is run with a different input height and width on every call (dynamic shapes), inference is noticeably slower than running it with a fixed input shape, on both CPU and GPU. The script below reproduces the comparison.

To reproduce

import time

import numpy as np
import onnxruntime as ort
from tqdm import tqdm

class TestOrtInfer:
    def __init__(self, onnx_path, batch_size=1, total_samples=1000):
        self.onnx_path = onnx_path
        self.total_samples = total_samples
        self.batch_size = batch_size
        # Placeholder input; infer() regenerates self.x with the shape used for each run.
        self.x = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)

    def init_session(self, use_gpu=False):
        self.use_gpu = use_gpu
        if self.use_gpu:
            exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            exproviders = ["CPUExecutionProvider"]

        self.ort_session = ort.InferenceSession(self.onnx_path,
                                                providers=exproviders)
        self.input_name = self.ort_session.get_inputs()[0].name
        self.output_name = self.ort_session.get_outputs()[0].name

    def infer(self, is_dynamic=False):
        latency = []
        print('Number of runs:', self.total_samples)
        for i in tqdm(range(self.total_samples)):
            if is_dynamic:
                # Random height and width in [128, 1024), rounded to the nearest multiple of 32.
                w = np.random.randint(128, 1024)
                w = int(round(w / 32) * 32)

                h = np.random.randint(128, 1024)
                h = int(round(h / 32) * 32)
            else:
                # Fixed (static) input shape for every run.
                h, w = 576, 576

            self.x = np.random.randn(self.batch_size, 3, h, w).astype(np.float32)

            t0 = time.time()
            self.ort_session.run(None, {self.input_name: self.x})
            latency.append(time.time() - t0)

        avg_time = sum(latency) * 1000 / len(latency)  # mean latency per run, in milliseconds
        device = 'GPU' if self.use_gpu else 'CPU'
        print(f"Average onnxruntime {device} " \
              f"Inference time = {avg_time:.2f} ms")

onnx_path = 'OCRv3_det_infer.onnx'
tester = TestOrtInfer(onnx_path, batch_size=1, total_samples=100)

# CPU Inference
tester.init_session(use_gpu=False)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)

# GPU Inference
tester.init_session(use_gpu=True)
tester.infer(is_dynamic=False)
tester.infer(is_dynamic=True)

Urgency

No response

Platform

Linux

OS Version

Ubuntu

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

Model File

OCRv3_det_infer.zip

Is this a quantized model?

No

ytaous commented 1 year ago

You can try running the same shape 10 times and discarding the time of the first run; your numbers should then be comparable to the static ones. If you keep changing the shape on every run, a lot of cached data is invalidated and rebuilt.
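For illustration, here is a minimal sketch of that measurement, reusing OCRv3_det_infer.onnx and the providers from the reproduction script above (the specific shapes are arbitrary). It times the first run on each new shape separately from the following runs on the same shape; the gap between the two numbers is the per-shape setup cost that is paid again every time the shape changes.

import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("OCRv3_det_infer.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

for h, w in [(576, 576), (608, 800), (320, 416)]:
    x = np.random.randn(1, 3, h, w).astype(np.float32)

    # First run on a new shape: includes per-shape setup work (e.g. buffer re-allocation).
    t0 = time.time()
    sess.run(None, {input_name: x})
    first_ms = (time.time() - t0) * 1000

    # Subsequent runs on the same shape reuse that work.
    times = []
    for _ in range(10):
        t0 = time.time()
        sess.run(None, {input_name: x})
        times.append(time.time() - t0)
    steady_ms = sum(times) * 1000 / len(times)

    print(f"{h}x{w}: first run {first_ms:.2f} ms, steady-state average {steady_ms:.2f} ms")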

ruhyadi commented 1 year ago

I have the same problem when dealing with the recognition model of PaddleOCR, because of its dynamic input shape [-1, 3, 48, -1]. My suggestion is to warm up the model before doing inference; here is a snippet of the model warmup:

    ...
    def model_warmup(self, batch_size: int = 1, min_size: int = 300, max_size: int = 1500):
        """
        The recognition model has input shape [-1, 3, 48, -1].
        ONNX Runtime with CUDA does not perform well on arbitrary input sizes,
        so we warm up the model over the range of widths first.
        """
        log.info("Warming up model...")
        for i in tqdm(range(min_size, max_size), desc="Warming up model"):
            dummy_input = np.random.randn(batch_size, 3, 48, i).astype(np.float32)
            self.recog_session.run([self.recog_output_name], {self.recog_input_name: dummy_input})
        log.info("Model warmup completed")
    ...
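A note on that snippet: warming every width from 300 to 1500 means about 1200 extra runs. If the preprocessing only produces widths that are multiples of a fixed stride (the reproduction script above rounds to multiples of 32), warming just those widths covers the same range with far fewer runs. Below is a standalone sketch of that idea; the function name, stride, and size range are illustrative, and the session and tensor names are passed in rather than taken from the class above.

import numpy as np
from tqdm import tqdm

def warmup_strided(session, input_name, output_name,
                   batch_size=1, min_size=128, max_size=1024, stride=32):
    # One dummy inference per width the pipeline can actually produce,
    # so real inputs later arrive with shapes the session has already seen.
    for w in tqdm(range(min_size, max_size + 1, stride), desc="Warming up model"):
        dummy = np.random.randn(batch_size, 3, 48, w).astype(np.float32)
        session.run([output_name], {input_name: dummy})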
CDboyOne commented 1 month ago

You can try to run the same shape 10 times and discard the time from the first run. Your number should be comparable to the static ones. If you keep changing the shape for each run, a lot of cached data will be invalidated and rebuilt.

@ytaous How many shape caches will ORT preserve for each model? For example, if I have one model and run inference 10 times with different input shapes, will ORT keep only the cache for the most recent shape, the last N shapes, or is this decided by some other algorithm?