microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License
14.11k stars 2.84k forks

[Performance] RE: When using CUDA the first run is very slow -- cudnn_conv_algo_search #19838

Open hmaarrfk opened 6 months ago

hmaarrfk commented 6 months ago

Describe the issue

I didn't want to reply to https://github.com/microsoft/onnxruntime/issues/10746 since it was mentioned that the issue is a placeholder.

I wanted to say that in our work, we've found that issue omits a critical piece of information regarding the effect of cudnn_conv_algo_search on the performance of the first run.

The default value, EXHAUSTIVE, as mentioned in the C API and the Python documentation, seems to be a significant contributor to this effect. It would be good if a small note were added to that placeholder issue mentioning that users have a choice of session optimization strategy.

Thank you @davidmezzetti for bringing this to my attention in your blog post https://medium.com/neuml/debug-onnx-gpu-performance-c9290fe07459

cc: @jefromson

To reproduce

Start your ONNX Runtime session with the following providers and switch between the different options for cudnn_conv_algo_search:

    providers=[
        ("CUDAExecutionProvider", {
            # "cudnn_conv_algo_search": "DEFAULT",
            # "cudnn_conv_algo_search": "HEURISTIC",
            "cudnn_conv_algo_search": "EXHAUSTIVE",
        }),
        # "CPUExecutionProvider",
    ]
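For completeness, a minimal sketch of how these options plug into session creation. The model path and input shape here are placeholders, not part of the original report, and the session lines assume onnxruntime-gpu is installed:

```python
# Options for the CUDA execution provider; "EXHAUSTIVE" is the default.
cuda_options = {
    # "cudnn_conv_algo_search": "DEFAULT",    # cuDNN's default algorithm choice
    # "cudnn_conv_algo_search": "HEURISTIC",  # pick via cuDNN heuristics
    "cudnn_conv_algo_search": "EXHAUSTIVE",   # benchmark all algorithms (default)
}
providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

# With onnxruntime-gpu installed, the session is then created as:
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=providers)
```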

Urgency

Just a small tip for others.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.0

Model File

No response

Is this a quantized model?

No

hmaarrfk commented 6 months ago

To make this easy, a sentence could be added:

Even when ONNX Runtime is pre-built with binary code for your GPU architecture, by default the CUDA execution provider will perform an exhaustive search for the best-performing cuDNN convolution algorithm. This is controlled by the cudnn_conv_algo_search parameter, which can be specified at session creation time. See LINK TO YOUR CHOSEN DOCUMENTATION for more information.

hariharans29 commented 6 months ago

Thanks for the feedback, @hmaarrfk. It is good to document this. Please keep in mind, though, that the slowness of the first Run() may not be limited to just this. The allocations needed to grow the underlying memory pool can also cause the first Run() to be slower than subsequent runs. A good practice is to do a few warm-up Runs on the session instance with representative inputs before the "real" Runs.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

hmaarrfk commented 5 months ago

Not stale; the doc was not updated.

spoorgholi74 commented 2 months ago

@hmaarrfk I faced a similar issue.

What does the "cudnn_conv_algo_search" flag actually do? I also notice it is slow for the first few runs (up to 100), and then "EXHAUSTIVE" is just a tiny bit faster.

jefromson commented 2 months ago

@spoorgholi74 it is trying various convolution algorithms to choose the fastest; it needs to run each one once to time it before choosing which to use overall.

We found that of the three options (EXHAUSTIVE, DEFAULT, and HEURISTIC), HEURISTIC is the fastest and yields great results.

slashedstar commented 2 weeks ago

This solved a problem I was having, but it still leaves me wondering: shouldn't it cache the results of the exhaustive search instead of performing it on every run? After reading https://github.com/microsoft/onnxruntime/issues/10746 I tried setting os.environ["CUDA_CACHE_MAXSIZE"] = "4294967296" and CUDA_CACHE_PATH to a known path, but nothing worked (the default CUDA_CACHE_PATH was probably fine too, since it already contained files from other programs).
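For anyone trying the same thing: as far as I understand, the CUDA_CACHE_* variables control NVIDIA's JIT compilation cache (compiled PTX), not ONNX Runtime's cuDNN algorithm-search results, which would explain why they don't help here. They also need to be set before any CUDA context is created. A sketch, with a placeholder cache path:

```python
import os

# CUDA_CACHE_* affect NVIDIA's JIT compilation cache, not the results of
# the cuDNN convolution-algorithm search. Set them before importing any
# library that initializes CUDA.
os.environ["CUDA_CACHE_MAXSIZE"] = "4294967296"    # 4 GiB
os.environ["CUDA_CACHE_PATH"] = "/tmp/cuda-cache"  # placeholder path

# import onnxruntime as ort  # only after the environment is configured
```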