microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] onnxruntime allocates lots of cuda memory on T4 #15098

Open knitvoger opened 1 year ago

knitvoger commented 1 year ago

Describe the issue

The CUDA memory that ORT allocates when creating a session is much larger than the model size on T4. Please check the table below:

| GPU | Model size | CUDA memory at session creation |
| --- | --- | --- |
| T4 | 70 KB | 376 MB |
| K80 | 70 KB | 173 MB |

Loading a 70 KB model needs 376 MB of CUDA memory on T4 but only 173 MB on K80. Why is the memory cost on T4 so much higher than on K80?

I tried setting cudnn_conv_use_max_workspace to false, but this did not reduce the memory usage.

The environments on T4 and K80 are exactly the same: ORT 1.14.1 and CUDA 11.

This is my test code:

#include <onnxruntime_cxx_api.h>
#include <onnxruntime_session_options_config_keys.h>
#include <cassert>
#include <cstdlib>
#include <vector>

int main(int argc, char *argv[])
{
    Ort::Env *env = new Ort::Env(ORT_LOGGING_LEVEL_ERROR, "onnx env");
    Ort::SessionOptions session_options;
    session_options.SetIntraOpNumThreads(1);
    session_options.SetInterOpNumThreads(1);
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    session_options.SetExecutionMode(ORT_SEQUENTIAL);
    session_options.DisableMemPattern();

    // Configure the CUDA execution provider. Each C API call returns an
    // OrtStatus*; nullptr means success.
    OrtCUDAProviderOptionsV2* cuda_options_v2 = nullptr;
    assert(Ort::GetApi().CreateCUDAProviderOptions(&cuda_options_v2) == nullptr);
    std::vector<const char*> keys{"device_id", "gpu_mem_limit", "arena_extend_strategy", "cudnn_conv_use_max_workspace"};
    std::vector<const char*> values{"0", "2147483648", "kSameAsRequested", "0"};
    assert(Ort::GetApi().UpdateCUDAProviderOptions(cuda_options_v2, keys.data(), values.data(), values.size()) == nullptr);
    assert(Ort::GetApi().SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options_v2) == nullptr);
    // The options are copied into session_options, so they can be released now.
    Ort::GetApi().ReleaseCUDAProviderOptions(cuda_options_v2);

    // Creating the session is what triggers the CUDA allocation in question.
    Ort::Session *session = new Ort::Session(*env, "70kb.onnx", session_options);

    // Print GPU memory usage after session creation.
    std::system("nvidia-smi");
    return 0;
}
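
If nvidia-smi is too coarse, the same measurement can be done in-process with the CUDA runtime's cudaMemGetInfo (a rough sketch, assuming the program is also linked against cudart; the helper name is my own):

#include <cuda_runtime_api.h>  // cudaMemGetInfo; link with -lcudart
#include <cstdio>

// Returns the free device memory in bytes on the current CUDA device.
static size_t FreeDeviceBytes()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    return free_bytes;
}

// Usage around session creation:
//   size_t before = FreeDeviceBytes();
//   Ort::Session session(*env, "70kb.onnx", session_options);
//   std::printf("session allocated ~%zu MB\n", (before - FreeDeviceBytes()) >> 20);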

Model - model.zip

To reproduce

Please run the code above with the model I attached.

Urgency

The large memory usage on T4 means we can only run a few models on each T4. We need many T4s to run all our models, while each T4's GPU utilization is only around 30%. This is a big cost for our service.

Platform

Linux

OS Version

Ubuntu 18.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11

Model File

No response

Is this a quantized model?

No

tianleiwu commented 1 year ago

T4 has tensor cores, so cuDNN has more convolution algorithms to choose from on it. Different convolution algorithms use different workspace sizes.

You can try tuning a few parameters, gpu_mem_limit, cudnn_conv_algo_search and cudnn_conv1d_pad_to_nc1d, to see how memory usage and performance change.

BTW, your model is very simple, which means you should be able to use a larger batch size on T4 than on K80.
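
For example, something like this (a rough sketch based on the CUDA EP options documented for ORT 1.14; adjust the values to your setup, and note cudnn_conv_algo_search accepts EXHAUSTIVE, HEURISTIC or DEFAULT):

// Sketch: drops into the session-options setup from the repro code above.
// HEURISTIC picks a conv algorithm via a lightweight heuristic instead of
// benchmarking every candidate (EXHAUSTIVE, the default); padding conv1d to
// NC1D can also change which kernels/workspaces cuDNN selects.
OrtCUDAProviderOptionsV2* cuda_options = nullptr;
assert(Ort::GetApi().CreateCUDAProviderOptions(&cuda_options) == nullptr);
std::vector<const char*> keys{"device_id", "gpu_mem_limit", "cudnn_conv_algo_search", "cudnn_conv1d_pad_to_nc1d"};
std::vector<const char*> values{"0", "2147483648", "HEURISTIC", "1"};
assert(Ort::GetApi().UpdateCUDAProviderOptions(cuda_options, keys.data(), values.data(), values.size()) == nullptr);
assert(Ort::GetApi().SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options) == nullptr);
Ort::GetApi().ReleaseCUDAProviderOptions(cuda_options);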

knitvoger commented 1 year ago

Thanks @tianleiwu. I have tried those parameters, but they don't change the memory usage at all. Are there any other parameters I can try?