knitvoger opened this issue 1 year ago
T4 has tensor cores, so it has more choices of convolution algorithms in cuDNN, and different convolution algorithms use different workspace sizes. You can try tuning a few parameters: gpu_mem_limit, cudnn_conv_algo_search, and cudnn_conv1d_pad_to_nc1d, and see what changes in memory usage and performance.
BTW, your model is very simple, which means you can use a larger batch size on T4 than on K80.
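For reference, a minimal sketch of setting those options through the C++ API; the option values and the `model.onnx` path here are illustrative placeholders, not taken from this issue:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
  Ort::SessionOptions session_options;

  // Create the V2 CUDA provider options (available in ORT 1.14).
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateCUDAProviderOptions(&cuda_options));

  // Illustrative values: 2 GB arena limit, heuristic cuDNN algo search,
  // and NC1D padding for 1-D convolutions.
  const char* keys[] = {"gpu_mem_limit", "cudnn_conv_algo_search",
                        "cudnn_conv1d_pad_to_nc1d"};
  const char* values[] = {"2147483648", "HEURISTIC", "1"};
  Ort::ThrowOnError(
      Ort::GetApi().UpdateCUDAProviderOptions(cuda_options, keys, values, 3));

  session_options.AppendExecutionProvider_CUDA_V2(*cuda_options);
  Ort::Session session(env, "model.onnx", session_options);  // placeholder path

  Ort::GetApi().ReleaseCUDAProviderOptions(cuda_options);
  return 0;
}
```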
Thanks @tianleiwu. I have tried those parameters, but they don't change the memory usage at all. Are there any other parameters I can try?
Describe the issue
The CUDA memory that ORT allocates when creating a session is much larger than the model size on T4. Please check the table below. Why is the memory cost on T4 so much higher than on K80?

| GPU | Model size | CUDA memory for session creation |
|-----|------------|----------------------------------|
| T4  | 70 KB      | 376 MB                           |
| K80 | 70 KB      | 173 MB                           |
I tried setting cudnn_conv_use_max_workspace to false, but this does not reduce the memory usage.
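Using the same UpdateCUDAProviderOptions pattern as in the snippet above, the flag is passed as a string key/value; shown here only as a sketch:

```cpp
// "0" limits the cuDNN conv workspace; the default ("1") allows the maximum.
const char* keys[] = {"cudnn_conv_use_max_workspace"};
const char* values[] = {"0"};
Ort::ThrowOnError(
    Ort::GetApi().UpdateCUDAProviderOptions(cuda_options, keys, values, 1));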
The environments on T4 and K80 are exactly the same: ORT 1.14.1 and CUDA 11.
This is my test code
Model - model.zip
To reproduce
Please run the code and the model I attached.
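The attached test code is not inlined in this issue. A minimal sketch of an equivalent measurement, assuming a `model.onnx` path and using `cudaMemGetInfo` to read free device memory before and after session creation, would look roughly like this:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <onnxruntime_cxx_api.h>

int main() {
  // First CUDA call creates the context, so the delta measured below
  // excludes context overhead and approximates ORT's session allocations.
  size_t free_before = 0, free_after = 0, total = 0;
  cudaMemGetInfo(&free_before, &total);

  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "memtest");
  Ort::SessionOptions session_options;
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateCUDAProviderOptions(&cuda_options));
  session_options.AppendExecutionProvider_CUDA_V2(*cuda_options);

  Ort::Session session(env, "model.onnx", session_options);  // placeholder path

  cudaMemGetInfo(&free_after, &total);
  std::printf("Session memory: %.1f MB\n",
              (double)(free_before - free_after) / (1024.0 * 1024.0));

  Ort::GetApi().ReleaseCUDAProviderOptions(cuda_options);
  return 0;
}
```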
Urgency
The large memory usage on T4 means we can only run a few models on each T4. We need many T4s to run all our models, yet each T4's GPU utilization is only around 30%. This is a big cost for our service.
Platform
Linux
OS Version
Ubuntu 18.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11
Model File
No response
Is this a quantized model?
No