microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Data size of Batch Normalization using cuDNN in inference. #17406

Open chester1uo opened 1 year ago

chester1uo commented 1 year ago

Describe the issue

Hello, when I process data with shape (70000, 16) during inference, I get this error message:

2023-09-04 10:54:38.941019182 [E:onnxruntime:Model, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
2023-09-04 10:54:38.941211047 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
terminate called after throwing an instance of 'Ort::Exception'
  what():  Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);

It looks like the cuDNN-backed BatchNormalization implementation in onnxruntime doesn't support data of this size. I tried reducing the number of channels, but that didn't help; when I reduced the number of elements per channel to below 50000, it worked.

Is there an exact limit on the data size when using batch normalization? And how can I work around it if I really need to run inference on data at this scale?

To reproduce

This issue occurs with both the C++ and Python APIs.

Here is one case in Python:

import numpy as np
import torch
import torch.nn as nn
import onnx
import onnxruntime as ort

# 1. Generate random data
data = np.random.rand(68000, 16).astype(np.float32)

# 2. Define a simple model in PyTorch with a batch normalization layer
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.batch_norm = nn.BatchNorm1d(16)

    def forward(self, x):
        return self.batch_norm(x)

model = SimpleModel().cuda()

# Convert data to PyTorch tensor
tensor_data = torch.tensor(data).cuda()

# 3. Export the PyTorch model to ONNX format
torch.onnx.export(model, tensor_data, "simple_model.onnx", verbose=True, input_names=['input'], output_names=['output'])

# 4. Perform inference using ONNX Runtime
ort_session = ort.InferenceSession("simple_model.onnx", providers=['CUDAExecutionProvider'])

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(tensor_data)}
ort_outs = ort_session.run(None, ort_inputs)

print(ort_outs[0])

Output:

graph(%input : Float(68000, 16, strides=[16, 1], requires_grad=0, device=cuda:0),
      %batch_norm.weight : Float(16, strides=[1], requires_grad=1, device=cuda:0),
      %batch_norm.bias : Float(16, strides=[1], requires_grad=1, device=cuda:0),
      %batch_norm.running_mean : Float(16, strides=[1], requires_grad=0, device=cuda:0),
      %batch_norm.running_var : Float(16, strides=[1], requires_grad=0, device=cuda:0)):
  %output : Float(68000, 16, strides=[16, 1], requires_grad=1, device=cuda:0) = onnx::BatchNormalization[epsilon=1.0000000000000001e-05, momentum=0.90000000000000002](%input, %batch_norm.weight, %batch_norm.bias, %batch_norm.running_mean, %batch_norm.running_var) # /home/dell/anaconda3/envs/pointseg/lib/python3.8/site-packages/torch/nn/functional.py:2282:0
  return (%output)

2023-09-04 11:03:58.430848248 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
2023-09-04 11:03:58.430958361 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
Traceback (most recent call last):
  File "/home/dell/CLionProjects/NewSpconvOp/test_cases/bn_test.py", line 34, in <module>
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/dell/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);

Urgency

No response

Platform

Linux

OS Version

Ubuntu 18.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.15

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4 RTX 3060

Model File

No response

Is this a quantized model?

No

skottmckay commented 1 year ago

Most likely it's a limitation of the cuDNN cudnnBatchNormalizationForwardInference function that we're passing data through to.

One approach would be to split the input data, call BatchNormalization for each chunk, and merge the output.

That produces equal results when done in Python using the ONNX unit test definition of batch norm (see here):

def _batchnorm_test_mode(x, s, bias, mean, var, epsilon=1e-5):  # type: ignore
    dims_x = len(x.shape)
    dim_ones = (1,) * (dims_x - 2)
    s = s.reshape(-1, *dim_ones)
    bias = bias.reshape(-1, *dim_ones)
    mean = mean.reshape(-1, *dim_ones)
    var = var.reshape(-1, *dim_ones)
    return s * (x - mean) / np.sqrt(var + epsilon) + bias

# 1. Generate random data
data = np.random.rand(68000, 16).astype(np.float32)

scale = np.random.randn(16).astype(np.float32)
bias = np.random.randn(16).astype(np.float32)
mean = np.random.randn(16).astype(np.float32)
var = np.random.rand(16).astype(np.float32)

# run with whole batch
a = _batchnorm_test_mode(data, scale, bias, mean, var).astype(np.float32)

# split, call batch norm for each chunk, and concat the results
data0, data1 = np.split(data, 2)
b0 = _batchnorm_test_mode(data0, scale, bias, mean, var).astype(np.float32)
b1 = _batchnorm_test_mode(data1, scale, bias, mean, var).astype(np.float32)
b = np.concatenate((b0, b1))

# validate
print(np.array_equal(a, b))
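
For completeness, here is a minimal sketch of the same workaround applied through the ONNX Runtime Python API rather than in NumPy. It assumes the simple_model.onnx from the repro above, re-exported with a dynamic batch dimension (e.g. dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}}); with the fixed 68000-row input shape, ORT would reject differently sized chunks. Since inference-mode BatchNormalization uses fixed running statistics, each row is processed independently, so splitting the batch and concatenating the outputs should be exact for this model. The run_in_chunks helper and the 50000-row chunk size are illustrative choices, not part of the original report.

import numpy as np
import onnxruntime as ort

# Illustrative helper: run the session in row chunks so each call stays below
# the size that triggers the cuDNN failure, then concatenate the outputs.
def run_in_chunks(session, input_name, data, max_rows=50000):
    outputs = []
    for start in range(0, data.shape[0], max_rows):
        chunk = data[start:start + max_rows]
        outputs.append(session.run(None, {input_name: chunk})[0])
    return np.concatenate(outputs, axis=0)

# Assumes simple_model.onnx was exported with a dynamic batch dimension (see above).
ort_session = ort.InferenceSession("simple_model.onnx", providers=["CUDAExecutionProvider"])
input_name = ort_session.get_inputs()[0].name

data = np.random.rand(70000, 16).astype(np.float32)
result = run_in_chunks(ort_session, input_name, data)
print(result.shape)  # (70000, 16)
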
chester1uo commented 1 year ago

Yes, splitting the input data is a solution. However, if I need to deploy a complex model in C++, it will be difficult because there are many normalization operations, and I am not sure whether this also works in C++, since it looks like I would need to implement a new batch normalization myself. Can this limitation be fixed in a future version?

skottmckay commented 1 year ago

I believe the limitation comes from cuDNN, so that would be a question for NVIDIA: https://developer.nvidia.com/cudnn or cuDNN@nvidia.com.

chester1uo commented 1 year ago

I went to NVIDIA's website and asked for help, and here is the information I collected: there is indeed a limit on the data scale. It applies to the batch size, and the value is 65535. References:
https://github.com/pytorch/pytorch/issues/16998
https://github.com/MegEngine/MegEngine/issues/437

I varied the tensor shape in the Python code above and found that 65535 is the threshold value:

data = np.random.rand(65535, 16).astype(np.float32)  # Works
data = np.random.rand(65536, 16).astype(np.float32)  # Error

However, if I use data = np.random.rand(16, 65536).astype(np.float32), it also works.

This confuses me: I am not sure whether ONNX Runtime interprets the tensor shape correctly, or whether the two cases end up with the same tensor layout. I think that when a tensor of shape (elements per channel, channels) is passed, the first dimension may be treated as the batch size.
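
To make that reasoning concrete, here is a small sketch using the same reference formula as the unit-test helper earlier in the thread. It assumes (per the ONNX BatchNormalization spec for 2-D inputs) that dim 0 is the batch axis and dim 1 the channel axis; the per-channel parameters broadcast over dim 0, so for a (rows, channels) input the row count is what cuDNN sees as the batch size it caps at 65535. This only illustrates the dimension roles, not the cuDNN internals.

import numpy as np

rows, channels = 70000, 16
x = np.random.rand(rows, channels).astype(np.float32)
scale = np.random.randn(channels).astype(np.float32)
bias = np.random.randn(channels).astype(np.float32)
mean = np.random.randn(channels).astype(np.float32)
var = np.random.rand(channels).astype(np.float32)

# The per-channel parameters broadcast across all 70000 rows: every row is
# normalized with the same statistics, so the row count only matters as the
# batch size handed to cuDNN.
y = scale * (x - mean) / np.sqrt(var + 1e-5) + bias
print(y.shape)  # (70000, 16)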