chester1uo opened 1 year ago
Most likely it's a limitation of the cuDNN cudnnBatchNormalizationForwardInference function that we're passing data through to.
One approach would be to split the input data, call BatchNormalization for each chunk, and merge the output.
That approach seems to give equal results when done in Python using the ONNX unit-test definition of batch norm (see here):
import numpy as np

def _batchnorm_test_mode(x, s, bias, mean, var, epsilon=1e-5):  # type: ignore
    dims_x = len(x.shape)
    dim_ones = (1,) * (dims_x - 2)
    s = s.reshape(-1, *dim_ones)
    bias = bias.reshape(-1, *dim_ones)
    mean = mean.reshape(-1, *dim_ones)
    var = var.reshape(-1, *dim_ones)
    return s * (x - mean) / np.sqrt(var + epsilon) + bias
# 1. Generate random data
data = np.random.rand(68000, 16).astype(np.float32)
scale = np.random.randn(16).astype(np.float32)
bias = np.random.randn(16).astype(np.float32)
mean = np.random.randn(16).astype(np.float32)
var = np.random.rand(16).astype(np.float32)
# run with whole batch
a = _batchnorm_test_mode(data, scale, bias, mean, var).astype(np.float32)
# split, call batch norm for each chunk, and concat the results
data0, data1 = np.split(data, 2)
b0 = _batchnorm_test_mode(data0, scale, bias, mean, var).astype(np.float32)
b1 = _batchnorm_test_mode(data1, scale, bias, mean, var).astype(np.float32)
b = np.concatenate((b0, b1))
# validate
print(np.array_equal(a, b))
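The same chunking idea can also be applied outside the model, by splitting the whole input batch before each session.run call and concatenating the outputs afterwards. The sketch below is not from the original discussion; it assumes an existing InferenceSession whose first input dimension is the oversized batch, and that the model processes each row independently (true for BatchNormalization in inference mode):

import numpy as np

def run_in_chunks(sess, input_name, data, max_batch=65535):
    # Split the leading (batch) dimension so no chunk exceeds the cuDNN limit,
    # run the session on each chunk, and stitch the outputs back together.
    n_chunks = int(np.ceil(data.shape[0] / max_batch))
    chunks = np.array_split(data, n_chunks)
    outputs = [sess.run(None, {input_name: chunk})[0] for chunk in chunks]
    return np.concatenate(outputs)

This keeps the model itself unchanged, at the cost of several smaller runs per call.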
Yes, splitting the input data is a solution. However, if I need to deploy a complex model in C++ it will be difficult, because there are lots of normalization operations. I am also not sure whether this works from C++, since it looks like I would need to implement a new batch normalization. Can this limitation be fixed in a future version?
I believe the limitation is coming from cuDNN, so that would be a question for NVIDIA: https://developer.nvidia.com/cudnn or cuDNN@nvidia.com.
I went to NVIDIA's website and asked for help, and here is some information I collected: there is indeed a limit on the data scale, specifically on the batch size, and its value is 65535. References: https://github.com/pytorch/pytorch/issues/16998 https://github.com/MegEngine/MegEngine/issues/437
I varied the tensor shape in the Python code above and found that 65535 is the threshold value:
data = np.random.rand(65535, 16).astype(np.float32)  # works
data = np.random.rand(65536, 16).astype(np.float32)  # error
However, if I use data = np.random.rand(16, 65536).astype(np.float32), it also works.
This confuses me. I am not sure whether ONNX Runtime interprets the tensor shape correctly and whether both orientations use the same tensor layout. I suspect that when a tensor is passed with shape (number of elements per channel, channels), the first dimension is treated as the batch size.
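To pin the threshold down independently of the PyTorch export, one can build a one-node BatchNormalization model directly with onnx.helper and sweep the batch dimension. The following is only a sketch of such an experiment (none of it comes from the original post):

import numpy as np
import onnx
from onnx import TensorProto, helper
import onnxruntime as ort

C = 16
# One-node BatchNormalization graph; the batch dimension is left symbolic.
x = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["N", C])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, ["N", C])
inits = [
    helper.make_tensor("scale", TensorProto.FLOAT, [C], np.ones(C, np.float32).tolist()),
    helper.make_tensor("bias", TensorProto.FLOAT, [C], np.zeros(C, np.float32).tolist()),
    helper.make_tensor("mean", TensorProto.FLOAT, [C], np.zeros(C, np.float32).tolist()),
    helper.make_tensor("var", TensorProto.FLOAT, [C], np.ones(C, np.float32).tolist()),
]
node = helper.make_node("BatchNormalization",
                        ["x", "scale", "bias", "mean", "var"], ["y"])
graph = helper.make_graph([node], "bn_limit_test", [x], [y], inits)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 15)])
onnx.checker.check_model(model)

sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CUDAExecutionProvider"])
for n in (65535, 65536):
    try:
        sess.run(None, {"x": np.random.rand(n, C).astype(np.float32)})
        print(f"batch size {n}: ok")
    except Exception as e:
        print(f"batch size {n}: {type(e).__name__}")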
Describe the issue
Hello, when I process data with shape (70000, 16) during inference, I get the following error message:
2023-09-04 10:54:38.941019182 [E:onnxruntime:Model, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
2023-09-04 10:54:38.941211047 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
terminate called after throwing an instance of 'Ort::Exception'
  what():  Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_2' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
It looks like the BatchNormalization implemented via cuDNN in onnxruntime does not support data of this size. I tried reducing the number of channels, but that did not help; when I reduced the number of data points per channel to below 50000, it worked.
I wonder whether there are exact limits on the data scale when using batch normalization, and how to work around them if I really need data at this scale during inference.
To reproduce
This issue happens with both the C++ and Python APIs.
Here is one case in Python:
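The original script is not captured here; a minimal sketch that is consistent with the graph dump and traceback below (the file name bn.onnx and the module layout are assumptions) would be:

import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

class Model(nn.Module):
    # A single BatchNorm1d(16) layer, matching the exported graph below.
    def __init__(self):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(16)

    def forward(self, x):
        return self.batch_norm(x)

model = Model().cuda().eval()
dummy = torch.rand(68000, 16, device="cuda")
torch.onnx.export(model, dummy, "bn.onnx",
                  input_names=["input"], output_names=["output"], verbose=True)

ort_session = ort.InferenceSession("bn.onnx", providers=["CUDAExecutionProvider"])
ort_inputs = {"input": np.random.rand(68000, 16).astype(np.float32)}
ort_outs = ort_session.run(None, ort_inputs)  # raises CUDNN_STATUS_NOT_SUPPORTED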
Output:
graph(%input : Float(68000, 16, strides=[16, 1], requires_grad=0, device=cuda:0),
      %batch_norm.weight : Float(16, strides=[1], requires_grad=1, device=cuda:0),
      %batch_norm.bias : Float(16, strides=[1], requires_grad=1, device=cuda:0),
      %batch_norm.running_mean : Float(16, strides=[1], requires_grad=0, device=cuda:0),
      %batch_norm.running_var : Float(16, strides=[1], requires_grad=0, device=cuda:0)):
  %output : Float(68000, 16, strides=[16, 1], requires_grad=1, device=cuda:0) = onnx::BatchNormalization[epsilon=1.0000000000000001e-05, momentum=0.90000000000000002](%input, %batch_norm.weight, %batch_norm.bias, %batch_norm.running_mean, %batch_norm.running_var) # /home/dell/anaconda3/envs/pointseg/lib/python3.8/site-packages/torch/nn/functional.py:2282:0
  return (%output)
2023-09-04 11:03:58.430848248 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
2023-09-04 11:03:58.430958361 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
Traceback (most recent call last):
  File "/home/dell/CLionProjects/NewSpconvOp/test_cases/bn_test.py", line 34, in <module>
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/dell/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'BatchNormalization_0' Status Message: CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/nn/batch_norm.cc ; line=159 ; expr=BatchNormalizationForwardInferenceHelper( GetCudnnHandle(p_op_kernel_context), cudnn_batch_normmode, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, vardata, epsilon);
Urgency
No response
Platform
Linux
OS Version
Ubuntu 18.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.15
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.4 RTX 3060
Model File
No response
Is this a quantized model?
No