microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

One ONNX graph will result in an "UpdateState bias should be 1D" Error on GPU but success on CPU #11241

Open maybeLee opened 2 years ago

maybeLee commented 2 years ago

Describe the bug
The graph that results in this issue (it can be accessed through the link in "To Reproduce" below): [image: visualization of the ONNX graph]

When I run this graph in CPU mode it works normally with no error, but when I run it in GPU mode it throws a runtime error as follows:

/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py in run(self, output_names, input_feed, run_options)
    190             output_names = [output.name for output in self._outputs_meta]
    191         try:
--> 192             return self._sess.run(output_names, input_feed, run_options)
    193         except C.EPFail as err:
    194             if self._enable_fallback:

Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running FusedConv node. Name:'model_44/conv2d_37/Conv2D' Status Message: conv.cc:223 UpdateState bias should be 1D

To Reproduce

Please first download the ONNX graph from https://drive.google.com/file/d/1mHUJuhjUxGjjWmpwwXe8NS-jOXhEts4y/view?usp=sharing, then run the following code:

import numpy as np
import onnxruntime as ort

# Create one session with the CUDA EP first and the CPU EP as fallback.
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 10 * 1024 * 1024 * 1024,  # 10 GB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    }),
    'CPUExecutionProvider',
]
onnx_path = "bug.onnx"
model = ort.InferenceSession(onnx_path, providers=providers)

# Random NHWC input matching the graph's expected shape.
input = np.random.rand(10, 36, 36, 528).astype('float32')
input_name = model.get_inputs()[0].name
output_name = model.get_outputs()[0].name

# Run with the CUDA EP: this is where the FusedConv error is raised.
pred = model.run([output_name], {input_name: input})[0]
print("When running on GPU: ", pred[0])

# Switch the same session to CPU only and run again: this succeeds.
model.set_providers(['CPUExecutionProvider'])
pred = model.run([output_name], {input_name: input})[0]
print("When running on CPU: ", pred[0])

System information

Expected behavior
ONNX Runtime should behave the same in GPU and CPU contexts, and in this case I think the graph should also run normally in GPU mode.
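
For concreteness, the expectation could be written as a parity check along these lines (just a sketch that would replace the two run/print blocks above, reusing model, input_name, output_name and input; the tolerances are illustrative, not values guaranteed by ONNX Runtime):

import numpy as np

# Run once with the CUDA EP, then switch the same session to CPU and compare.
gpu_pred = model.run([output_name], {input_name: input})[0]
model.set_providers(['CPUExecutionProvider'])
cpu_pred = model.run([output_name], {input_name: input})[0]

# The two providers should agree up to small floating-point differences.
np.testing.assert_allclose(gpu_pred, cpu_pred, rtol=1e-3, atol=1e-5)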

Jinming-Su commented 2 years ago

I have the same problem. Have you solved this?

yunho-c commented 2 years ago

Encountered the same problem while converting PiDiNet into ONNX.

EdVince commented 2 years ago

I think the error message tells us everything: it happened "while running FusedConv node" and the complaint is "bias should be 1D". In this graph it may not be obvious, but the "Add" operator is actually a bias computation. The name "FusedConv" tells us that onnxruntime fused the graph, i.e. it tried to fuse the bias and the Conv together, which is a very common graph optimization. In the traditional sense of fusing a bias into a Conv, the bias is 1-D data coming from the Conv itself (for example the bias produced by torch.nn.Conv2d(bias=True)). In this graph, however, the tensor being added is 4-D, so it is not a true conv bias.

The solution is simple: since the error appears in the "FusedConv" path, if we don't fuse, we won't hit the error. Consulting onnxruntime's docs, a few extra lines that lower the graph optimization level disable the fusion and make the model run successfully:

sess_options = ort.SessionOptions()
# ORT_ENABLE_BASIC keeps only the basic graph optimizations, so the Conv+Add
# fusion that produces the failing FusedConv node is not applied.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
model = ort.InferenceSession(onnx_path, sess_options=sess_options, providers=providers)
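
If you want to confirm this on the graph itself, a quick inspection with the onnx package shows the shape of the tensor being added (just a sketch: it only covers Add operands stored as initializers, which may not match how bug.onnx stores its constants):

import onnx
from onnx import numpy_helper

m = onnx.load("bug.onnx")
constants = {t.name: numpy_helper.to_array(t) for t in m.graph.initializer}

# Print the shape of every constant tensor fed into an Add node; a 4-D one
# is what the Conv+Add fusion mistakes for a conv bias.
for node in m.graph.node:
    if node.op_type == "Add":
        for name in node.input:
            if name in constants:
                print(node.name or "<unnamed Add>", constants[name].shape)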

SinnCyann commented 1 year ago

> I have the same problem. Have you solved this?

Adding a bias to the Conv in the original model can solve this problem.
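
In case it helps, here is roughly what that suggestion looks like if the original model is Keras (the node names such as model_44/conv2d_37/Conv2D hint at that; the layer parameters below are made up for illustration). When the Conv layer owns its bias, the exported ONNX Conv node carries a 1-D B input instead of a separate broadcasted Add, so there is no 4-D tensor for the fusion to mistake for a bias:

import tensorflow as tf

inputs = tf.keras.Input(shape=(36, 36, 528))
# Instead of a bias-free Conv2D followed by a separate Add of a broadcasted
# constant, let the convolution carry its own per-channel (1-D) bias.
outputs = tf.keras.layers.Conv2D(filters=32, kernel_size=3,
                                 padding="same", use_bias=True)(inputs)
model = tf.keras.Model(inputs, outputs)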

whateverforever commented 11 months ago

Same issue here, thanks for the workaround. It seems like a fix has been merged, so this can be closed, no?