microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

A runtime can run on cuda device 0 but fail on cuda device 1 #14710

Open 1049451037 opened 1 year ago

1049451037 commented 1 year ago

Describe the issue

I have an ONNX file that runs normally on CUDA device 0, but raises the following error when I run it on device 1:

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Einsum node. Name:'/input_blocks.4/input_blocks.4.1/transformer_blocks.0/attn1/Einsum' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/einsum_utils/einsum_auxiliary_ops.cc:298 std::unique_ptr<onnxruntime::Tensor> onnxruntime::EinsumOp::Transpose(const onnxruntime::Tensor&, const onnxruntime::TensorShape&, const gsl::span<const long unsigned int>&, onnxruntime::AllocatorPtr, void*, const Transpose&) 21Einsum op: Transpose failed: CUDA failure 1: invalid argument ; GPU=1 ; hostname=3eee3adbcb74 ; expr=cudaMemcpyAsync(output.MutableDataRaw(), input.DataRaw(), input.Shape().Size() * input.DataType()->Size(), cudaMemcpyDeviceToDevice, stream); 

2023-02-16 06:44:53.137742920 [E:onnxruntime:Default, cuda_call.cc:119 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=1 ; hostname=3eee3adbcb74 ; expr=cudaLaunchHostFunc(static_cast<cudaStream_t>(GetHandle()), ReleaseCpuBufferCallback, cpu_buffers_info.release()); 
2023-02-16 06:44:53.142779409 [E:onnxruntime:Default, cuda_call.cc:119 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=1 ; hostname=3eee3adbcb74 ; expr=cudnnDestroy(cudnn_handle_);

I also tried running on device 2; the same error occurs.

To reproduce

Download the following onnx: https://cloud.tsinghua.edu.cn/f/4f0a921584564e45be6d/?dl=1

Run it with Python:

import numpy as np
import onnxruntime as ort

def inference_onnx(input_0, input_1, input_2, input_3):
    # Place the model on CUDA device 1, with CPU fallback for unsupported ops.
    ort_sess = ort.InferenceSession(
        'onnx/EfficientUNetModel_.onnx',
        providers=[('CUDAExecutionProvider', {'device_id': 1}),
                   'CPUExecutionProvider'])
    outputs = ort_sess.run(None, {
        'input_0': input_0,
        'input_1': input_1,
        'input_2': input_2,
        'input_3': input_3,
    })
    return outputs[0]

inference_onnx(
    np.random.rand(16, 3, 256, 256).astype('f'),
    np.array([1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3], dtype=np.int32),
    np.random.rand(16, 64, 640).astype('f'),
    np.random.rand(16, 81, 640).astype('f'),
)

Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

stevenlix commented 1 year ago

The model seems large and may not fit in a 16GB GPU. What GPU and how much memory are you using?

1049451037 commented 1 year ago

I'm running on a 24GB RTX 3090.

riyaj8888 commented 1 year ago

Any update on this? I am facing the same error: I can run the ONNX model on GPU 0 but can't run it on GPU 1.

[E:onnxruntime:log, cuda_call.cc:119 CudaCall] CUDA failure 1: invalid argument ; GPU=1 ; hostname=90944f8d90dc ; expr=cudaMemcpyAsync(output.MutableDataRaw(), input.DataRaw(), input.Shape().Size() * input.DataType()->Size(), cudaMemcpyDeviceToDevice, stream);
2023-07-20 09:23:03.925983174 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running Einsum node. Name:'/model/layer.0/rel_attn/Einsum_8' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cpu/math/einsum_utils/einsum_auxiliary_ops.cc:298 std::unique_ptr<onnxruntime::Tensor> onnxruntime::EinsumOp::Transpose(const onnxruntime::Tensor&, const onnxruntime::TensorShape&, const gsl::span<const long unsigned int>&, onnxruntime::AllocatorPtr, void*, const Transpose&) 21Einsum op: Transpose failed: CUDA failure 1: invalid argument ; GPU=1 ; hostname=90944f8d90dc ; expr=cudaMemcpyAsync(output.MutableDataRaw(), input.DataRaw(), input.Shape().Size() * input.DataType()->Size(), cudaMemcpyDeviceToDevice, stream);