[Inference] ONNXRuntime inference crash on GPU without any log but not on CPU

leoc70 commented 1 year ago

Describe the issue

I trained a PyTorch image segmentation model in Python and converted it to an ONNX model. Inference in Python works on both CPU and GPU. In my C# application (.NET 6), inference on CPU is fine, but when I try to run it on the GPU my application crashes without any exception.

The only trace is an event in the Windows 10 Event Viewer:

Faulting application name: DeepLearningONNX.exe, version: 1.0.0.0, time stamp: 0x6331eb0e
Faulting module name: cudnn64_8.dll, version: 6.14.11.6050, time stamp: 0x62e9c226
Exception code: 0xc0000409
Fault offset: 0x000000000001420d
Faulting process id: 0x2cc0
Faulting application start time: 0x01d8f830aac6f0a2
Faulting application path: C:\R&D\DeepLearningONNX\DeepLearningONNX\bin\x64\Debug\net6.0-windows\DeepLearningONNX.exe
Faulting module path: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\cudnn64_8.dll
Report Id: 40803e1a-e84d-4645-bfb6-4ebbb6ba1b78
Faulting package full name:
Faulting package-relative application ID:
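
The event record names the faulting module and an NTSTATUS exception code, which are usually the two most useful fields when a native crash leaves no managed exception. The following is a minimal sketch, not part of the original report, that pulls both out of the event text. The helper name `summarize_crash` and the small `KNOWN_CODES` table are illustrative assumptions; the table lists only a few well-known codes. Code 0xC0000409 is typically raised when a process terminates itself through a fail-fast path, which may be why no C# exception was observed.

```python
import re

# A few well-known NTSTATUS values; this table is illustrative, not exhaustive.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN (also used for fail-fast termination)",
}


def summarize_crash(event_text):
    """Extract the faulting module and exception code from an Event Viewer record.

    Hypothetical helper: it only understands the 'Faulting module name' and
    'Exception code' labels that appear in the record quoted above.
    """
    module = re.search(r"Faulting module name:\s*([^,\s]+)", event_text)
    code = re.search(r"Exception code:\s*(0x[0-9a-fA-F]+)", event_text)
    code_value = int(code.group(1), 16) if code else None
    return {
        "module": module.group(1) if module else None,
        "code": code_value,
        "meaning": KNOWN_CODES.get(code_value, "unknown"),
    }


record = (
    "Faulting module name: cudnn64_8.dll, version: 6.14.11.6050, "
    "time stamp: 0x62e9c226 Exception code: 0xc0000409 "
    "Fault offset: 0x000000000001420d"
)
summary = summarize_crash(record)
print(summary["module"], hex(summary["code"]), summary["meaning"])
```

A faulting module inside the CUDA toolkit directory points at the native dependency chain rather than at the C# code that called `Run()`.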

I installed CUDA v11.6, extracted cuDNN v8.5.0.96, and added the following system environment variables:

CUDA_PATH : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
CUDA_PATH_V11_6 : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
PATH : C:\Program Files\NVIDIA\CUDNN\v8.5;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\libnvvp

To reproduce

My models in PyTorch and ONNX format: https://github.com/leoc70/ONNXRuntime-model-debug

Here is my code in Python and C#

Python 3.10 64bit

import torch # version 1.12.1+cu116
from torch import nn
import segmentation_models_pytorch as smp
from segmentation_models_pytorch.losses import DiceLoss

class SegmentationModel(nn.Module):
  def __init__(self):
    super(SegmentationModel, self).__init__()

    self.arc = smp.UnetPlusPlus(encoder_name= 'timm-efficientnet-b0',
                        encoder_weights='imagenet',
                        in_channels= 3,
                        classes = 1,
                        activation=None)

  def forward(self,images, masks=None):
    logits = self.arc(images)

    if masks is not None:
      loss1 = DiceLoss(mode='binary')(logits, masks)
      loss2 = nn.BCEWithLogitsLoss()(logits, masks)
      return logits, loss1+loss2

    return logits

modelPath = "D:/model.pt"
device = "cuda"  # input("Enter device (cpu or cuda) : ")
model = SegmentationModel()
model.to(device)
model.load_state_dict(torch.load(modelPath, map_location=torch.device(device)))
model.eval()

dummy_input = torch.randn(1,3,128,128,device=device)

torch.onnx.export(model,         # model being run 
        dummy_input,       # model input (or a tuple for multiple inputs) 
        "model.onnx",       # where to save the model  
        export_params=True,  # store the trained parameter weights inside the model file 
        do_constant_folding=True,  # whether to execute constant folding for optimization 
        input_names = ['modelInput'],   # the model's input names 
        output_names = ['modelOutput'], # the model's output names 
        dynamic_axes={'modelInput' : [0,2,3],    # variable length axes
                    'modelOutput' : [0,2,3]})

C# .NET 6

private void InferenceDebug(string modelPath, bool useGPU)
        {
            InferenceSession session;

            if (useGPU)
            {
                var cudaProviderOptions = new OrtCUDAProviderOptions();
                var providerOptionsDict = new Dictionary<string, string>();
                providerOptionsDict["device_id"] = "0";
                providerOptionsDict["gpu_mem_limit"] = "2147483648";
                providerOptionsDict["arena_extend_strategy"] = "kSameAsRequested";
                providerOptionsDict["cudnn_conv_algo_search"] = "DEFAULT";
                providerOptionsDict["do_copy_in_default_stream"] = "1";
                providerOptionsDict["cudnn_conv_use_max_workspace"] = "1";
                providerOptionsDict["cudnn_conv1d_pad_to_nc1d"] = "1";

                cudaProviderOptions.UpdateOptions(providerOptionsDict);

                SessionOptions options = SessionOptions.MakeSessionOptionWithCudaProvider(cudaProviderOptions);
                session = new InferenceSession(modelPath, options);
            }
            else
                session = new InferenceSession(modelPath);

            int w = 128;
            int h = 128;
            Tensor<float> input = new DenseTensor<float>(new int[] { 1, 3, h, w });
            Random random = new Random(42);

            for (int y = 0; y < h; y++)
            {
                for (int x = 0; x < w; x++)
                {
                    input[0, 0, y, x] = (float)(random.NextDouble() / 255);
                    input[0, 1, y, x] = (float)(random.NextDouble() / 255);
                    input[0, 2, y, x] = (float)(random.NextDouble() / 255);
                }
            }

            var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor<float>("modelInput", input) };
            using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = session.Run(inputs); // The crash is when executing this line
        }

Urgency

I am working on a project written in C#, so inference has to run in C#, although training can stay in Python. If I cannot run inference in C# with good performance (<400 ms), I will have to find another solution.
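
To make the <400 ms requirement measurable, a small timing harness can help compare CPU and GPU runs. The sketch below is only an illustration and not code from this issue: `measure_latency_ms` and `within_budget` are hypothetical helper names, and `run_once` stands for any zero-argument callable that wraps a single inference call. Warm-up calls are excluded on the assumption that the first GPU runs are slower than steady state.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=3, repeats=20):
    """Time a zero-argument callable and return per-call latencies in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def within_budget(samples, budget_ms=400.0):
    """Return True when the median latency is below the budget."""
    return statistics.median(samples) < budget_ms


if __name__ == "__main__":
    # Placeholder workload; replace with a call that runs one inference.
    samples = measure_latency_ms(lambda: sum(range(10_000)))
    print("median ms:", statistics.median(samples))
    print("within 400 ms budget:", within_budget(samples))
```

The median is used rather than the mean so that a single slow call does not decide the outcome; a stricter check could look at a high percentile instead.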

Platform

Windows

OS Version

Windows 10 22H2 OS build 19045.2251

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

nuget Microsoft.ML.OnnxRuntime.Gpu version 1.13.1

ONNX Runtime API

C#

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.6

yuslepukhin commented 1 year ago

I have the same version of CUDA installed, including the cudnn64_8.dll referenced above (version 6.14.11.6050).

I made insignificant modifications to your program and am attaching it here. I targeted netcoreapp3.1 and used version 1.13.1.

But I am not able to reproduce the problem. This is what the program below produces:

D:\dev\data\gh_issue_13658$ D:\dev\data\gh_issue_13658\GPUCrash\bin\Debug\netcoreapp3.1\GPUCrash.exe D:\dev\data\gh_issue_13658\ONNXRuntime-model-debug\model.onnx
Program Finished

using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.OnnxRuntime;
using System;
using System.Collections.Generic;

namespace GPUCrash
{
  internal class Program
  {
    static void Main(string[] args)
    {
      var modelPath = args[0];
      bool useGPU = true;
      InferenceSession session = null;
      if (useGPU)
      {
        var cudaProviderOptions = new OrtCUDAProviderOptions();
        var providerOptionsDict = new Dictionary<string, string>();
        providerOptionsDict["device_id"] = "0";
        providerOptionsDict["gpu_mem_limit"] = "2147483648";
        providerOptionsDict["arena_extend_strategy"] = "kSameAsRequested";
        providerOptionsDict["cudnn_conv_algo_search"] = "DEFAULT";
        providerOptionsDict["do_copy_in_default_stream"] = "1";
        providerOptionsDict["cudnn_conv_use_max_workspace"] = "1";
        providerOptionsDict["cudnn_conv1d_pad_to_nc1d"] = "1";

        cudaProviderOptions.UpdateOptions(providerOptionsDict);

        using var options = SessionOptions.MakeSessionOptionWithCudaProvider(cudaProviderOptions);
        session = new InferenceSession(modelPath, options);
      }
      else
        session = new InferenceSession(modelPath);

      using var sess = session;

      int w = 128;
      int h = 128;
      Tensor<float> input = new DenseTensor<float>(new int[] { 1, 3, h, w });
      Random random = new Random(42);

      for (int y = 0; y < h; y++)
      {
        for (int x = 0; x < w; x++)
        {
          input[0, 0, y, x] = (float)(random.NextDouble() / 255);
          input[0, 1, y, x] = (float)(random.NextDouble() / 255);
          input[0, 2, y, x] = (float)(random.NextDouble() / 255);
        }
      }

      var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor<float>("modelInput", input) };
      using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = session.Run(inputs); // The crash is when executing this line
      System.Console.WriteLine("Program Finished");
    }
  }
}

My recommendation would be to stop in the debugger right before Run() and examine where the native CUDA DLLs are being loaded from.

In my case, the CUDA libraries are loaded from here:

And, what is also very important, the onnxruntime libraries are loaded from the restored 1.13.1 NuGet package, and not from anywhere else.

leoc70 commented 1 year ago

@yuslepukhin I did what you mentioned and found my mistake: I forgot to download the Zlib DLL and add it to my PATH. After that everything ran fine.

yuslepukhin commented 1 year ago

> @yuslepukhin I did what you mentioned and found my mistake: I forgot to download the Zlib DLL and add it to my PATH. After that everything ran fine.

I suggest you review the DLL search order on Windows (and on any other OS you might use). PATH is just one of many things that affect it. In the dev environment I demonstrated above, PATH has nothing to do with it.
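
The point about search order can be illustrated with a rough sketch: the loader walks an ordered list of directories and the first match wins, with the application's own directory checked before anything on PATH. The code below is a simplified approximation under that assumption and is not the real Windows loader, which also consults system directories and other settings. `resolve_library` and `default_search_dirs` are hypothetical helper names, and `zlibwapi.dll` is an assumed file name for the Zlib DLL mentioned above.

```python
import os


def resolve_library(file_name, search_dirs):
    """Return the first directory in search_dirs that contains file_name, or None."""
    for directory in search_dirs:
        if directory and os.path.isfile(os.path.join(directory, file_name)):
            return directory
    return None


def default_search_dirs(app_dir, path_value=None):
    """Approximate order: the application directory first, then each PATH entry.

    Simplification: system directories and other loader rules are ignored.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [app_dir] + [entry for entry in path_value.split(os.pathsep) if entry]


if __name__ == "__main__":
    # Assumption: the executable's output folder is the current directory.
    dirs = default_search_dirs(os.getcwd())
    for name in ("cudnn64_8.dll", "zlibwapi.dll"):  # zlibwapi.dll is an assumed name
        print(name, "->", resolve_library(name, dirs))
```

A result of `None` for a dependency suggests it is not reachable through either location, while a hit in the application directory means PATH was never consulted for that file.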

Then we can spend more time discussing onnxruntime.
