Use of torch InstanceNorm2d and dynamic tensor size causes crash

mattroos commented 3 years ago

Describe the bug

When I export and then use a model that includes an InstanceNorm2d layer, it often (but not always) crashes when run with a dynamic input width.

Urgency

I'm forced to abandon ONNX and try other methods for accelerating my model.

System information

Linux Ubuntu 16.04
ONNX Runtime installed from source
ONNX Runtime version: 1.7.0
Python version: 3.6.9
Visual Studio version (if applicable): N/A
GCC/Compiler version (if compiling from source): 7.5.0
CUDA/cuDNN version: 11.1
GPU model and memory: RTX2080, 8GB

To reproduce, and expected behavior

Run the code below. It builds, exports, and runs four models. The models use randomly initialized (untrained) weights. Each of the four models is run 100 times on random-noise inputs:
1. Without InstanceNorm2d, without dynamic shape/width (always runs without error)
2. Without InstanceNorm2d, with dynamic shape/width (always runs without error)
3. With InstanceNorm2d, without dynamic shape/width (always runs without error)
4. With InstanceNorm2d, with dynamic shape/width (crashes regularly, but not on every trial)

The output from these four models (100 trials each) is below. The numbers are the image widths used on successive trials. Note that on this particular run, the last model (with InstanceNorm2d and dynamic width) ran successfully on the first trial, with a width of 1180, then crashed on the next trial, with a width of 512. I see no discernible pattern relating data width to when a crash occurs. It happens regularly, so it must depend on something else, such as the data values in the input or the model parameters (which are randomly initialized).

$ python demo_onnx_bug.py 
Running 100 samples: use_instancenorm2d=False, use_variable_test_width=False...
1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 
Passed.

Running 100 samples: use_instancenorm2d=False, use_variable_test_width=True...
48 816 1172 804 396 1152 1196 668 1008 332 600 1148 364 536 112 1060 284 288 372 360 796 648 208 172 80 788 140 476 232 496 140 420 1268 1276 532 600 928 188 560 436 996 1188 336 812 916 1084 544 740 148 916 1152 332 172 1220 144 496 1216 1120 524 492 1100 124 532 332 392 996 440 460 20 1120 628 108 388 608 484 352 668 444 1056 472 844 96 764 812 1020 1248 180 1052 188 1212 524 1260 336 784 952 24 868 1104 900 708 
Passed.

Running 100 samples: use_instancenorm2d=True, use_variable_test_width=False...
1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 1280 
Passed.

Running 100 samples: use_instancenorm2d=True, use_variable_test_width=True...
1180 512 #assertioninstanceNormalizationPlugin.cpp,307
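
To check whether the crash depends on the order of widths rather than on the input values, it may help to make the width sequence reproducible. The sketch below is not part of the original script: the helper name `width_sequence` and its `seed` parameter are assumptions introduced for illustration. It draws the same multiples of 4 as the script, but from a seeded generator, so a failing sequence (such as 1180 followed by 512) can be replayed.

```python
import random

def width_sequence(n_trials, convert_width=1280, seed=0):
    """Return n_trials widths, each a multiple of 4 in [16, convert_width].

    Mirrors np.random.randint(4, convert_width // 4 + 1) * 4 from the script,
    but uses a seeded generator so the same sequence can be replayed.
    """
    rng = random.Random(seed)
    # randrange excludes the upper bound, like np.random.randint
    return [rng.randrange(4, convert_width // 4 + 1) * 4 for _ in range(n_trials)]

widths = width_sequence(100, seed=0)
replay = width_sequence(100, seed=0)  # identical to widths
```

Feeding a fixed list such as `[1180, 512]` to the inference loop instead of fresh random widths would then show whether that pair alone is enough to trigger the assertion.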

The code

import os, sys
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.nn import Conv2d, InstanceNorm2d

import onnx
import onnxruntime

class CReLU(nn.Module):
    def __init__(self):
        super(CReLU, self).__init__()
    def forward(self, x):
        # Not in-place: an in-place leaky_relu on x would modify x before -x is computed
        return torch.cat((F.leaky_relu(x, 0.01), F.leaky_relu(-x, 0.01)), 1)

class CReLU_IN(nn.Module):
    def __init__(self, channels, use_instancenorm2d=True):
        super(CReLU_IN, self).__init__()
        self.use_instancenorm2d = use_instancenorm2d
        if self.use_instancenorm2d:
            self.bn = nn.InstanceNorm2d(channels * 2, eps=1e-05, momentum=0.1, affine=True)
    def forward(self, x):
        x = torch.cat((x, -x), 1)
        if self.use_instancenorm2d:
            x = self.bn(x)
        return F.leaky_relu(x, 0.01, inplace=True)

class ModelFeatures(nn.Module):
    def __init__(self, use_instancenorm2d=True):
        super(ModelFeatures, self).__init__()
        self.layer0 = nn.Sequential(
            Conv2d(3, 16, 3, stride=1, padding=1, bias=False),
            CReLU_IN(16, use_instancenorm2d=use_instancenorm2d),
            Conv2d(32, 32, 3, stride=2, padding=1, bias=False),
            CReLU_IN(32, use_instancenorm2d=use_instancenorm2d)
        )
        self.layer0_1 = nn.Sequential(
            Conv2d(64, 64, 3, stride=1, padding=1, bias=False),
            nn.ReLU(),
            Conv2d(64, 64, 3, stride=2, padding=1, bias=False),
            nn.ReLU(inplace=True)
        )    
    def forward(self, x):
        x = self.layer0(x)
        focr = self.layer0_1(x)
        return focr

def test_onnx_model(use_instancenorm2d=True, convert_width=1280, use_variable_test_width=False):
    ## Build the model in pytorch
    net = ModelFeatures(use_instancenorm2d=use_instancenorm2d)
    net = net.eval()  # Eval mode rather than training mode
    net = net.cuda()

    ## Create an ONNX model. Start by getting a traceable input.
    batch_size = 1
    x = 2 * torch.rand(batch_size, 3, 40, convert_width, requires_grad=True) - 1
    x = x.cuda()
    features = net(x)

    # Export the model
    model_pathname = './model_features.onnx'
    torch.onnx.export(net,                        # model being run
                      x,                          # model input (or a tuple for multiple inputs)
                      model_pathname,             # where to save the model (can be a file or file-like object)
                      export_params = True,       # store the trained parameter weights inside the model file
                      opset_version = 12,         # the ONNX version to export the model to
                      do_constant_folding = True, # whether to execute constant folding for optimization
                      input_names = ['input'],    # the model's input names
                      output_names = ['features'], # the model's output names
                      dynamic_axes = {'input':{0:'batch_size', 3:'width'},  # variable length axes
                                      'features':{0:'batch_size', 3:'width'}})

    # Load and verify that the model schematic is valid
    onnx_model = onnx.load(model_pathname)
    onnx.checker.check_model(onnx_model)

    del net
    del onnx_model

    ## Use the ONNX model
    # os.environ["ORT_TENSORRT_CACHE_PATH"] = os.path.expanduser('~') + '/.gatekeeper_cache/'
    os.environ["ORT_TENSORRT_FP16_ENABLE"] = "0"  # Disable FP16 precision
    os.environ["ORT_TENSORRT_INT8_ENABLE"] = "0"  # Disable INT8 precision
    os.environ["ORT_TENSORRT_ENGINE_CACHE_ENABLE"] = "0"  # Disable engine caching

    ort_session = onnxruntime.InferenceSession(model_pathname)
    # print(f'Starting session.')

    print(f'Running 100 samples: use_instancenorm2d={use_instancenorm2d}, use_variable_test_width={use_variable_test_width}...')
    for i in range(100):
        if use_variable_test_width:
            width = np.random.randint(4, (convert_width)//4+1) * 4  # variable width, as multiple of 4
        else:
            width = convert_width
        # print(f'Processing sample of width {width}')
        print(f'{width} ', end='')
        sys.stdout.flush()
        x = 2 * np.random.uniform(size=(batch_size, 3, 40, width)).astype(np.float32)
        ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)
        ort_inputs = {ort_session.get_inputs()[0].name: ortvalue}
        ort_outs = ort_session.run(None, ort_inputs)
    print('\nPassed.\n')

####################
## DO THE TESTING ##
####################

convert_width = 1280  # Use convert_width that is multiple of 4

## Test WITHOUT instancenorm2d and WITH fixed width samples (equal to conversion width).
test_onnx_model(use_instancenorm2d=False, convert_width=convert_width, use_variable_test_width=False)

## Test WITHOUT instancenorm2d and WITH variable width samples.
test_onnx_model(use_instancenorm2d=False, convert_width=convert_width, use_variable_test_width=True)

## Test WITH instancenorm2d and WITH fixed width samples (equal to conversion width).
test_onnx_model(use_instancenorm2d=True, convert_width=convert_width, use_variable_test_width=False)

## Test WITH instancenorm2d and WITH variable width samples.
test_onnx_model(use_instancenorm2d=True, convert_width=convert_width, use_variable_test_width=True)
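
Because a failed native assertion aborts the whole Python process, one option is to run each configuration in its own child process and record the exit status. This is a minimal sketch under that assumption; `run_isolated` is a hypothetical helper, and the two code strings are stand-ins for the real `test_onnx_model(...)` calls.

```python
import subprocess
import sys

def run_isolated(code, timeout=600):
    """Run a Python snippet in a child process and return its exit code.

    On Linux a negative code means the child was killed by a signal
    (for example -6 for SIGABRT after a failed assertion).
    """
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return result.returncode

# Hypothetical stand-ins for the four test_onnx_model(...) calls
ok_code = "print('Passed.')"
crash_code = "import os; os.abort()"

ok_status = run_isolated(ok_code)
crash_status = run_isolated(crash_code)
print(ok_status, crash_status)
```

With this layout a crash in the fourth configuration is reported as a nonzero status instead of ending the script, which makes repeated runs easier to tally.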

ytaous commented 3 years ago

Hi @mattroos, how often does case 4 (true/true) crash? I tried your code on the master branch dated 04/26 and couldn't reproduce it in 40 runs. Perhaps I can try again with 1.7.0 later when I get a chance. Now that we have released 1.7.2, could you also try it? Thanks.

mattroos commented 3 years ago

@ytaous, in my code the model is run for 100 trials, and it crashes on one of those 100 trials on nearly every execution of the script. I'll install 1.7.2 and see if that changes anything. Thanks.

mattroos commented 3 years ago

@ytaous, rather than rebuild, I just did a pip install of 1.7.0 (I had previously been using my own build of 1.7.0). Specifically, I ran pip install onnxruntime-gpu. After doing so, the crashes stopped occurring. However, something is strange: I don't think it's actually creating the engines correctly (apologies if my terminology is not correct), and/or it may always be loading from cache. And yet it never seems to cache the engines even when requested to, or it is saving them somewhere other than my specified cache path. Even if I use this ...

os.environ["ORT_TENSORRT_CACHE_PATH"] = os.path.expanduser('~') + '/.gatekeeper_cache/'
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "0"  # Disable/enable FP16 precision
os.environ["ORT_TENSORRT_INT8_ENABLE"] = "0"  # Disable/enable INT8 precision
os.environ["ORT_TENSORRT_ENGINE_CACHE_ENABLE"] = "1"  # Disable/enable engine caching

... prior to calling InferenceSession(), there are no files saved in the specified cache path afterwards. Any advice?

[EDIT]: Oh, I see now that the onnxruntime-gpu is a generic GPU implementation, and doesn't use TensorRT. I'll try building from source again.
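
A quick way to confirm which backends an installed package offers is to list its execution providers. The sketch below assumes only that `onnxruntime` may or may not be importable; `uses_tensorrt` is a hypothetical helper name.

```python
def uses_tensorrt(providers):
    """Return True if the TensorRT execution provider is in the given list."""
    return "TensorrtExecutionProvider" in providers

try:
    import onnxruntime
    available = onnxruntime.get_available_providers()
except ImportError:
    # onnxruntime is not installed in this environment
    available = []

print("Available providers:", available)
print("TensorRT available:", uses_tensorrt(available))
```

For an existing session, `ort_session.get_providers()` reports the providers that session is actually using, which can be passed to the same helper.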

microsoft / onnxruntime

Use of torch InstanceNorm2d and dynamic tensor size causes crash #7572