pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Model outputs different values after ONNX export #87398

Closed Wantcha closed 1 year ago

Wantcha commented 1 year ago

šŸ› Describe the bug

I've been trying to convert my GNN to ONNX, but after exporting the model and running it through onnxruntime with the same inputs I use in PyTorch Geometric, the outputs are different. What is the reason behind that?

Here is a basic code demo where the problem occurs:

from collections import OrderedDict
import numpy as np
import torch as th
import torch.nn as nn
import torch.onnx
import onnxruntime

class MLP(nn.Module):
    '''
    Multilayer Perceptron.
    '''
    def __init__(self, hidden_size: int, num_hidden_layers: int, output_size: int):
        super(MLP, self).__init__()
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.output_size = output_size

        # Layers are built lazily on the first forward pass, once the input size is known.
        self.initialized = False

    def _initialize(self, inputs: th.Tensor):
        if not self.initialized:
            input_size = inputs.shape[1]

            layers = OrderedDict()
            layers['input'] = nn.Linear(input_size, self.hidden_size)
            layers['relu_in'] = nn.ReLU()
            for i in range(self.num_hidden_layers):
                layers['h%d' % i] = nn.Linear(self.hidden_size, self.hidden_size)
                layers['relu%d' % i] = nn.ReLU()
            layers['out'] = nn.Linear(self.hidden_size, self.output_size)

            self.layers = nn.Sequential(layers)
            self.initialized = True

    def forward(self, x):
        self._initialize(x)

        return self.layers(x)

class GraphIndependentModule(nn.Module):
    def __init__(self, node_model):
        super(GraphIndependentModule, self).__init__()
        self.node_model = node_model

    def forward(self, x: th.Tensor):
        x = self.node_model(x)
        return x

class GraphNetwork(nn.Module):
    def __init__(self):
        super(GraphNetwork, self).__init__()

        node_encode_model = th.nn.Sequential(MLP(128, 2, 128), th.nn.LayerNorm(128))
        self.encoder_network = GraphIndependentModule(node_encode_model)
        self.decoder_network = MLP(128, 2, 3)

    def forward(self, x: th.Tensor) -> th.Tensor:
        node_feats = x.clone().detach()

        node_feats = self.encoder_network(node_feats)

        return self.decoder_network(node_feats)

if __name__ == "__main__":
    num_nodes = 300

    x = th.rand(num_nodes, 9)

    model = GraphNetwork()
    input_values = (x,)  # (x) is just x; use a one-element tuple for the export args
    input_names = ['node_attr']

    model.eval()
    result = model(x).detach().numpy()

    np.set_printoptions(threshold=6)

    print(result)

    # Export and load the same file so onnxruntime does not pick up a stale model
    onnx_path = "H:\\Animating Tools\\Projects\\Houdini\\LearningPhysics\\scripts\\test_model.onnx"
    torch.onnx.export(model, input_values, onnx_path, opset_version=16, input_names=input_names,
                      output_names=['coords'], dynamic_axes={'node_attr': {0: 'num_nodes'}}, verbose=False)

    ort_session = onnxruntime.InferenceSession(onnx_path)

    def to_numpy(tensor):
        return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

    ort_inputs = { ort_session.get_inputs()[0].name: to_numpy(x)}
    ort_outs = ort_session.run(None, ort_inputs)

    output = ort_outs[0]

    print('----')

    print(output)

Versions

PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: 10.0.0
CMake version: version 3.20.21032501-MSVC_2
Libc version: N/A

Python version: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080
Nvidia driver version: 516.94
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] numpydoc==1.4.0
[pip3] torch==1.12.1+cu113
[pip3] torch-cluster==1.6.0
[pip3] torch-geometric==2.1.0.post1
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.15
[pip3] torch-spline-conv==1.2.1
[pip3] torchaudio==0.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] Could not collect

Wantcha commented 1 year ago

@rusty1s

lminer commented 1 year ago

I'm having the same issue

warmmilk-sudo commented 1 year ago

I'm having the same issue

Wantcha commented 1 year ago

Any news on this? I'm working on a project that requires this functionality to work properly, and this issue is completely blocking me.

thiagocrepaldi commented 1 year ago

Some discrepancy is expected, since PyTorch and ONNX Runtime implement their kernels differently.

However, if you assign the PyTorch output to pt_result and the ONNX Runtime output to ort_result and compare them with assert np.allclose(pt_result, ort_result, rtol=1e-3, atol=1e-6), you will notice that the assert passes.

For ONNX model conversions, rtol=1e-3 and atol=1e-6 are considered good thresholds.
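
For reference, a minimal sketch of that check, appended to the reproduction script above (pt_result and ort_result are just illustrative renames of result and ort_outs[0] from the original code):

# Compare the two outputs within the suggested tolerances
pt_result = result          # numpy array produced by the PyTorch model above
ort_result = ort_outs[0]    # first output returned by onnxruntime

assert np.allclose(pt_result, ort_result, rtol=1e-3, atol=1e-6), \
    "PyTorch and ONNX Runtime outputs differ beyond the expected tolerance"
print('Max absolute difference:', np.abs(pt_result - ort_result).max())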

thiagocrepaldi commented 1 year ago

Closing this issue since the author has not replied for the last 2 months. Please feel free to reopen if this is still an issue.

HoaNguyen55 commented 1 year ago

@thiagocrepaldi Hi, I don't understand the command you mentioned, np.allclose(pt_result, ort_result, rtol=1e-3, atol=1e-6). What does it mean? Could you please give me more details or a related example for fixing the problem where the ONNX model returns different results than the MLModel.zip or PyTorch model?

vavanade commented 1 year ago

Hi, I don't understand this command you mentioned np.allclose(pt_result, ort_result, rtol=1e-3, atol=1e-6). What does it mean ?

allclose is a NumPy function. Here is its documentation: https://numpy.org/doc/stable/reference/generated/numpy.allclose.html

It checks whether the values of the two results stored in pt_result and ort_result are all close to each other within the given relative (rtol) and absolute (atol) tolerances.
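
As a rough, self-contained illustration (the arrays below are made up, not from this issue), allclose returns True when every element satisfies |a - b| <= atol + rtol * |b|, so tiny floating-point differences between backends are tolerated while real mismatches are not:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0000005, 1.999999, 3.000002])   # tiny numerical noise

# True: every element is within atol + rtol * |b| of its counterpart
print(np.allclose(a, b, rtol=1e-3, atol=1e-6))

# False: the last element differs far more than the tolerance allows
c = np.array([1.0, 2.0, 3.5])
print(np.allclose(a, c, rtol=1e-3, atol=1e-6))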