pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Model outputs different values after ONNX export #5742

Closed: Wantcha closed this issue 8 months ago

Wantcha commented 1 year ago

šŸ› Describe the bug

I've been trying to convert my GNN to ONNX, but after exporting the model and running it through onnxruntime with the same inputs as in PyTorch Geometric, the outputs are different. What is the reason behind that?

Here is a basic code demo where the problem occurs:

from collections import OrderedDict  # the concrete container class, not the typing alias
import numpy as np
import torch as th
import torch.nn as nn
import torch.onnx
import onnxruntime

class MLP(nn.Module):
    '''
    Multilayer Perceptron.
    '''
    def __init__(self, hidden_size: int, num_hidden_layers: int, output_size: int):
        super(MLP, self).__init__()
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.output_size = output_size

        self.initialized = False

    def _initialize(self, inputs : th.Tensor):
        if not self.initialized:
            input_size = inputs.shape[1]

            layers = OrderedDict()
            layers['input'] = nn.Linear(input_size, self.hidden_size)
            layers['relu_in'] = nn.ReLU()
            for i in range(self.num_hidden_layers):
                layers['h%d' % i] = nn.Linear(self.hidden_size, self.hidden_size)
                layers['relu%d' % i] = nn.ReLU()
            layers['out'] = nn.Linear(self.hidden_size, self.output_size)

            self.layers = nn.Sequential(layers)
            self.initialized = True
            print("INITIALIZED MLP")

    def forward(self, x):
        self._initialize(x)

        return self.layers(x)

def build_mlp_with_layer_norm(hidden_size: int, num_hidden_layers: int, output_size: int) -> th.nn.Module:
    mlp = MLP(hidden_size, num_hidden_layers, output_size)
    return th.nn.Sequential(mlp, th.nn.LayerNorm(output_size))

class GraphIndependentModule(nn.Module):
    def __init__(self, node_model):
        super(GraphIndependentModule, self).__init__()
        self.node_model = node_model

    def forward(self, x : th.Tensor):
        x = self.node_model(x)
        return x

class GraphNetwork(nn.Module):
    def __init__(self):
        super(GraphNetwork, self).__init__()

        mlp_hidden_size = 128
        mlp_num_hidden_layers = 2
        mlp_latent_size = 128

        node_encode_model = build_mlp_with_layer_norm(mlp_hidden_size, mlp_num_hidden_layers, mlp_latent_size)
        self.encoder_network = GraphIndependentModule(node_encode_model)
        self.decoder_network = MLP(mlp_hidden_size, mlp_num_hidden_layers, 3)

    def forward(self, x: th.Tensor) -> th.Tensor:
        node_feats = x.clone().detach()
        #edge_feats = edge_attr.clone().detach()

        node_feats = self.encoder_network(node_feats)

        return self.decoder_network(node_feats)

if __name__ == "__main__":

    num_edges = np.random.randint(1000, 5000)
    num_nodes = 300

    edge_attr = th.rand(num_edges, 1)  # unused in this minimal repro

    x = th.rand(num_nodes, 9)

    model = GraphNetwork()
    input_values = (x,)  # note the trailing comma: (x) is just x, not a tuple
    input_names = ['node_attr']

    result = model(x).detach().numpy()

    np.set_printoptions(threshold=6)

    print(result)

    onnx_path = "H:\\Animating Tools\\Projects\\Houdini\\LearningPhysics\\scripts\\test_model.onnx"
    torch.onnx.export(model, input_values, onnx_path, opset_version=16, input_names=input_names,
                        output_names=['coords'], dynamic_axes={'node_attr': {0: 'num_nodes'}}, verbose=False)

    ort_session = onnxruntime.InferenceSession(onnx_path)  # load the same file that was just exported

    def to_numpy(tensor):
        return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

    ort_inputs = { ort_session.get_inputs()[0].name: to_numpy(x)}
    ort_outs = ort_session.run(None, ort_inputs)

    output = ort_outs[0]

    print(output)
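
When comparing the two printouts, note that bit-exact equality between PyTorch and ONNX Runtime is not expected for float32 models; elementwise deviations around 1e-6 are normal. A minimal sketch of a tolerance-based check, reusing result and output from the script above (the tolerances are illustrative, not authoritative):

import numpy as np

# Illustrative tolerances for float32: tiny elementwise deviations are expected,
# so compare within a tolerance rather than testing for exact equality.
np.testing.assert_allclose(result, output, rtol=1e-4, atol=1e-5)
print("PyTorch and ONNX Runtime outputs match within tolerance")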

rusty1s commented 1 year ago

Sorry for the late reply. I cannot reproduce your issue; both outputs are identical:

[[0.10058211 0.04576603 0.06820673]
 [0.09406289 0.03747898 0.06547496]
 [0.08338989 0.04281291 0.06986529]
 ...
 [0.08652009 0.04229797 0.06420554]
 [0.09935425 0.05514642 0.05666434]
 [0.08517656 0.03138043 0.06819256]]
-------
[[0.10058212 0.04576601 0.06820675]
 [0.09406289 0.03747899 0.06547496]
 [0.08338988 0.04281291 0.06986529]
 ...
 [0.08652008 0.04229799 0.06420554]
 [0.09935425 0.05514641 0.05666437]
 [0.08517655 0.03138043 0.06819253]]

Your example also doesn't include a GNN, so I am wondering whether this is really PyG-related.

Wantcha commented 1 year ago

How is this possible? I've run this many times and I can't get matching results:

[[-0.06470878 -0.08579458  0.06466776]
 [-0.07017429 -0.09012724  0.07015721]
 [-0.07162452 -0.09181573  0.07169908]
 ...
 [-0.0697773  -0.09162523  0.06534937]
 [-0.07019386 -0.09201382  0.06785648]
 [-0.0749575  -0.08793869  0.06971242]]
----
[[-0.00681989 -0.08922639  0.11673598]
 [-0.00468357 -0.09578735  0.12119135]
 [-0.01658172 -0.10310581  0.12445605]
 ...
 [-0.00799159 -0.09814378  0.11381167]
 [-0.00484691 -0.09971494  0.1233532 ]
 [-0.01747447 -0.10316211  0.1262483 ]]

I posted this here because I initially assumed it was an issue with PyG, since my original project uses graph networks. Where should I post this instead?

rusty1s commented 1 year ago

This issue seems to be PyTorch-related. It would be best to create an issue in the PyTorch repo and ping me there.

rusty1s commented 1 year ago

Setting opset_version=16 can solve the issue, as described in #5921.

Wantcha commented 1 year ago

Setting the opset to 16 absolutely does not fix the issue for me. If you look at the code I provided, the opset is already set to 16, yet my values are still off.

rusty1s commented 1 year ago

Ok, thanks for confirming. Will re-open :(

fuhengwu2021 commented 1 year ago

What is your onnxruntime version? @Wantcha
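
For anyone reporting back, a small snippet along these lines collects the relevant versions (all three packages expose __version__):

import torch
import onnxruntime
import torch_geometric

# Print the package versions that matter for this issue.
print("torch:", torch.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("torch_geometric:", torch_geometric.__version__)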

lyimage commented 8 months ago

Demo code:

import torch
import onnx
import onnxruntime as ort
from torch_geometric.nn import SAGEConv
import os

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = SAGEConv(8, 16)
        self.conv2 = SAGEConv(16, 16)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

model = MyModel()
x = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])

torch_output = model(x, edge_index).detach().numpy()

torch.onnx.export(model, (x, edge_index), 'model.onnx',
                    input_names=('x', 'edge_index'), opset_version=16)

model = onnx.load('model.onnx')
onnx.checker.check_model(model)

ort_session = ort.InferenceSession('model.onnx', providers=["CPUExecutionProvider"])

onnx_output = ort_session.run(None, {
    'x': x.numpy(),
    'edge_index': edge_index.numpy()
})[0]
assert onnx_output.shape == (3, 16)

print(onnx_output)
print(torch_output)

print("edge_index=", edge_index)
# the difference should be close to 0
print("difference=", (onnx_output - torch_output).sum())

os.remove('model.onnx')
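
A side note on the check above: summing signed differences can come out near zero even when individual entries disagree, because positive and negative errors cancel. A stricter sketch, reusing onnx_output and torch_output from the demo:

import numpy as np

# Maximum absolute elementwise difference; unlike .sum(), this cannot
# be masked by positive and negative errors cancelling out.
print("max abs difference =", np.abs(onnx_output - torch_output).max())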

Environment

PyTorch version: 1.12.1
OS: Linux
Python version: 3.8.0
CUDA/cuDNN version: None (CPU)
How you installed PyTorch and PyG (conda, pip, source): pip
Any other relevant information (e.g., version of torch-scatter): torch-geometric==2.2.0, torch-cluster==1.6.0, torch-scatter==2.0.9, torch-sparse==0.6.15, torch-spline-conv==1.2.1, onnxruntime 1.12.0 through 1.16.3

With opset_version=16, the exported model still outputs different values.

rusty1s commented 8 months ago

Seeing this on PyTorch 2.1 and PyG master:

[[-0.20431207  0.22296612  0.7198481  -1.4561278  -0.80333525 -0.5460453
   0.18704385  0.42016378 -0.390862    0.13701025 -0.0441252   0.6888374
  -0.11858016 -0.15101878 -0.60684174  0.46188596]
 [ 0.17570746 -0.6933239   0.449192   -0.13163877 -0.5763432   0.48932934
  -0.5817795  -0.03735769 -0.35657042  0.82258743 -0.00645325 -0.10297411
  -0.6134554  -0.09732904  0.3812965   0.20178004]
 [-0.5805545   0.07748146  0.2428847  -0.84238297 -0.82637334 -0.70723593
   0.46391258  0.6402622  -0.3179192   0.51677203 -0.08096224  0.633123
   0.02019772 -0.24617971 -0.5585325  -0.06561629]]
[[-0.20431207  0.22296612  0.7198481  -1.4561278  -0.80333525 -0.5460453
   0.18704385  0.42016378 -0.390862    0.13701025 -0.0441252   0.6888374
  -0.11858016 -0.15101878 -0.60684174  0.46188596]
 [ 0.17570746 -0.6933239   0.449192   -0.13163877 -0.5763432   0.48932934
  -0.5817795  -0.03735769 -0.35657042  0.82258743 -0.00645325 -0.10297411
  -0.6134554  -0.09732904  0.3812965   0.20178004]
 [-0.5805545   0.07748146  0.2428847  -0.84238297 -0.82637334 -0.70723593
   0.46391258  0.6402622  -0.3179192   0.51677203 -0.08096224  0.633123
   0.02019772 -0.24617971 -0.5585325  -0.06561629]]
edge_index= tensor([[0, 1, 1, 2],
        [1, 0, 2, 1]])
difference= 0.0

lyimage commented 8 months ago

> Seeing this on PyTorch 2.1 and PyG master: [...] difference= 0.0

Thanks so much! I upgraded PyTorch to 2.1.1 and the difference is gone.