Cannot export multilayer LSTM to tensorflow

chist commented 3 years ago

Describe the bug

I need to convert my PyTorch model to TensorFlow. For this I use the PyTorch -> ONNX -> TensorFlow approach.

However, I get the following error when I try to run the prepared TensorFlow model:

ValueError: Dimensions must be equal, but are 6 and 7 for 'rnn/multi_rnn_cell/cell_0/lstm_cell/MatMul' (op: 'MatMul') with input shapes: [50,6], [7,12].

The problem appears when the original PyTorch LSTM module has more than one layer:

num_layers = 2
self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_num, num_layers=num_layers, batch_first=True)

If there is only one layer, everything works fine.
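The shapes in the error message fit the explanation that the second layer is being run with the first layer's weights. A TensorFlow `LSTMCell` multiplies `concat([x_t, h_{t-1}])` by a kernel of shape `[input_depth + num_units, 4 * num_units]`; the sketch below only does that arithmetic for this model (input size 4, hidden size 3, batch 50). The helper names are made up for illustration.

```python
def lstm_kernel_shape(input_depth, num_units):
    # Kernel of a TF-style LSTM cell: [input + hidden, 4 gates * hidden]
    return (input_depth + num_units, 4 * num_units)


def matmul_lhs_shape(batch, input_depth, num_units):
    # Left operand of the cell's MatMul: concat([x_t, h_{t-1}], axis=1)
    return (batch, input_depth + num_units)


batch, input_size, hidden = 50, 4, 3

# Layer 0 sees the 4-dim input; every later layer sees the previous layer's 3-dim output.
layer0_kernel = lstm_kernel_shape(input_size, hidden)
layer1_lhs = matmul_lhs_shape(batch, hidden, hidden)

print("layer 1 input to MatMul:", layer1_lhs)                        # (50, 6)
print("layer 0 kernel:", layer0_kernel)                              # (7, 12)
print("correct layer 1 kernel:", lstm_kernel_shape(hidden, hidden))  # (6, 12)
```

With a single layer only the `(50, 7) x (7, 12)` product occurs, which is consistent with the one-layer model working. The same arithmetic with input size 80 and hidden size 256 gives a `(336, 1024)` layer-0 kernel against a `(219, 512)` input to the later layers, matching the shapes in the 3-layer error reported further down this thread.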

To Reproduce

Here's the full code:

import torch
import torch.nn as nn
import onnx
import tensorflow as tf
import torch.onnx
from onnx_tf.backend import prepare
import numpy as np

class MyLSTM(nn.Module):
    def __init__(self, hidden_num, num_layers):
        super(MyLSTM, self).__init__()
        self.linear = nn.Linear(200, 4)
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_num,
                num_layers=num_layers, batch_first=True)

    def forward(self, x, h0c0):
        out = x.view(x.shape[0], 1, -1)
        out = self.linear(out)
        out, _ = self.lstm(out.view(x.shape[0], -1, 4), h0c0)
        out = out.view(x.shape[0], -1)
        return out

batch_size = 50
hidden_num = 3
num_layers = 2
model = MyLSTM(hidden_num, num_layers)

# test pytorch model
inputs = torch.zeros(batch_size, 10, 20)
h0 = torch.zeros(num_layers, batch_size, hidden_num)
c0 = torch.zeros(num_layers, batch_size, hidden_num)
out = model(inputs, (h0, c0))
print(out)

# export from pytorch to ONNX
onnx_path = "./lstm.onnx"
torch.onnx.export(model, (inputs, (h0, c0)), onnx_path,
        dynamic_axes={'input': {0: 'batch'},
                      'h0': {1: 'batch'}, 'c0': {1: 'batch'},
                      'output': {0: 'batch'}},
        input_names=['input', 'h0', 'c0'], output_names=['output'])

# load ONNX model and create tensorflow representation
onnx_model = onnx.load(onnx_path)
tf_rep = prepare(onnx_model, device='cpu')

# run tensorflow model
inputs = (np.zeros((batch_size, 10, 20), dtype=np.float32),
          np.zeros((num_layers, batch_size, hidden_num), dtype=np.float32), 
          np.zeros((num_layers, batch_size, hidden_num), dtype=np.float32)) 
result = tf_rep.run(inputs)  # RUN-TIME ERROR HERE
print(result)

ONNX model file

https://drive.google.com/file/d/1kaCryP-My7_I4Gd2BS37uKUzYB6MkHaG/view?usp=sharing

Python, ONNX, ONNX-TF, Tensorflow version

Python version: 2.7.18 |Anaconda, Inc.| (default, Apr 23 2020, 17:30:41) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
ONNX version: 1.7.0
ONNX-TF version: 1.6.0
Tensorflow version: 2.1.0

(This is the only configuration I managed to export with.)

Additional context

As I understand it, such problems usually arise when the same cell object is reused for different LSTM layers instead of a new cell being created for each layer: https://stackoverflow.com/a/48796202
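To make the cell-reuse failure concrete without TensorFlow, here is a toy NumPy sketch. `LazyCell` is a hypothetical stand-in, not a TensorFlow class: like a TF RNN cell, it sizes its kernel from the first input it sees, so reusing one object for two layers with different input widths fails on the second layer, while creating one cell per layer works. Writing `[cell] * num_layers` is a common way to end up with this kind of sharing.

```python
import numpy as np


class LazyCell:
    """Toy stand-in for an RNN cell (hypothetical): the kernel is created on the
    first call, sized from whatever input it sees first."""

    def __init__(self, num_units, seed=0):
        self.num_units = num_units
        self.kernel = None
        self.rng = np.random.default_rng(seed)

    def __call__(self, x, h):
        xh = np.concatenate([x, h], axis=1)  # [batch, input + hidden]
        if self.kernel is None:
            self.kernel = self.rng.standard_normal((xh.shape[1], 4 * self.num_units))
        # Toy output only; no real LSTM gating here.
        return np.tanh(xh @ self.kernel)[:, : self.num_units]


def run_stack(cells, x):
    # Each layer's output becomes the next layer's input.
    for cell in cells:
        h = np.zeros((x.shape[0], cell.num_units))
        x = cell(x, h)
    return x


batch, input_size, hidden, num_layers = 50, 4, 3, 2
x = np.zeros((batch, input_size))

# Correct: a new cell per layer.
fresh = [LazyCell(hidden, seed=i) for i in range(num_layers)]
print(run_stack(fresh, x).shape)  # (50, 3)

# Buggy: the same cell object repeated for every layer.
shared = [LazyCell(hidden)] * num_layers
shared_error = None
try:
    run_stack(shared, x)
except ValueError as err:  # layer 1 feeds (50, 6) into the (7, 12) kernel built by layer 0
    shared_error = err
    print("shared cell failed:", err)
```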

Moreover, the problem does not go away if I split the multilayer LSTM into several single-layer LSTMs.

So I believe it is related to lines 34-37 in rnn_mixin.py: https://github.com/onnx/onnx-tensorflow/blob/c63d4351c7752a769cdc9a1bfcf79ffd140e0e6a/onnx_tf/handlers/backend/rnn_mixin.py#L34-L37

I should also note that the problem lies in the onnx-tensorflow module: the conversion from PyTorch to ONNX and back (checked with onnxruntime) works as expected.

faroit commented 3 years ago

@chist I ran into the same problem. Did you figure out a solution?

aishoot commented 3 years ago

I can convert both single-layer and multilayer LSTMs to TensorFlow (onnx 1.7.0, onnx-tf 1.7.0). However, with a multilayer LSTM the outputs of onnx and onnx-tf differ, whereas with a single-layer LSTM they match.
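One way to quantify the mismatch between the two backends is to compare their outputs numerically. A minimal sketch, assuming both outputs are already NumPy arrays (for example, the ONNX reference output from onnxruntime and the corresponding entry of `tf_rep.run(...)`); `compare_outputs` and its default tolerances are illustrative choices, not part of either library.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-5, atol=1e-6):
    """Return (max absolute difference, whether the arrays match within tolerance)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    return max_abs_diff, bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))


# Made-up arrays standing in for the two backends' outputs.
ref = np.array([[0.1, 0.2, 0.3]], dtype=np.float32)
print(compare_outputs(ref, ref.copy()))  # (0.0, True)
print(compare_outputs(ref, ref + 0.05))  # max diff of about 0.05, not within tolerance
```

Reporting the maximum absolute difference alongside the pass/fail flag makes it easier to tell a small numerical drift from the large discrepancy you would expect if a layer were running with the wrong weights.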

aishoot commented 3 years ago

@chist have you figured out a solution?

shocoladka commented 3 years ago

I probably ran into the same error when I was trying to load several instances of an LSTM model in one session. I believe the problem is that rnn_cell is a class variable, so it is effectively global:

https://github.com/onnx/onnx-tensorflow/blob/c63d4351c7752a769cdc9a1bfcf79ffd140e0e6a/onnx_tf/handlers/backend/rnn_mixin.py#L28

As a result, when you create several LSTM instances, rnn_cell is initialized only once.

I suspect this is not the intended behavior. @chinhuang007, could you take a look, please?
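The caching mechanism described in this comment can be reproduced in plain Python. The classes below are hypothetical analogues, not the onnx-tf source: `CachedCellMixin` stores a cell on a class attribute the first time one is requested, so a second LSTM node with a different input size receives the first node's cell; `FreshCellMixin` builds a new one on each call. Plain dicts stand in for TF cell objects.

```python
class CachedCellMixin:
    """Hypothetical analogue: the cell is cached on the class, so every caller shares it."""

    rnn_cell = None  # class attribute, shared by all users of the class

    @classmethod
    def get_cell(cls, input_depth, num_units):
        if cls.rnn_cell is None:
            # Created only on the first call; later calls ignore their own sizes.
            cls.rnn_cell = {"input_depth": input_depth, "num_units": num_units}
        return cls.rnn_cell


class FreshCellMixin:
    """Same interface, but a new cell description is built on every call."""

    @classmethod
    def get_cell(cls, input_depth, num_units):
        return {"input_depth": input_depth, "num_units": num_units}


# Two LSTM nodes: the first sees the 4-dim model input, the second the 3-dim output of the first.
first = CachedCellMixin.get_cell(4, 3)
second = CachedCellMixin.get_cell(3, 3)
print(second is first, second["input_depth"])  # True 4  (wrong size for the second node)

first = FreshCellMixin.get_cell(4, 3)
second = FreshCellMixin.get_cell(3, 3)
print(second is first, second["input_depth"])  # False 3
```

If the onnx-tf handler does cache its cell this way, that would also be consistent with the earlier report that splitting the model into several single-layer LSTMs does not help, since separate LSTM nodes would still share the cached cell.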

phamtrongthang123 commented 3 years ago

The bug still occurs: I cannot run the exported model with a multilayer LSTM.

Python 3.7 / 3.8
ONNX-TF 1.9.0
ONNX 1.10.1
Tensorflow 2.6.0

Error:

InvalidArgumentError:  Matrix size-incompatible: In[0]: [219,512], In[1]: [336,1024]
     [[{{node LSTM_2ab5f684/rnn/while/body/_58/LSTM_ad6f857e/rnn/while/rnn/multi_rnn_cell/cell_0/lstm_cell/BiasAdd}}]] [Op:__inference___call___2981]

Function call stack:
__call__

The bug reproduces in a Colab environment (on both GPU and CPU). Here is the reproduction code:

!pip install onnx-tf  # Colab shell command
import torch
import torch.nn as nn
import onnx
from onnx_tf.backend import prepare

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = nn.LSTM(input_size=80,
            hidden_size=256,
            num_layers=3,  # bug at 3
            dropout=0,
            bidirectional=False,
            batch_first=True)
G = G.to(device)
aus = torch.zeros(219, 18, 80, device=device)
torch.onnx.export(G,
        args=(aus,),
        f="audio_G.onnx",
        input_names=["au"],
        output_names=["a", "b", "c"],
        opset_version=12)
model = onnx.load("audio_G.onnx")
tf_rep = prepare(model)
o, w, e = tf_rep.run(aus.cpu().numpy())  # pass a NumPy array to the TF backend

(onnx/onnx-tensorflow issue #796)