onnx / tensorflow-onnx

Convert TensorFlow, Keras, Tensorflow.js and Tflite models to ONNX
Apache License 2.0

Conversion of Bert2Bert model from Tensorflow official fails #1634

Closed LoicDagnas closed 2 years ago

LoicDagnas commented 3 years ago

Describe the bug: When converting a Bert2Bert model from the TensorFlow Model Garden (official), I get the exact same error whether I convert from the saved pb or from a function:

Traceback (most recent call last):
  File "C:/dev/ml/TextGenerator/text_generator/models/bert2bert/save_model.py", line 167, in <module>
    test()
  File "C:/dev/ml/TextGenerator/text_generator/models/bert2bert/save_model.py", line 160, in test
    "segment_ids": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32)
  File "C:\dev\ml\TextGenerator\venv\lib\site-packages\tf2onnx\convert.py", line 533, in from_function
    frozen_graph = tf_loader.from_function(concrete_func, input_names, output_names, large_model=large_model)
  File "C:\dev\ml\TextGenerator\venv\lib\site-packages\tf2onnx\tf_loader.py", line 247, in from_function
    graph_def = tf_optimize(input_names, output_names, graph_def)
  File "C:\dev\ml\TextGenerator\venv\lib\site-packages\tf2onnx\tf_loader.py", line 666, in tf_optimize
    graph_def = tf_optimize_grappler(input_names, output_names, graph_def, fold_constant)
  File "C:\dev\ml\TextGenerator\venv\lib\site-packages\tf2onnx\tf_loader.py", line 650, in tf_optimize_grappler
    graph_def = tf_opt.OptimizeGraph(config, meta_graph)
  File "C:\dev\ml\TextGenerator\venv\lib\site-packages\tensorflow\python\grappler\tf_optimizer.py", line 58, in OptimizeGraph
    graph_id, strip_default_attributes)
tensorflow.python.framework.errors_impl.InvalidArgumentError: input resource[0] expected type resource != float, the type of bert2_bert_while_decoder_gather_resource_0[0]
    In {{node bert2_bert/while/decoder/Gather}}

System information

To Reproduce: Here is the minimal code to reproduce my issue; it uses the Bert2Bert model from the TensorFlow Model Garden (official).

from official.nlp.nhnet.models import Bert2Bert, get_bert2bert_layers
from official.nlp.nhnet.configs import UNITTEST_CONFIG, BERT2BERTConfig
import tensorflow as tf
import tf2onnx

bert2bert_config = BERT2BERTConfig.from_args(**UNITTEST_CONFIG, len_title=32)
bert_layer, decoder_layer = get_bert2bert_layers(params=bert2bert_config)
bert2bert = Bert2Bert(bert2bert_config, bert_layer, decoder_layer)

@tf.function()
def serve(inputs):
    return bert2bert(inputs=inputs, mode="predict")

model_proto, _ = tf2onnx.convert.from_function(
    function=serve,
    opset=14,
    input_signature=[{
        "input_ids": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32),
        "input_mask": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32),
        "segment_ids": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32)
    }],
)

If it is simpler, I also attach a pb of the Bert2Bert model (saved_model.zip); to reproduce the exact same bug, simply run

python -m tf2onnx.convert --saved_model path/to/pb --output path/to/onnx --tag serve --signature_def serve --opset 14

Additional context: However, when I use my own graph freezing method and then run the tf2onnx conversion on the resulting frozen graph, it works perfectly fine. Here is how I froze my graph:

import tensorflow as tf
import pathlib
from tensorflow.lite.python.util import run_graph_optimizations, get_grappler_config
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# saved_model_dir, batch_size, max_seq_length and grappler_config (a list of
# grappler optimizer names) are defined elsewhere in my script.
saved_model = tf.saved_model.load(saved_model_dir)

concrete_fn = saved_model.signatures['serve']
concrete_fn.inputs[0].set_shape([batch_size, max_seq_length])
concrete_fn.inputs[1].set_shape([batch_size, max_seq_length])
concrete_fn.inputs[2].set_shape([batch_size, max_seq_length])

frozen_concrete_fn = convert_variables_to_constants_v2(concrete_fn)
frozen_concrete_graph_def = frozen_concrete_fn.graph.as_graph_def()

input_tensors = [
    tensor for tensor in frozen_concrete_fn.inputs
    if tensor.dtype != tf.resource
]

output_tensors = frozen_concrete_fn.outputs

frozen_concrete_graph_def = run_graph_optimizations(
    frozen_concrete_graph_def,
    input_tensors,
    output_tensors,
    config=get_grappler_config(list(grappler_config)),
    graph=frozen_concrete_fn.graph
)

output_dir = pathlib.Path(saved_model_dir).parent
frozen_graph_name = f"frozen_graph.bs{batch_size}.sl{max_seq_length}.pb"

tf.io.write_graph(graph_or_graph_def=frozen_concrete_graph_def,
                  name=frozen_graph_name,
                  logdir=str(output_dir.absolute()),
                  as_text=False)
TomWildenhain-Microsoft commented 3 years ago

Hi @LoicDagnas, it looks like there's a freezing issue in TF when setting lower_control_flow to false: convert_variables_to_constants_v2(func, lower_control_flow=False)

TF fails to replace the ResourceGather op with a normal Gather. We set lower_control_flow to false so we can maintain the subgraph structure.
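
For reference, here is a minimal sketch of the freezing call being discussed (paths are hypothetical; this is not tf2onnx's actual internals):

# Minimal sketch (hypothetical path). With lower_control_flow=False the While/If
# subgraphs are kept intact, but TF's freezing pass then fails to rewrite the
# ResourceGather ops inside those subgraphs, leaving a dangling resource input.
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

saved_model = tf.saved_model.load("path/to/saved_model")   # hypothetical path
concrete_func = saved_model.signatures["serve"]

# Keeping control flow un-lowered preserves the subgraph structure tf2onnx needs...
frozen_func = convert_variables_to_constants_v2(concrete_func, lower_control_flow=False)

# ...but a ResourceGather left inside a While body still expects a resource input,
# which later surfaces as the "expected type resource != float" grappler error above.
print(len(frozen_func.graph.as_graph_def().node))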

I have a workaround for TF but I'm still getting an issue with your model when I try to run it. Were you able to run the model successfully? If so, what input data did you use?

LoicDagnas commented 3 years ago

Hi @TomWildenhain-Microsoft, nope, I hadn't tried to load and serve the model yet; I just did and it fails for me too:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
session = onnxruntime.InferenceSession("path/to/model.onnx", sess_options, providers=["CPUExecutionProvider"])

gives the following error:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from C:\dev\tmp\model.onnx failed:Invalid tensor data type 
TomWildenhain-Microsoft commented 3 years ago

Interesting, the model I got loads but fails at inference time, not session creation. Can you make sure your onnx package is up to date and convert again?

pip uninstall onnx
pip install onnx

Here's the model I got: https://microsoft-my.sharepoint-df.com/:u:/p/tomwi/EW9L7b0INGtIruE89k7akaoBV8Jqb_1RFqmpYvIgjlrgsw?e=PZ6v4p If you still get a session creation error, please upload your .onnx file. It is complaining about an "Invalid tensor data type" but I'm not sure where in the model that is.

In any case, you should be able to create a session. Do you have example data to run the model on?

LoicDagnas commented 3 years ago

After upgrading onnx from 1.9.0 to 1.10.1, I am able to load my model in onnx format and also the one you attached in your last message.

Now, feeding the model with some dummy data:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
session = onnxruntime.InferenceSession("C:\\dev\\tmp\\bert2bert.onnx", sess_options,
                                       providers=["CPUExecutionProvider"])

input_ids = [101, 2023, 2633, 4504, 1999, 6094, 2008, 6461, 2000, 5787, 2925, 7521, 1997, 2822, 3667, 2000, 1996,
             2142, 2163, 1010, 1998, 5561, 2000, 14768, 8041, 4262, 2090, 1996, 2142, 2163, 1998, 2859, 1012, 102,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
             ]

input_feed = {
    "inputs": [input_ids],
    "inputs_1": [[0 if i == 0 else 1 for i in input_ids]],
    "inputs_2": [[0 for _ in input_ids]],
    "inputs_3": [[0 for _ in range(32)]]
}
output = session.run(output_names=None, input_feed=input_feed)

I get in both cases (my model and yours) the same error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Expand node. Name:'StatefulPartitionedCall/bert2_bert/bert_encoder/position_embedding/BroadcastTo' Status Message: invalid expand shape
TomWildenhain-Microsoft commented 3 years ago

@LoicDagnas well at least we both get the same error now.

The most recent issue is present in the TF model as well. When I load and run the saved model in TensorFlow I get: ValueError: Dimensions must be equal, but are 128 and 200 for '{{node position_embedding/BroadcastTo}} = BroadcastTo[T=DT_FLOAT, Tidx=DT_INT32](position_embedding/strided_slice_1, position_embedding/Shape)' with input shapes: [128,16], [3] and with input tensors computed as partial shapes: input[1] = [1,200,16]

The problem is caused by changing the shapes. Unfortunately, you can't change the max sequence length from 128 to 200 just by changing the signature. There are already ops in the graph with that shape hard-coded in, and the model weights may only work at that size.
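
As a toy illustration only (not the model's actual code), the shape clash looks like this:

# Toy illustration: the position-embedding table has max_position_embeddings rows
# baked into the trained weights, so a signature asking for a longer sequence
# than that cannot work.
import tensorflow as tf

max_position_embeddings = 128   # what the weights were built with
seq_len = 200                   # what the new input signature requests

pos_table = tf.zeros([max_position_embeddings, 16])   # [128, 16]
pos_slice = pos_table[:seq_len]                       # still only 128 rows
print(pos_slice.shape)  # (128, 16) -- broadcasting this against a [1, 200, 16]
                        # input is the BroadcastTo / "invalid expand shape" failure above.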

LoicDagnas commented 3 years ago

Mea culpa, I built my minimal example incorrectly.

The serving error simply comes from the default value of max_position_embeddings in the BERT2BERTConfig. Setting it to a value greater than or equal to 200 solves the problem for both the pb serving and the ONNX serving. To set this parameter, you can do:

from official.nlp.nhnet.models import Bert2Bert, get_bert2bert_layers
from official.nlp.nhnet.configs import BERT2BERTConfig, UNITTEST_CONFIG

bert2bert_config_dict = UNITTEST_CONFIG.copy()
bert2bert_config_dict["max_position_embeddings"] = 200
bert2bert_config_dict["len_title"] = 32

bert2bert_config = BERT2BERTConfig.from_args(**bert2bert_config_dict)
bert_layer, decoder_layer = get_bert2bert_layers(params=bert2bert_config)
bert2bert = Bert2Bert(bert2bert_config, bert_layer, decoder_layer)

Here is the newly generated model in ONNX format

model.zip

You can run an inference with the following snippet:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
session = onnxruntime.InferenceSession("C:\\dev\\tmp\\model.onnx", sess_options, providers=["CPUExecutionProvider"])

input_ids = [101, 2023, 2633, 4504, 1999, 6094, 2008, 6461, 2000, 5787, 2925, 7521, 1997, 2822, 3667, 2000, 1996,
             2142, 2163, 1010, 1998, 5561, 2000, 14768, 8041, 4262, 2090, 1996, 2142, 2163, 1998, 2859, 1012, 102,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

input_feed = {
    "input_ids:0": [input_ids],
    "input_mask:0": [[0 if i == 0 else 1 for i in input_ids]],
    "segment_ids:0": [[0 for _ in input_ids]],
    "target_ids:0": [[0 for _ in range(32)]]
}

output = session.run(output_names=None, input_feed=input_feed)

So, finally, the only remaining problem is the very first one: the conversion of the model to ONNX format without using my custom graph freezing method. You mentioned a workaround for that? Is it already usable in the lib?

TomWildenhain-Microsoft commented 3 years ago

No problem, glad to hear you got the model working! This code is correctly converting and running the model for me, with no changes to tf2onnx:

import tensorflow as tf
import os
import onnxruntime as ort
import numpy as np

from official.nlp.nhnet.models import Bert2Bert, get_bert2bert_layers
from official.nlp.nhnet.configs import BERT2BERTConfig, UNITTEST_CONFIG

bert2bert_config_dict = UNITTEST_CONFIG.copy()
bert2bert_config_dict["max_position_embeddings"] = 200
bert2bert_config_dict["len_title"] = 32

bert2bert_config = BERT2BERTConfig.from_args(**bert2bert_config_dict)
bert_layer, decoder_layer = get_bert2bert_layers(params=bert2bert_config)
bert2bert = Bert2Bert(bert2bert_config, bert_layer, decoder_layer)

input_ids = [101, 2023, 2633, 4504, 1999, 6094, 2008, 6461, 2000, 5787, 2925, 7521, 1997, 2822, 3667, 2000, 1996,
             2142, 2163, 1010, 1998, 5561, 2000, 14768, 8041, 4262, 2090, 1996, 2142, 2163, 1998, 2859, 1012, 102,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

input_feed = {
    "input_ids": np.array([input_ids], np.int32),
    "input_mask": np.array([[0 if i == 0 else 1 for i in input_ids]], np.int32),
    "segment_ids": np.array([[0 for _ in input_ids]], np.int32),
    "target_ids": np.array([[0 for _ in range(32)]], np.int32)
}

print(bert2bert(input_feed))
bert2bert.save("bert2bert")

os.system("python -m tf2onnx.convert --saved-model bert2bert --output bert2bert.onnx --opset 14")

sess = ort.InferenceSession("bert2bert.onnx")
print(sess.run(None, input_feed))

I'm noticing that the above code (as well as the working model you just uploaded) makes a model that does not contain a loop, making it much easier to convert (and probably faster to run). It also avoids the freezing error of the ResourceGather op, since there are no subgraphs. I do not know why some of these strategies produce models with loops, while others do not. Maybe the model contains multiple signatures and we are sometimes getting a different one?

LoicDagnas commented 3 years ago

Hmm, you're right, the code above does succeed in converting the Bert2Bert model; I hadn't noticed it.

Unfortunately, I have a custom model inheriting from the Bert2bert class with a custom signature as in my first example:

from official.nlp.nhnet.models import Bert2Bert, get_bert2bert_layers
from official.nlp.nhnet.configs import UNITTEST_CONFIG, BERT2BERTConfig
import tensorflow as tf
import tf2onnx

bert2bert_config = BERT2BERTConfig.from_args(**UNITTEST_CONFIG, len_title=32)
bert_layer, decoder_layer = get_bert2bert_layers(params=bert2bert_config)
bert2bert = Bert2Bert(bert2bert_config, bert_layer, decoder_layer)

@tf.function()
def serve(inputs):
    return bert2bert(inputs=inputs, mode="predict")

model_proto, _ = tf2onnx.convert.from_function(
    function=serve,
    opset=14,
    input_signature=[{
        "input_ids": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32),
        "input_mask": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32),
        "segment_ids": tf.TensorSpec(shape=(None, 200,), dtype=tf.int32)
    }],
)

but it shouldn't really change the graph's internal structure since it is basically a simple call to the model's call method.

TomWildenhain-Microsoft commented 3 years ago

Yeah, you would think it would be the same, but TF models can have multiple function definitions that are chosen based on the signature; it is rare for those to be super different (normally one just has an additional optional arg or something). We'll need to investigate more, but I think what you are doing with:

@tf.function()
def serve(inputs):
    return bert2bert(inputs=inputs, mode="predict")

vs bert2bert.save("bert2bert") are for some reason choosing two very different implementations, one of which contains a loop. We should theoretically be able to convert it, but loops can get dicey and something is going wrong. In any case, the version without the loop is probably better for you anyway, so we need to figure out why it is picking one and not the other.
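
One way to check (a sketch with hypothetical paths) is to compare the signatures and input structures each exported SavedModel actually exposes:

# Sketch: list which signatures / input structures the two SavedModels expose,
# to see whether the @tf.function export and bert2bert.save() pick different
# implementations. Paths are hypothetical.
import tensorflow as tf

for path in ["bert2bert_from_save", "bert2bert_from_tf_function"]:
    loaded = tf.saved_model.load(path)
    print(path, list(loaded.signatures.keys()))
    for name, fn in loaded.signatures.items():
        print("  ", name, fn.structured_input_signature)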

I'm noticing that the target_ids input is included in only one of the two approaches. Does adding target_ids to your above attempt fix it?

LoicDagnas commented 3 years ago

I had the same intuition, but no, adding the target_ids in the above custom signature doesn't solve the problem 😢

By the way, in the example with the target_ids, I don't get why we have to include them in the serving signature, since these ids should only be used at training time. Anyway, in that case it is an issue I should file on the TensorFlow Model Garden repo.

TomWildenhain-Microsoft commented 3 years ago

Yeah it's strange that they are so different. In any case, you should have a working model now. I'm reluctant to add the hack to fix ResourceGather op freezing until I can confirm it works on a model, and we can't get the loop version of this model to work yet. It might have other weirdness that is actually causing the failure. If you figure out what is causing the different models to be selected and/or discover that you actually need the loop version of the model to convert, let me know. Otherwise, is it ok if we close this issue?

LoicDagnas commented 3 years ago

Unfortunately, I still have an issue with the model I am actually using: it is not simply the Bert2Bert model but a model inheriting from it.

I am able to convert it to ONNX format using the snippet of code given here https://github.com/onnx/tensorflow-onnx/issues/1634#issue-958156867 but I cannot load it.

Here is the onnx model: https://drive.google.com/drive/folders/1cLfJ2Eyh3KCkvD0mzngfOKOhM69ieyE1?usp=sharing

Loading the ONNX model gives the very concise error: failed:Invalid tensor data type

To be more precise, here is my implementation inheriting from the Bert2Bert model:

# Relies on `from typing import Dict` and `import tensorflow as tf`, plus Bert2Bert,
# get_attention_bias and SamplingModule from the TensorFlow Model Garden (imports not shown).
class TextGenerator(Bert2Bert):

    def __init__(self, params, bert_layer, decoder_layer):
        super().__init__(params, bert_layer, decoder_layer)

    def call(self, inputs, mode="train"):

        bert2bert_inputs = inputs.copy()
        bert2bert_inputs['input_mask'] = tf.where(inputs['input_ids'] == 0, 0, 1)
        bert2bert_inputs['segment_ids'] = tf.zeros_like(inputs['input_ids'])

        if mode != 'predict_with_sampling':
            return super(TextGenerator, self).call(bert2bert_inputs, mode)
        else:
            return self.serve_with_sampling(bert2bert_inputs)

    @tf.function()
    def serve_with_sampling(self, serve_inputs: Dict[str, tf.Tensor]):

        input_ids = serve_inputs['input_ids']
        input_mask = serve_inputs['input_mask']
        segment_ids = serve_inputs['segment_ids']
        queries_count = serve_inputs['queries_count']

        all_encoder_outputs, _ = self.bert_layer([input_ids, input_mask, segment_ids])

        encoder_decoder_attention_bias = get_attention_bias(
            input_ids,
            bias_type="single_cross",
            padding_value=self.params.pad_token_id)

        batch_size = tf.shape(input_ids)[0]

        start_token_ids = tf.ones([batch_size],
                                  tf.int32) * self.params.start_token_id

        if self.params.use_cache:
            cache = self._init_cache(batch_size)
        else:
            cache = {}

        cache["all_encoder_outputs"] = all_encoder_outputs
        cache["attention_bias"] = encoder_decoder_attention_bias

        generator = SamplingModule(
            symbols_to_logits_fn=self._get_symbols_to_logits_fn(self.params.len_title),
            length_normalization_fn=None,
            vocab_size=self.params.vocab_size,
            max_decode_length=self.params.len_title,
            eos_id=self.params.end_token_id,
            padded_decode=False,
            enable_greedy=False,
            top_k=10,
            top_p=1.0,
            sample_temperature=1.0
        )

        @tf.function()
        def generator_call():
            result, _ = generator.generate(start_token_ids, cache)
            return result

        return tf.transpose(
            a=tf.map_fn(fn=lambda t: generator_call(),
                        elems=tf.range(tf.squeeze(queries_count))),
            perm=[1, 0, 2])

and here is how I save the model:

@tf.function()
def serve(input_ids: Dict[str, tf.Tensor], queries_count: int):
    return text_generator(
        inputs={'input_ids': input_ids, 'queries_count': queries_count},
        mode="predict_with_sampling")
text_generator.save(saved_model_dir, signatures={
    'serve': serve.get_concrete_function(
        input_ids=tf.TensorSpec(shape=(None, max_input_length,), dtype=tf.int32),
        queries_count=tf.TensorSpec(shape=(), dtype=tf.int32)
    )
})
TomWildenhain-Microsoft commented 3 years ago

Yeah, this model is pretty different from the others. It has some nested loops and stuff. Looking at the ONNX model I can clearly see a problem (see the attached screenshot): the TensorListReserve op is not supported. It should have given you an error during conversion with a list of unsupported ops. We should be able to convert that though; the pattern containing it isn't matching for some reason. I'll need a saved model to figure out the conversion.
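
A quick way to spot leftover unconverted ops such as TensorListReserve is a scan over the main graph (a sketch with a hypothetical path; ops inside Loop/If subgraphs would need a recursive walk over node attributes):

# Sketch (hypothetical path): count op types in the exported ONNX main graph to
# spot unconverted TF ops such as TensorListReserve.
import onnx
from collections import Counter

model = onnx.load("model.onnx")
print(Counter(node.op_type for node in model.graph.node).most_common(25))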

I think you forgot to include the line text_generator = TextGenerator(...). I need to know what params you are using to make the saved model / frozen graph (or you can upload one).

LoicDagnas commented 3 years ago

I uploaded it here:

To create the model, I run the following script:

# Relies on `import tensorflow as tf`, BertConfig and BERT2BERTConfig from the TensorFlow
# Model Garden, get_bert2bert_layers, and the TextGenerator class above (imports not shown).
def create_model(bert_config: str,
                 max_input_length: int,
                 max_output_length: int,
                 queries_count: int = 20,
                 batch_size: int = None) -> tf.keras.Model:
    """
    Create a model wrapping around the Google BERT2BERT implementations.
    :param bert_config: the configuration of the underlying BERT model
    :param max_input_length: maximum length of a single passage
    :param max_output_length: maximum length of the generated texts
    :param queries_count: number of queries
    :param batch_size: size of the batch
    :return: the created TextGenerator model
    """

    # Build the base BERT configuration
    bert_config = BertConfig.from_json_file(bert_config).to_dict()

    # Drop the parameters which don't exist in the BERT2BERT configuration
    bert_config.pop('embedding_size', None)
    bert_config.pop('backward_compatible', None)

    # Finally build the BERT2BERT configuration, setting the exact same parameters for decoders
    bert2bert_config = BERT2BERTConfig(
        num_decoder_attn_heads=bert_config['num_attention_heads'],
        num_decoder_layers=bert_config['num_hidden_layers'],
        decoder_intermediate_size=bert_config['intermediate_size'],
        beam_size=queries_count,
        **bert_config)

    bert2bert_config.override(
        {
            "len_title": max_output_length,
        },
        is_strict=False
    )

    # Build the encoder decoder model
    bert_layer, decoder_layer = get_bert2bert_layers(params=bert2bert_config)

    # Create the main model instance
    text_generator = TextGenerator(
        params=bert2bert_config,
        bert_layer=bert_layer,
        decoder_layer=decoder_layer)

    # Call the model
    input_ids = tf.keras.layers.Input((max_input_length,), dtype=tf.int32, name="input_ids", batch_size=batch_size)
    queries_count = tf.keras.layers.Input((), dtype=tf.int32, name="queries_count")
    target_ids = tf.keras.layers.Input((max_output_length,), dtype=tf.int32, name="target_ids")

    inputs = {
        'input_ids': input_ids,
        'queries_count': queries_count,
        'target_ids': target_ids
    }

    text_generator(inputs, mode='predict_with_sampling')

    return text_generator

where the bert_config argument is the path to a JSON file containing a BERT configuration such as:

{
  "hidden_size": 256,
  "hidden_act": "gelu",
  "initializer_range": 0.02,
  "vocab_size": 30522,
  "hidden_dropout_prob": 0.1,
  "num_attention_heads": 4,
  "type_vocab_size": 2,
  "max_position_embeddings": 512,
  "num_hidden_layers": 4,
  "intermediate_size": 1024,
  "attention_probs_dropout_prob": 0.1
}
TomWildenhain-Microsoft commented 3 years ago

The freezing issue for the saved model will be fixed here: #1672

Turns out Tensorflow doesn't implement freezing that op: https://github.com/tensorflow/tensorflow/issues/51488
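
For anyone checking their own frozen graphs, here is a small sketch (hypothetical path) to detect leftover resource ops that the freezing pass failed to rewrite:

# Sketch (hypothetical path): scan a frozen GraphDef for leftover resource ops
# such as ResourceGather, which indicate freezing did not rewrite them.
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

leftovers = [n.name for n in graph_def.node if n.op.startswith("Resource")]
print(leftovers)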

There are still issues to resolve for this model, though. What are you using it for, btw? I notice that you are converting a lot of different models to ONNX.

LoicDagnas commented 3 years ago

Hello @TomWildenhain-Microsoft,

Sorry for answering so late, I was on holiday.

First of all, thanks: the conversion seems to work now, which solves the first part of my issue.

Here is the serving error I get now:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Loop node. Name:'StatefulPartitionedCall/query_generator/StatefulPartitionedCall/map/while_loop' Status Message: Non-zero status code returned while running Loop node. Name:'map/while/StatefulPartitionedCall/while_loop' Status Message: Non-zero status code returned while running Gather node. Name:'while/decoder/word_embeddings/Gather' Status Message: indices element out of data bounds, idx=30522 must be within the inclusive range [-30522,30521]

running:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
session = onnxruntime.InferenceSession('path/to/model.onnx', sess_options, providers=["CPUExecutionProvider"])

input_ids = [101, 2023, 2633, 4504, 1999, 6094, 2008, 6461, 2000, 5787, 2925, 7521, 1997, 2822, 3667, 2000, 1996,
             2142, 2163, 1010, 1998, 5561, 2000, 14768, 8041, 4262, 2090, 1996, 2142, 2163, 1998, 2859, 1012, 102,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

input_feed = {
    "input_ids": [input_ids],
    "queries_count": [5]
}

session.run(output_names=None, input_feed=input_feed)

To answer your last question: we train TensorFlow models at my company, and we are currently evaluating the feasibility of using ORT as our inference runtime.

TomWildenhain-Microsoft commented 3 years ago

@LoicDagnas

I'm able to get the same error on my side. This looks like it is going to be pretty hard to track down and I'm a bit busy this week, but I'll try when I get a chance. The error is definitely a bad calculation (likely an incorrect node wiring during conversion) that is evident in the first iteration of the loop. To debug, I might have to capture all the intermediate outputs from the first iteration of the loop in tf/onnx and compare them.

Also let me know if you want to talk to one of our PMs. They like to hear from ORT users.

LoicDagnas commented 3 years ago

Cool thank you, let me know if you need anything else to help your investigations.

We'll be glad to talk with one of your PMs, they can reach me at dagnas@sinequa.com.

LoicDagnas commented 3 years ago

Hello @TomWildenhain-Microsoft, by any chance did you have a bit of time to have a look at the issue?

Thanks.

TomWildenhain-Microsoft commented 3 years ago

Hey @LoicDagnas sorry for the delay here. I haven't had a chance to look at this and my responsibilities have shifted away from tf2onnx so I'm not sure I'll be able to. If you have time and are willing to help determine the issue in tf2onnx then fixing it will be a lot easier.

Here's the approach I'd take: using --output_frozen_graph during conversion, we can get a TF graph and an ONNX graph to compare. Open them in Netron. From some debugging I did before, I think the issue is in the innermost graph (inside 2 nested loops): something is going wrong and one of the graph outputs doesn't match TF. If you can pinpoint what that is through examination, then maybe you can find the bug. Keep in mind the ONNX loop semantics differ from TF (https://github.com/onnx/onnx/blob/master/docs/Operators.md#Loop)
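
For reference, the flag is passed alongside the usual conversion arguments (paths here are placeholders):

python -m tf2onnx.convert --saved-model path/to/saved_model --output model.onnx --opset 14 --output_frozen_graph frozen_tf_graph.pb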

If you can't find the issue through manual inspection, then you will need to get all the intermediate values out of the graph in both ONNX and TF and find the first discrepancy. I'd bet it isn't actually a miscalculation (since the ops look the same to me) but a mis-wiring where something goes to the wrong place. To get intermediate values without rebuilding ORT, you can spy on values by inserting a custom Python op (https://github.com/microsoft/onnxruntime-extensions/blob/main/tutorials/tf2onnx_custom_ops_tutorial.ipynb). I think TF has something similar.
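
As a starting point, here is a rough sketch (hypothetical paths, main graph only; values inside Loop subgraphs are not reachable this way) for promoting intermediate tensors to outputs so ORT and TF values can be compared:

# Rough sketch (hypothetical paths): expose inferred intermediate tensors of the
# main graph as extra outputs, then run the model and compare values against TF.
import onnx
from onnx import shape_inference
import onnxruntime as ort

model = shape_inference.infer_shapes(onnx.load("model.onnx"))
model.graph.output.extend(model.graph.value_info)   # promote intermediates to outputs
onnx.save(model, "model_debug.onnx")

sess = ort.InferenceSession("model_debug.onnx", providers=["CPUExecutionProvider"])
# results = sess.run(None, input_feed)  # now also returns every main-graph intermediate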

This type of bug where tf2onnx generates a valid model that produces invalid results is rare and is probably one of the toughest to debug (especially in a subgraph where you can't just add all the values as graph outputs to read them).

fatcat-z commented 2 years ago

It's been over 2 months, so closing this. Feel free to open a new one if the issue still exists.