Open tfeher opened 3 years ago
To debug the problem, I have copied the code from trt_convert.py here:
func, model = get_func_from_saved_model(bert_saved_model_path)
# Create frozen func
from tensorflow.python.framework import convert_to_constants
frozen_func = convert_to_constants.convert_variables_to_constants_v2(func)
# Prepare for Grappler optimization pass
from tensorflow.python.training import saver
grappler_meta_graph_def = saver.export_meta_graph(
    graph_def=frozen_func.graph.as_graph_def(), graph=frozen_func.graph)
from tensorflow.core.protobuf import config_pb2
from tensorflow.core.protobuf import meta_graph_pb2
from tensorflow.core.protobuf import rewriter_config_pb2
fetch_collection = meta_graph_pb2.CollectionDef()
for array in frozen_func.inputs + frozen_func.outputs:
    fetch_collection.node_list.value.append(array.name)
grappler_meta_graph_def.collection_def["train_op"].CopyFrom(fetch_collection)
grappler_session_config = config_pb2.ConfigProto()
conv_params = trt.TrtConversionParams(
    precision_mode='FP16', minimum_segment_size=50,
    max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
custom_rewriter_config = trt._get_tensorrt_rewriter_config(
    conversion_params=conv_params,
    is_dynamic_op=True,
    max_batch_size=None,
    disable_non_trt_optimizers=False,
    use_implicit_batch=False,
    profile_strategy="Optimal")
grappler_session_config.graph_options.rewrite_options.CopyFrom(
    custom_rewriter_config)
# Convert
from tensorflow.python.grappler import tf_optimizer
converted_graph_def = tf_optimizer.OptimizeGraph(
    grappler_session_config, grappler_meta_graph_def, graph_id=b"tf_graph")
This last step returns an empty graph def. We should throw an error in that case to avoid the misleading error described in Problem 1.
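A guard right after the Grappler call is one way such an error could be raised. The sketch below is only an illustration, not what trt_convert.py does: `ensure_non_empty_graph` is a hypothetical helper, and it relies only on the `node` field that a graph def exposes.

```python
def ensure_non_empty_graph(graph_def, step="Grappler optimization"):
    """Raise a descriptive error if a conversion step produced no nodes.

    Hypothetical helper: `graph_def` is anything with a `node` sequence,
    such as the graph def returned by tf_optimizer.OptimizeGraph.
    """
    if len(graph_def.node) == 0:
        raise RuntimeError(
            f"{step} returned an empty graph def; the converted graph may "
            "have exceeded the protobuf size limit.")
    return graph_def
```

In the snippet above it would be called as `converted_graph_def = ensure_non_empty_graph(converted_graph_def)`.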
@bixia1 While the conversion of TF Hub BERT Large models fails, there are other versions of the BERT Large model that can be converted with TF-TRT. This includes the NGC BERT models and the HuggingFace BERT Large models. Here is a script which demonstrates HuggingFace BERT model conversion. You need to run pip install transformers (and pip install ipywidgets if you are using a Jupyter notebook).
import tensorflow as tf
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
tf.get_logger().setLevel('ERROR')
# ## Helper functions
def get_func_from_saved_model(saved_model_dir):
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded
def trt_convert(input_path, output_path, input_shapes, explicit_batch=False,
                dtype=np.float32, API='new', precision='FP32'):
    conv_params = trt.TrtConversionParams(
        precision_mode=precision, minimum_segment_size=50,
        max_workspace_size_bytes=12*1<<30, maximum_cached_engines=1)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_path, conversion_params=conv_params,
        use_dynamic_shape=explicit_batch,
        dynamic_shape_profile_strategy="Optimal")
    converter.convert()
    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [np.ones(shape=x).astype(dtype) for x in shapes]
    converter.build(input_fn)
    converter.save(output_path)
# ## Get Huggingface BERT model
from transformers import TFBertModel
# Creation of a subclass in order to define a new serving signature.
# Define input size (set any of these to None for dynamic input size).
# Note TF-TRT fails to convert with dynamic input size.
batch_size = 1
seq_length = 128
class MyOwnModel(TFBertModel):
    # Decorate the serving method with the new input_signature.
    # An input_signature represents the name, the data type and the shape of an expected input.
    @tf.function(input_signature=[{
        "input_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="input_ids"),
        "attention_mask": tf.TensorSpec((batch_size, seq_length), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((batch_size, seq_length), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # call the model to process the inputs
        output = self.call(inputs)
        # return the formatted output
        return self.serving_output(output)
# Instantiate the model with the new serving method
model = MyOwnModel.from_pretrained("bert-large-uncased")
# save it with saved_model=True in order to have a SavedModel version along with the h5 weights.
model.save_pretrained("my_hf_bert_large_model_static_shape", saved_model=True)
bert_saved_model_path = 'my_hf_bert_large_model_static_shape/saved_model/1'
# ## Convert the model with TF-TRT
bert_trt_path = bert_saved_model_path + '_trt'
input_shapes = [[(1, 128), (1, 128), (1, 128)]]
trt_convert(bert_saved_model_path, bert_trt_path, input_shapes, True, np.int32, precision='FP16')
CC: @pkanwar23 @sanjoy @WhiteFangBuck We talked about it this Wednesday ;)
- TF-TRT almost triples the model size
I have investigated why the model size triples and found two points during conversion where duplication happens.
First, there is a nearly 2X duplication of constants in the first constant folding pass. This happens because 391 Const nodes are each, directly or indirectly, the input of two distinct Identity nodes.
E.g., these nodes:
{{node unknown_42}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node Func/StatefulPartitionedCall/input/_46}} = Identity[T=DT_FLOAT](unknown_42);
{{node Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/input/_46, ^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Identity[T=DT_FLOAT](Func/StatefulPartitionedCall/StatefulPartitionedCall/input/_469);
result in these two constants:
{{node Func/StatefulPartitionedCall/input/_46}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
Second, there are 386 Const nodes that appear both in the TRT segment and in the graph after the TRT conversion pass. I think that they actually correspond to almost all of the 391 duplicate Identity ops mentioned previously, e.g., I see this Const in both graphs:
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>](^Func/StatefulPartitionedCall/StatefulPartitionedCall/input_control_node/_422);
{{node StatefulPartitionedCall/StatefulPartitionedCall/model/bert_encoder/transformer/layer_2/self_attention/attention_output/einsum/Einsum/ReadVariableOp}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [16,64,1024] values: [[-0.0599743128 -0.0193152707 -0.0130688064...]]...>]();
It's duplicated because it is an input of nodes in both graphs (3 nodes in the base graph and 2 in the TRT segment).
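To check for this kind of duplication in other graphs, constants can be grouped by a hash of their raw bytes. The following is a rough sketch under assumptions: `group_duplicate_consts` and `duplicated_bytes` are hypothetical helpers, the input is a plain list of `(name, raw_bytes)` pairs, and the node names in the example are shortened, made-up stand-ins for the ones above.

```python
import hashlib
from collections import defaultdict

def group_duplicate_consts(consts):
    """Group constant names by a hash of their raw bytes.

    `consts` is an iterable of (name, raw_bytes) pairs. With a real graph def
    these could be collected from Const nodes, for example from
    node.attr["value"].tensor.tensor_content (an assumption about how the
    weights are stored in the graph being inspected).
    """
    groups = defaultdict(list)
    for name, raw in consts:
        groups[hashlib.sha256(raw).hexdigest()].append(name)
    # Keep only payloads that appear under more than one node name.
    return [sorted(names) for names in groups.values() if len(names) > 1]

def duplicated_bytes(consts):
    """Bytes spent on copies beyond the first one of each payload."""
    sizes = defaultdict(list)
    for _, raw in consts:
        sizes[hashlib.sha256(raw).hexdigest()].append(len(raw))
    return sum(sum(s[1:]) for s in sizes.values())

# Made-up example data: two nodes share one payload, a third is unique.
example = [
    ("Func/input/_46", b"\x00" * 16),
    ("model/layer_2/ReadVariableOp", b"\x00" * 16),
    ("model/layer_3/bias", b"\x01" * 4),
]
duplicate_groups = group_duplicate_consts(example)
wasted = duplicated_bytes(example)
```

Hashing avoids comparing large tensors pairwise, which matters when hundreds of constants of this size are involved.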
Quick update on this: adding a "dependency" pass before "constfold" solves problem 1 and the graph becomes small enough to convert successfully (problem 2 remains).
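As an illustration of the reordering, here is a minimal sketch that places one pass name before another in a list of optimizer names. The starting order is assumed for illustration only, and `insert_pass_before` is a hypothetical helper; writing the resulting list into the rewriter config is not shown.

```python
def insert_pass_before(passes, new_pass, anchor):
    """Return a copy of `passes` with `new_pass` placed before the first `anchor`."""
    passes = list(passes)
    if anchor not in passes:
        raise ValueError(f"{anchor!r} is not in the pass list")
    passes.insert(passes.index(anchor), new_pass)
    return passes

# Assumed starting order, for illustration only.
current = ["constfold", "layout", "constfold"]
updated = insert_pass_before(current, "dependency", "constfold")
```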
The following code loads the BERT Large model from TF Hub and tries to convert it using TF-TRT.
The conversion fails because the converted model reaches the protobuf size limit. The following error is printed to the terminal:
Problems
1. Error message
When executing the script from a Jupyter notebook, we see only the following message, which is not helpful. TF-TRT should provide a better error message.
2. TF-TRT almost triples the model size
The frozen graph size of 1.25 GiB is the expected size of a BERT large model. The size of the converted func is unexpectedly large.
3. Protobuf size limit
There are DL models whose size is larger than 2 GiB. TF-TRT conversion will hit the protobuf size limit already at the step when a frozen func is created.
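The arithmetic behind problems 2 and 3 can be sketched as follows. The 3x factor is an upper estimate taken from "almost triples", and `fits_in_protobuf` is a hypothetical helper, not part of TF-TRT.

```python
GIB = 2**30
PROTOBUF_LIMIT_BYTES = 2**31 - 1  # a serialized message must stay below 2 GiB

def fits_in_protobuf(size_bytes):
    """Return True if a message of this size can still be serialized."""
    return size_bytes <= PROTOBUF_LIMIT_BYTES

frozen_size = int(1.25 * GIB)        # frozen graph size reported above
converted_estimate = 3 * frozen_size  # assumed upper estimate after conversion

frozen_ok = fits_in_protobuf(frozen_size)
converted_ok = fits_in_protobuf(converted_estimate)
```

Under these assumptions the frozen graph fits but the converted one does not. Even a 2X duplication (2.5 GiB) would already exceed the limit.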
Tagging @bixia1 and @DEKHTIARJonathan.