tensorflow / models

Models and examples built with TensorFlow

TF2 exporter doesn't support dynamic batch size #9358

Open chad-green opened 4 years ago

chad-green commented 4 years ago

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/exporter_main_v2.py

2. Describe the bug

exporter_main_v2.py does not support changing the input shape. This feature was added to the TF1 exporter (export_inference_graph.py) in #2053, but it looks like it is not included in the current TF2 version. As a result, all exported saved models have a fixed batch size of 1. In order to enable dynamic batching in Triton Inference Server, we need a dynamic batch size. Please see the related issue and the recommended fix on the Triton GitHub: https://github.com/triton-inference-server/server/issues/2097
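For reference, the fixed batch size is visible in the exported model's serving signature. A minimal sketch of how to check it (the path is a placeholder, not from my setup):

import tensorflow as tf

# Load the SavedModel produced by exporter_main_v2.py (placeholder path) and
# print the serving signature to see the batch dimension.
model = tf.saved_model.load("exported-model/saved_model")
serving_fn = model.signatures["serving_default"]
print(serving_fn.structured_input_signature)
# Expect a TensorSpec of shape (1, None, None, 3): the leading 1 is the fixed
# batch size that prevents Triton from forming dynamic batches.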

3. Steps to reproduce

Follow the tutorial on exporting a trained model: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#exporting-a-trained-model

4. Expected behavior

exporter_main_v2.py should accept an input_shape argument, defaulting to [None, None, None, 3], so that exported models support dynamic batching, similar to the TF1 export_inference_graph.py.
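A rough sketch of how such a flag might be wired up; the flag name and parsing helper here are my suggestion, not existing code:

from absl import flags

# Hypothetical flag for exporter_main_v2.py: -1 marks a dynamic dimension.
flags.DEFINE_string(
    'input_shape', '-1,-1,-1,3',
    'Comma-separated input shape; -1 means a dynamic (None) dimension.')

def parse_input_shape(flag_value):
  # Turns '-1,-1,-1,3' into [None, None, None, 3] for use in a tf.TensorSpec.
  return [None if int(dim) == -1 else int(dim) for dim in flag_value.split(',')]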

chad-green commented 4 years ago

Hello, any updates on this issue/feature request? If I'm doing it wrong, go ahead and let me know. I'm just trying to get dynamic batching to work on Triton Inference Server. @tombstone

lsrock1 commented 4 years ago

https://github.com/tensorflow/models/blob/7beddae1ff7207e7738693cdcdec389d16be83d3/research/object_detection/exporter_lib_v2.py#L133

How about just changing this line to shape=[None, None, None, 3]? It worked for me.
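In context, the change is roughly the following (a sketch only: the real class in exporter_lib_v2.py wraps the detection model, and the class name may differ at that commit):

import tensorflow as tf

class DetectionInferenceModule(tf.Module):
  # Sketch: relax the fixed batch dimension in the exported input signature.
  @tf.function(input_signature=[
      # was: tf.TensorSpec(shape=[1, None, None, 3], dtype=tf.uint8)
      tf.TensorSpec(shape=[None, None, None, 3], dtype=tf.uint8)
  ])
  def __call__(self, input_tensor):
    return self._run_inference_on_images(input_tensor)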

chad-green commented 4 years ago

Thanks for the reply and suggestion, @lsrock1. I checked out 7beddae and made the change. Here's the stack trace:

Traceback (most recent call last):
  File "exporter_main_v2.py", line 159, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "exporter_main_v2.py", line 155, in main
    FLAGS.side_input_types, FLAGS.side_input_names)
  File "/home/tensorflow/models/research/object_detection/exporter_lib_v2.py", line 259, in export_inference_graph
    concrete_function = detection_module.__call__.get_concrete_function()
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 1167, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 1073, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 697, in _initialize
    *args, **kwds))
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 973, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /home/tensorflow/models/research/object_detection/exporter_lib_v2.py:142 call_func  *
        return self._run_inference_on_images(input_tensor, **kwargs)
    /home/tensorflow/models/research/object_detection/exporter_lib_v2.py:107 _run_inference_on_images  *
        detections = self._model.postprocess(prediction_dict, shapes)
    /home/tensorflow/models/research/object_detection/meta_architectures/center_net_meta_arch.py:2890 postprocess  *
        boxes_strided, classes, scores, num_detections = (
    /home/tensorflow/models/research/object_detection/meta_architectures/center_net_meta_arch.py:357 prediction_tensors_to_boxes  *
        heights, widths = tf.unstack(height_width, axis=2)
    /home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201 wrapper  **
        return target(*args, **kwargs)
    /home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1558 unstack
        raise ValueError("Cannot infer num from shape %s" % value_shape)

    ValueError: Cannot infer num from shape (None, 100, None)

Were there any other changes? I didn't give you my environment before, but I'm running tf2.3.0-gpu from a Docker container. The Dockerfile is attached.

Dockerfile.txt

chad-green commented 4 years ago

It looks like this issue is specific to the CenterNet architecture. I confirmed that your change works for faster_rcnn_resnet101_v1_640x640_coco17_tpu-8 (at least I didn't get any errors; I haven't tested dynamic batching in Triton yet), but not for centernet_resnet101.

I did double-check that export works fine on CenterNet without the code change to line 133 of exporter_lib_v2.py. I wanted to make sure I wasn't sending the wrong command.
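For what it's worth, the ValueError can be reproduced outside the exporter. A minimal sketch, assuming the cause is that CenterNet's post-processing unstacks along a dimension whose size is no longer known statically once the signature is fully dynamic:

import tensorflow as tf

# prediction_tensors_to_boxes does `tf.unstack(height_width, axis=2)`; when
# the unstack axis has unknown static size, tf.unstack cannot infer how many
# tensors to produce and tracing fails.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 100, None], dtype=tf.float32)
])
def split_height_width(height_width):
  heights, widths = tf.unstack(height_width, axis=2)
  return heights, widths

try:
  split_height_width.get_concrete_function()
except ValueError as e:
  print(e)  # Cannot infer num from shape (None, 100, None)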

chad-green commented 4 years ago

I confirmed that your change enabled dynamic batching for the Faster R-CNN architecture on Triton Inference Server. I'll link your solution there as well. We still need a fix for CenterNet, though. Thanks!
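If anyone wants to sanity-check a re-exported model locally before wiring up Triton, something like this should work (the path and image size are placeholders):

import numpy as np
import tensorflow as tf

# Load the re-exported SavedModel and feed a batch of two dummy images; with
# the relaxed input signature this runs instead of rejecting batch sizes
# other than 1.
detect_fn = tf.saved_model.load("exported-model/saved_model")
batch = tf.constant(np.zeros((2, 640, 640, 3), dtype=np.uint8))
detections = detect_fn(batch)
print(detections["num_detections"])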

lsrock1 commented 4 years ago

I am glad that it helped!

chad-green commented 4 years ago

Can we leave this open until we get a solution to the error related to centernet, though? That's the one I actually need.

aniketdch commented 3 years ago

Can we leave this open until we get a solution to the error related to centernet, though? That's the one I actually need.

+1

Trajanson commented 3 years ago

Is there a fix for this yet?

ls-da3m0ns commented 3 years ago

Any updates on batch inference support in CenterNet models?

Andrei997 commented 3 years ago

Still looking for a fix on this one.

tunahansalih commented 2 years ago

I am also looking for a fix for this.

zhjunqin commented 2 years ago

I met the same problem; is there any plan to fix this? Thanks.

ArunAniyan commented 1 year ago

@chad-green, any update on the CenterNet model? Do we have a fix yet?