google-research / deeplab2

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.
Apache License 2.0

Can't export Motion Deeplab trained on MOTChallenge #129

Open fschvart opened 2 years ago

fschvart commented 2 years ago

I trained Motion-DeepLab on MOTChallenge according to the instructions, preparing the dataset with the --use_two_frames flag. The training process seems to have run fine (side note: is it reasonable that I could only use batch size = 1 due to CUDA OOM errors? I'm using an RTX 3090 with 24 GB of memory). Exporting the trained model fails, however.

Here's the full log and error message I get:

```
2022-09-02 18:36:39.055095: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-02 18:36:40.055893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21670 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2022-09-02 18:36:40.057630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21670 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6
I0902 18:36:40.674806 8796 motion_deeplab.py:53] Synchronized Batchnorm is used.
I0902 18:36:40.675889 8796 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0902 18:36:40.896352 8796 motion_deeplab.py:109] Setting pooling size to (68, 121)
I0902 18:36:40.897225 8796 aspp.py:141] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0902 18:36:40.898226 8796 aspp.py:141] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D913410D0>, because it is not built.
W0902 18:36:57.138313 8796 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D913410D0>, because it is not built.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D8F3805B0>, because it is not built.
W0902 18:36:57.139913 8796 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D8F3805B0>, because it is not built.
Traceback (most recent call last):
  File "C:\deeplab2\export_model.py", line 157, in <module>
    app.run(main)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "C:\deeplab2\export_model.py", line 152, in main
    tf.saved_model.save(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1290, in save
    save_and_return_nodes(obj, export_dir, signatures, options)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1325, in save_and_return_nodes
    _build_meta_graph(obj, signatures, options, meta_graph_def))
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1491, in _build_meta_graph
    return _build_meta_graph_impl(obj, signatures, options, meta_graph_def)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1443, in _build_meta_graph_impl
    saveable_view = _SaveableView(augmented_graph_view, options)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 229, in __init__
    self.augmented_graph_view.objects_ids_and_slot_variables_and_paths())
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 544, in objects_ids_and_slot_variables_and_paths
    trackable_objects, node_paths = self._breadth_first_traversal()
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 255, in _breadth_first_traversal
    for name, dependency in self.list_children(current_trackable):
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 143, in list_children
    for name, child in super(_AugmentedGraphView, self).list_children(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 203, in list_children
    in obj._trackable_children(save_type, **kwargs).items()]
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\engine\training.py", line 3201, in _trackable_children
    children = super(Model, self)._trackable_children(save_type, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\engine\base_layer.py", line 3174, in _trackable_children
    children = self._trackable_saved_model_saver.trackable_children(cache)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\base_serialization.py", line 59, in trackable_children
    children = self.objects_to_serialize(serialization_cache)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 68, in objects_to_serialize
    return (self._get_serialized_attributes(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 88, in _get_serialized_attributes
    object_dict, function_dict = self._get_serialized_attributes_internal(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\model_serialization.py", line 56, in _get_serialized_attributes_internal
    super(ModelSavedModelSaver, self)._get_serialized_attributes_internal(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 98, in _get_serialized_attributes_internal
    functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 197, in wrap_layer_functions
    fn.get_concrete_function()
  File "C:\Users\FabianDual3\miniconda3\lib\contextlib.py", line 126, in __exit__
    next(self.gen)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 359, in tracing_scope
    fn.get_concrete_function(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 1239, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 1219, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 785, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2480, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2711, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2627, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\framework\func_graph.py", line 1141, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 677, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 572, in wrapper
    ret = method(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 168, in wrap_with_training_arg
    return control_flow_util.smart_cond(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\utils\control_flow_util.py", line 105, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\framework\smart_cond.py", line 55, in smart_cond
    return false_fn()
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 170, in <lambda>
    lambda: replace_training_and_call(False))
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 166, in replace_training_and_call
    return wrapped_call(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 652, in call
    return call_and_return_conditional_losses(inputs, *args, **kwargs)[0]
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 610, in call
    return self.wrapped_call(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 572, in wrapper
    ret = method(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 168, in wrap_with_training_arg
    return control_flow_util.smart_cond(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\utils\control_flow_util.py", line 105, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 170, in <lambda>
    lambda: replace_training_and_call(False))
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 166, in replace_training_and_call
    return wrapped_call(*args, **kwargs)
  File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 634, in call_and_return_conditional_losses
    call_output = layer_call(*args, **kwargs)
  File "c:\deeplab2\video\motion_deeplab.py", line 128, in call
    input_tensor = self._add_previous_heatmap_to_input(input_tensor)
  File "c:\deeplab2\video\motion_deeplab.py", line 184, in _add_previous_heatmap_to_input
    if tf.reduce_all(tf.equal(frame1, frame2)):
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: Using a symbolic tf.Tensor as a Python bool is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
```

markweberdev commented 1 year ago

Hi @fschvart,

At the moment, our codebase is not optimised for GPU training, so I'm unable to help with increasing the batch size or improving the memory management. One thing you might want to look into is running our codebase in eager mode rather than graph mode (see the sketch below), although that might require some work on your end.
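
For reference, a minimal sketch of how to switch tf.function-compiled code to eager execution for debugging. This is the standard TensorFlow switch rather than anything DeepLab2-specific, and whether it actually helps with the OOM issue here is untested:

```python
import tensorflow as tf

# Run code wrapped in tf.function eagerly instead of compiling it to a graph.
# This makes it easier to inspect intermediate tensors and memory behaviour,
# at the cost of slower training steps.
tf.config.run_functions_eagerly(True)
```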

I have never tried to export the model, but judging from your error message, you might want to look at https://www.tensorflow.org/api_docs/python/tf/cond. The export fails because _add_previous_heatmap_to_input uses a symbolic tensor in a Python if statement (if tf.reduce_all(tf.equal(frame1, frame2)):), which is not allowed while the model is being traced into a graph for saving.
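
For illustration, here is a minimal sketch of the graph-compatible pattern using tf.cond. The function name, arguments, and branch bodies below are placeholders for the idea, not the actual Motion-DeepLab logic:

```python
import tensorflow as tf

def add_previous_heatmap(frame1, frame2, previous_heatmap):
  # Graph-unfriendly pattern (what the traceback points at):
  #   if tf.reduce_all(tf.equal(frame1, frame2)):
  #       ...
  # During SavedModel tracing the condition is a symbolic tensor, so it cannot
  # be evaluated as a Python bool. tf.cond keeps both branches in the graph
  # and selects between them at run time.
  frames_identical = tf.reduce_all(tf.equal(frame1, frame2))
  return tf.cond(
      frames_identical,
      true_fn=lambda: tf.zeros_like(previous_heatmap),  # placeholder branch
      false_fn=lambda: previous_heatmap)                # placeholder branch
```

Both branches must return tensors of the same shape and dtype, which is the main constraint to watch when rewriting the original condition this way.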