fschvart opened this issue 2 years ago
Hi @fschvart,
At the moment, our codebase is not optimised for GPU training, so I'm unable to help you with increasing the batch size or improving memory management. Something you might want to look into is running our codebase in eager mode instead of graph mode. However, that might require some work on your end.
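For reference, switching to eager mode can be as simple as flipping the standard TensorFlow 2.x switch early in the training script; exactly where to hook this into our trainer is something you would have to work out:

```python
import tensorflow as tf

# Run all tf.function-decorated code eagerly instead of tracing it into a
# graph. Handy for debugging data-dependent Python control flow, but expect
# a significant slowdown compared to graph mode.
tf.config.run_functions_eagerly(True)
```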
I have never tried to export the model, but judging from your error message, you might want to check out tf.cond: https://www.tensorflow.org/api_docs/python/tf/cond .
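Looking at the bottom of your traceback, the failure comes from a data-dependent Python `if` on a symbolic tensor (`if tf.reduce_all(tf.equal(frame1, frame2)):` in `_add_previous_heatmap_to_input`). A rough sketch of how such a check can be expressed with tf.cond so it traces into a graph; the branch bodies below are placeholders, not the actual deeplab2 logic:

```python
import tensorflow as tf

def _add_previous_heatmap_sketch(frame1, frame2, prev_heatmap):
  # tf.cond builds both branches into the graph and selects one at runtime
  # based on the symbolic predicate, instead of evaluating it in Python.
  frames_equal = tf.reduce_all(tf.equal(frame1, frame2))
  return tf.cond(
      frames_equal,
      true_fn=lambda: tf.zeros_like(prev_heatmap),  # placeholder branch
      false_fn=lambda: prev_heatmap)                # placeholder branch
```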
I trained Motion-DeepLab on the MOTChallenge dataset according to the instructions. The training process seems to have run fine (side note: is it reasonable that I could only use a batch size of 1 due to CUDA OOM errors? I use an RTX 3090 with 24 GB of memory). I used the --use_two_frames flag for the dataset preparation.
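In case it's relevant, the only TensorFlow-level memory setting I know of to try is the standard tf.config memory-growth API (nothing deeplab2-specific; it only changes allocation behaviour, not how much memory the model actually needs), roughly:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (almost) all of it up
# front. This may or may not help with the batch-size-1 limit.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```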
Here's the full log and error message I get:
2022-09-02 18:36:39.055095: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-02 18:36:40.055893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21670 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6
2022-09-02 18:36:40.057630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21670 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6
I0902 18:36:40.674806 8796 motion_deeplab.py:53] Synchronized Batchnorm is used.
I0902 18:36:40.675889 8796 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0902 18:36:40.896352 8796 motion_deeplab.py:109] Setting pooling size to (68, 121)
I0902 18:36:40.897225 8796 aspp.py:141] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0902 18:36:40.898226 8796 aspp.py:141] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D913410D0>, because it is not built.
W0902 18:36:57.138313 8796 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D913410D0>, because it is not built.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D8F3805B0>, because it is not built.
W0902 18:36:57.139913 8796 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x0000021D8F3805B0>, because it is not built.
Traceback (most recent call last):
File "C:\deeplab2\export_model.py", line 157, in <module>
app.run(main)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\absl\app.py", line 308, in run
_run_main(main, args)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\absl\app.py", line 254, in _run_main
sys.exit(main(argv))
File "C:\deeplab2\export_model.py", line 152, in main
tf.saved_model.save(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1290, in save
save_and_return_nodes(obj, export_dir, signatures, options)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1325, in save_and_return_nodes
_build_meta_graph(obj, signatures, options, meta_graph_def))
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1491, in _build_meta_graph
return _build_meta_graph_impl(obj, signatures, options, meta_graph_def)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 1443, in _build_meta_graph_impl
saveable_view = _SaveableView(augmented_graph_view, options)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 229, in init
self.augmented_graph_view.objects_ids_and_slot_variables_and_paths())
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 544, in objects_ids_and_slot_variables_and_paths
trackable_objects, node_paths = self._breadth_first_traversal()
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 255, in _breadth_first_traversal
for name, dependency in self.list_children(current_trackable):
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\saved_model\save.py", line 143, in list_children
for name, child in super(_AugmentedGraphView, self).list_children(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\training\tracking\graph_view.py", line 203, in list_children
in obj._trackable_children(save_type, kwargs).items()]
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\engine\training.py", line 3201, in _trackable_children
children = super(Model, self)._trackable_children(save_type, kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\engine\base_layer.py", line 3174, in _trackable_children
children = self._trackable_saved_model_saver.trackable_children(cache)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\base_serialization.py", line 59, in trackable_children
children = self.objects_to_serialize(serialization_cache)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 68, in objects_to_serialize
return (self._get_serialized_attributes(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 88, in _get_serialized_attributes
object_dict, function_dict = self._get_serialized_attributes_internal(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\model_serialization.py", line 56, in _get_serialized_attributes_internal
super(ModelSavedModelSaver, self)._get_serialized_attributes_internal(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\layer_serialization.py", line 98, in _get_serialized_attributes_internal
functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 197, in wrap_layer_functions
fn.get_concrete_function()
File "C:\Users\FabianDual3\miniconda3\lib\contextlib.py", line 126, in exit
next(self.gen)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 359, in tracing_scope
fn.get_concrete_function(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 1239, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 1219, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 785, in _initialize
self._stateful_fn._get_concrete_function_internal_garbage_collected( # pylint: disable=protected-access
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2480, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2711, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\function.py", line 2627, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\framework\func_graph.py", line 1141, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\eager\def_function.py", line 677, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 572, in wrapper
ret = method(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 168, in wrap_with_training_arg
return control_flow_util.smart_cond(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\utils\control_flow_util.py", line 105, in smart_cond
return tf.__internal__.smart_cond.smart_cond(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\framework\smart_cond.py", line 55, in smart_cond
return false_fn()
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 170, in
lambda: replace_training_and_call(False))
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 166, in replace_training_and_call
return wrapped_call(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 652, in call
return call_and_return_conditional_losses(inputs, *args, **kwargs)[0]
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 610, in call
return self.wrapped_call(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 572, in wrapper
ret = method(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 168, in wrap_with_training_arg
return control_flow_util.smart_cond(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\utils\control_flow_util.py", line 105, in smart_cond
return tf.__internal__.smart_cond.smart_cond(
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 170, in
lambda: replace_training_and_call(False))
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\utils.py", line 166, in replace_training_and_call
return wrapped_call(*args, **kwargs)
File "C:\Users\FabianDual3\miniconda3\lib\site-packages\keras\saving\saved_model\save_impl.py", line 634, in call_and_return_conditional_losses
call_output = layer_call(*args, **kwargs)
File "c:\deeplab2\video\motion_deeplab.py", line 128, in call
input_tensor = self._add_previous_heatmap_to_input(input_tensor)
File "c:\deeplab2\video\motion_deeplab.py", line 184, in _add_previous_heatmap_to_input
if tf.reduce_all(tf.equal(frame1, frame2)):
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: Using a symbolic `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.