google-research / deeplab2

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.
Apache License 2.0

Exported saved_model file performs badly compared to eval on the same images #110

Closed fschvart closed 2 years ago

fschvart commented 2 years ago

Hi,

I trained two semantic segmentation ("panoptic segmentation") models, one with a resnet_beta backbone and one with a swidernet backbone, and I have the same issue with both. When I try to export the 60k checkpoint to a saved_model (if there's another format that might work better, I'm happy to try it), the export completes, but the models deliver inferior performance compared with the latest eval of the same checkpoint on the same images. The general shape of the masks is reasonable, but some spots are missing and at the same time other spots appear randomly in the image.

I already tried disabling axial_use_recompute_grad, and that didn't help. Also, after the export process there is one file in the saved_model folder and two in the variables folder, but none in the assets folder.

I'm using Windows 11 and an RTX 3090.

Here's the output during the export process:

2022-07-12 16:09:39.451488: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-12 16:09:40.292519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21676 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-07-12 16:09:40.293903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 3624 MB memory: -> device: 1, name: Quadro P2200, pci bus id: 0000:02:00.0, compute capability: 6.1
I0712 16:09:40.875252 10032 deeplab.py:57] Synchronized Batchnorm is used.
I0712 16:09:40.877247 10032 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': False, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0712 16:09:40.994932 10032 deeplab.py:96] Setting pooling size to (27, 41)
I0712 16:09:40.994932 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:41.424783 10032 api.py:459] Eval with scales ListWrapper([1.0])
I0712 16:09:41.576378 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:41.577375 10032 api.py:459] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:09:44.969014 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:46.971171 10032 api.py:459] Eval with scales ListWrapper([1.0])
I0712 16:09:46.973166 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:46.974163 10032 api.py:459] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:09:47.611459 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:48.245076 10032 api.py:459] Eval with scales ListWrapper([1.0])
I0712 16:09:48.247071 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:48.248068 10032 api.py:459] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:09:48.894341 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:49.202517 10032 api.py:459] Eval with scales ListWrapper([1.0])
I0712 16:09:49.204511 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:49.204511 10032 api.py:459] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:09:49.855770 10032 api.py:459] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x000001398E79FE20>, because it is not built.
W0712 16:09:52.798902 10032 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x000001398E79FE20>, because it is not built.
WARNING:tensorflow:Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x000001398E79F3D0>, because it is not built.
W0712 16:09:52.799900 10032 save_impl.py:71] Skipping full serialization of Keras layer <deeplab2.model.layers.resized_fuse.ResizedFuse object at 0x000001398E79F3D0>, because it is not built.
I0712 16:09:57.817487 10032 deeplab.py:145] Eval with scales ListWrapper([1.0])
I0712 16:09:57.818226 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:09:57.818226 10032 deeplab.py:153] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:09:58.143361 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:10:00.117084 10032 deeplab.py:145] Eval with scales ListWrapper([1.0])
I0712 16:10:00.118082 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:10:00.118082 10032 deeplab.py:153] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:10:00.223799 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:10:01.196901 10032 deeplab.py:145] Eval with scales ListWrapper([1.0])
I0712 16:10:01.197899 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0712 16:10:01.197899 10032 deeplab.py:153] Eval scale 1.0; setting pooling size to [27, 41]
I0712 16:10:01.820235 10032 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
W0712 16:10:15.740450 10032 save.py:233] Found untraced functions such as semantic_decoder_layer_call_fn, semantic_decoder_layer_call_and_return_conditional_losses, semantic_head_layer_call_fn, semantic_head_layer_call_and_return_conditional_losses, conv1_bn_act_layer_call_fn while saving (showing 5 of 408). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: .\savedmodel\assets
I0712 16:10:27.721305 10032 builder_impl.py:779] Assets written to: .\savedmodel\assets

This warning repeats hundreds of times during inference:

WARNING:absl:Importing a function (__inference_internal_grad_fn_85642) with ops with unsaved custom gradients. Will likely fail if a gradient is requested
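For reference, here is roughly how I run inference on the exported model (a minimal sketch; the image handling and the 'semantic_pred' output key follow the DeepLab demo notebook, and the test image path is a placeholder):

import cv2
import numpy as np
import tensorflow as tf

# Load the saved_model directory produced by the export above.
loaded_model = tf.saved_model.load(r'.\savedmodel')

# The exported signature expects a single 3D uint8 tensor [height, width, 3].
image = cv2.imread(r'.\test_image.png')  # hypothetical eval image
result = loaded_model(tf.cast(image, tf.uint8))

# 'semantic_pred' is the output key used in the DeepLab demo notebook.
semantic_mask = result['semantic_pred'].numpy()[0]
print(np.unique(semantic_mask))  # sanity-check which labels show up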

I'd appreciate your help!

aquariusjay commented 2 years ago

Hi @fschvart,

We are not sure what is happening on your end, and it is hard for us to reproduce the error. I would suggest experimenting with the provided saved_models (check the DeepLab_demo) and comparing a provided one with your exported one.
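For example, something along these lines (a rough sketch; the paths are placeholders, the 'semantic_pred' key follows the demo, and it assumes both models were exported with the same eval crop size):

import numpy as np
import tensorflow as tf

# One provided saved_model from the model zoo, and your own export.
reference = tf.saved_model.load('/path/to/provided_savedmodel')
exported = tf.saved_model.load('/path/to/your_savedmodel')

# Both exports take a single 3D uint8 image tensor.
image = tf.io.decode_image(tf.io.read_file('/path/to/test.png'))

ref_pred = reference(image)['semantic_pred'].numpy()
new_pred = exported(image)['semantic_pred'].numpy()

# Fraction of pixels on which the two exports disagree.
print('disagreement:', np.mean(ref_pred != new_pred))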

Cheers,

fschvart commented 2 years ago

Thanks for your response! I have a pretty unique dataset, so I don't think a provided model is going to work as well as the model I trained on my own data.

From my understanding, it seems like the model isn't loaded correctly from the checkpoint and config file. (Is it possible that the code expects instance segmentation in addition to semantic segmentation?) Would it be possible to add a model.save command (or anything else that would let me save a model) every 5k steps, just as it generates the checkpoint? I'm OK with retraining the model and adding a save command if that would help. Thanks!
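To be concrete, here's roughly what I have in mind, something I could run by hand after each checkpoint lands (a sketch assuming the DeepLabModule class in deeplab2's export_model.py can be imported and reused as-is; paths are from my setup):

import tensorflow as tf
from google.protobuf import text_format
from deeplab2 import config_pb2
from deeplab2 import export_model  # assumption: DeepLabModule is reusable

config = config_pb2.ExperimentOptions()
with tf.io.gfile.GFile(r'.\configs\semseg.textproto', 'r') as f:
  text_format.Parse(f.read(), config)

# Wrap the 5k-step checkpoint and save it the same way export_model.py does.
module = export_model.DeepLabModule(config, r'.\trainer\ckpt\ckpt-5000')
signatures = module.__call__.get_concrete_function(module.get_input_spec())
tf.saved_model.save(module, r'.\savedmodel_5000', signatures=signatures)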

camblomquist commented 2 years ago

I wanted to chime in with a similar issue that I was able to reproduce using a model trained on KITTI-STEP, though it may be an issue elsewhere. I need my model in ONNX format. The experiment was exported to a saved_model and then converted using tf2onnx. Evaluating an image with this converted model produces garbage output. I have an ONNX model from the same KITTI-STEP experiment that works as intended, but that conversion was likely done from either a frozen graph or the checkpoints, and I'm honestly unsure which, or how I obtained the frozen graph.

fschvart commented 2 years ago


Were you able to get good results when running inference from the saved_model? Are you saying that you obtained the model that works well by loading a checkpoint + config file and then, instead of saving it as a saved_model, exporting that same model directly to ONNX?

camblomquist commented 2 years ago


Sorry to keep you waiting. I just confirmed that I was not able to get good inference directly from the saved_model, so it seems that something is getting lost somewhere in the checkpoint -> saved_model process. Exporting to ONNX from the checkpoint + config does produce the expected results. At the very least, I've shown that my specific issue isn't with tf2onnx. I also get the same warnings about unsaved gradients; it may be a red herring, but it seems like as good a place as any to start investigating, even though the model doesn't error out later on. I'm not suggesting you switch to ONNX, of course; I'm just working with a system that already uses it.
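For what it's worth, the check on the ONNX side is nothing fancy (a sketch assuming onnxruntime is installed and that the converted graph keeps the single uint8 image input the export signature declares; the file names are placeholders):

import cv2
import numpy as np
import onnxruntime as ort

# Session over the model produced by tf2onnx.
session = ort.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name

image = cv2.imread('frame.png').astype(np.uint8)  # hypothetical test frame
outputs = session.run(None, {input_name: image})
print([o.shape for o in outputs])  # garbage shows up in the decoded masks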

fschvart commented 2 years ago

@camblomquist thank you so much for your comment! This is super helpful! I actually need to export the model to C++ eventually, so ONNX would be great :)

From what I understand, it's not a straightforward process to go from config + checkpoint to ONNX (it works with checkpoint + meta file, but not with a config, unless I'm missing something). Did you have to dive deep into the code to make it work?
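The closest thing I can see would be tf2onnx's Python API, something like this (a completely untested sketch on my side; it assumes DeepLabModule from export_model.py can be reused, and that tf2onnx.convert.from_function accepts the module's concrete signature):

import tensorflow as tf
import tf2onnx
from google.protobuf import text_format
from deeplab2 import config_pb2
from deeplab2 import export_model  # assumption: DeepLabModule is reusable

config = config_pb2.ExperimentOptions()
with tf.io.gfile.GFile(r'.\configs\semseg.textproto', 'r') as f:
  text_format.Parse(f.read(), config)

module = export_model.DeepLabModule(config, r'.\trainer\ckpt\ckpt-60000')
# Convert the module's traced call signature straight to ONNX.
tf2onnx.convert.from_function(
    module.__call__,
    input_signature=[module.get_input_spec()],
    output_path='model.onnx')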

fschvart commented 2 years ago

@aquariusjay Just to add some more information: I just tried to load a model from the config file and checkpoint and run inference, and I get bad results too. For some reason, loading the model this way behaves differently from running train.py in eval mode, which works.

Here's the code that I used to run inference:

import functools
from typing import Any, MutableMapping, Text

import cv2
import numpy as np
import tensorflow as tf
from google.protobuf import text_format

from deeplab2 import config_pb2
from deeplab2.data import dataset
from deeplab2.data.preprocessing import input_preprocessing
from deeplab2.model import utils
from deeplab2.trainer import train_lib


class DeepLabModule(tf.Module):
  """Class that runs DeepLab inference end-to-end."""

  def __init__(self, config: config_pb2.ExperimentOptions, ckpt_path: Text):
    super().__init__(name='DeepLabModule')

    dataset_options = config.eval_dataset_options
    dataset_name = dataset_options.dataset
    crop_height, crop_width = dataset_options.crop_size

    config.evaluator_options.merge_semantic_and_instance_with_tf_op = False
    # Disable drop path and recompute grad as they are only used in training.
    config.model_options.backbone.drop_path_keep_prob = 1.0

    deeplab_model = train_lib.create_deeplab_model(
        config,
        dataset.MAP_NAME_TO_DATASET_INFO[dataset_name])
    meta_architecture = config.model_options.WhichOneof('meta_architecture')

    # For now we only support batch size of 1 for saved model.
    input_shape = train_lib.build_deeplab_model(
        deeplab_model, (crop_height, crop_width), batch_size=1)
    self._input_depth = input_shape[-1]

    checkpoint = tf.train.Checkpoint(**deeplab_model.checkpoint_items)
    # Not all saved variables (e.g. variables from optimizer) will be restored.
    # `expect_partial()` to suppress the warning.
    checkpoint.restore(ckpt_path).expect_partial()
    self._model = deeplab_model

    self._preprocess_fn = functools.partial(
        input_preprocessing.preprocess_image_and_label,
        label=None,
        crop_height=crop_height,
        crop_width=crop_width,
        prev_label=None,
        min_resize_value=dataset_options.min_resize_value,
        max_resize_value=dataset_options.max_resize_value,
        resize_factor=dataset_options.resize_factor,
        is_training=False)

  def get_input_spec(self):
    """Returns TensorSpec of input tensor needed for inference."""
    # We expect a single 3D, uint8 tensor with shape [height, width, channels].
    return tf.TensorSpec(shape=[None, None, self._input_depth], dtype=tf.uint8)

  @tf.function
  def __call__(self, input_tensor: tf.Tensor) -> MutableMapping[Text, Any]:
    """Performs a forward pass.

    Args:
      input_tensor: An uint8 input tensor of type tf.Tensor with shape [height,
        width, channels].

    Returns:
      A dictionary containing the results of the specified DeepLab architecture.
      The results are bilinearly upsampled to input size before returning.
    """
    input_size = [tf.shape(input_tensor)[0], tf.shape(input_tensor)[1]]

    (resized_image, processed_image, _, _, _, _) = self._preprocess_fn(
        image=input_tensor)

    resized_size = tf.shape(resized_image)[0:2]
    # Making input tensor to 4D to fit model input requirements.
    outputs = self._model(tf.expand_dims(processed_image, 0), training=False)
    # We only undo-preprocess for those defined in tuples in model/utils.py.
    return utils.undo_preprocessing(outputs, resized_size, input_size)


def get_binary_mask(ins):
  # Take the semantic prediction for the single image in the batch and
  # turn label id 1 into a 255-valued binary mask.
  mask = ins['semantic_pred'][0]
  result_mask = np.zeros((mask.shape[0], mask.shape[1]), np.uint8)
  result_mask[:, :] = np.where(mask[:, :] == 1, 255, result_mask[:, :])
  return result_mask


# Raw strings so backslashes in Windows paths are not treated as escapes.
config_path = r'.\configs\semseg.textproto'
config = config_pb2.ExperimentOptions()
with tf.io.gfile.GFile(config_path, 'r') as f:
  text_format.Parse(f.read(), config)

module = DeepLabModule(config, r'.\trainer\ckpt\ckpt-60000')
im = cv2.imread(image_path)  # image_path: path to one of the eval images
output = module(im)
pred = get_binary_mask(output)

(pred is different from the predicted label for the same image that appears in the vis folder)

fschvart commented 2 years ago

I'd really appreciate your help :)

fschvart commented 2 years ago

The issue I was facing was potentially due to some bugs in the deeplab_demo notebook code, which is where I took my inference code from. In my case I have only one semantic label, so when I used the following code, I got good results with the saved_model conversion:

result = LOADED_MODEL(tf.cast(image, tf.uint8))
b = result['semantic_pred'].numpy()[0] * 255
mask2 = np.stack((b,) * 3, axis=-1)
mask2 = mask2.astype('uint8')
c = cv2.addWeighted(image, 1, mask2, 0.5, 0.0)

c then held the image overlaid with the correct inference output.
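For anyone landing here with more than one semantic label, the same overlay idea generalizes by mapping each class id to a color. A sketch, reusing LOADED_MODEL and image from the snippet above; the palette is arbitrary:

import cv2
import numpy as np
import tensorflow as tf

result = LOADED_MODEL(tf.cast(image, tf.uint8))
labels = result['semantic_pred'].numpy()[0]

# Arbitrary per-class colors; index 0 (background) stays black.
palette = np.array([[0, 0, 0], [0, 0, 255], [0, 255, 0], [255, 0, 0]],
                   dtype=np.uint8)
color_mask = palette[np.clip(labels, 0, len(palette) - 1)]
overlay = cv2.addWeighted(image, 1.0, color_mask, 0.5, 0.0)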