tensorflow / models

Models and examples built with TensorFlow

Memory leak in SSDMetaArch #9981

Open dansitu opened 3 years ago

dansitu commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/38f1ebe031544418a36edda387c9234145480e53/research/object_detection/meta_architectures/ssd_meta_arch.py#L525

2. Describe the bug

The 'predict' method of SSDMetaArch has a memory leak: it leaks 100-200 MB for each batch of 32 320x320 images during my training loop.

3. Steps to reproduce

The leak can be observed in this official tutorial:

https://github.com/tensorflow/models/blob/38f1ebe031544418a36edda387c9234145480e53/research/object_detection/colab_tutorials/eager_few_shot_od_training_tflite.ipynb

To see it happening, install memory-profiler and decorate train_step_fn with @profile.

The leak happens regardless of whether execution occurs inside a GradientTape context.

4. Expected behavior

RAM usage should not increase between calls to predict.
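One standard-library way to check this expectation is a sketch like the one below. It uses tracemalloc in place of memory-profiler, with toy leaky_predict/clean_predict functions standing in for model.predict (both names are hypothetical, for illustration only). Note that tracemalloc only sees Python-heap allocations, so native TensorFlow allocations would still need an RSS-based tool like memory-profiler:

```python
import tracemalloc

# Toy stand-ins: `leaky_predict` retains a reference on every call, the way a
# leaking predict() might hold onto tensors; `clean_predict` retains nothing.
_cache = []

def leaky_predict(batch):
    result = [x * 2 for x in batch]
    _cache.append(result)          # retained reference -> memory grows per call
    return result

def clean_predict(batch):
    return [x * 2 for x in batch]  # nothing retained between calls

def growth_per_call(fn, batch, calls=50):
    """Approximate Python-heap growth per call of fn, in bytes."""
    tracemalloc.start()
    fn(batch)  # warm-up call so one-time allocations are excluded
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(calls):
        fn(batch)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / calls

batch = list(range(1000))
print(growth_per_call(leaky_predict, batch) > 1000)  # steady growth: the leak
print(growth_per_call(clean_predict, batch) < 1000)  # flat: no leak
```

If predict behaved as expected, repeated calls would look like clean_predict here: per-call growth near zero after the warm-up call.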

5. Additional context

Here's a sample memory-profiler dump from my own version of the notebook (not identical to the one linked above):

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   142   1549.4 MiB   1549.4 MiB           1       @profile
   143                                             def train_step_fn(image_tensors,
   144                                                                 groundtruth_boxes_list,
   145                                                                 groundtruth_classes_list):
   146                                                 """A single training iteration.
   147                                         
   148                                                 Args:
   149                                                 image_tensors: A list of [1, height, width, 3] Tensor of type tf.float32.
   150                                                     Note that the height and width can vary across images, as they are
   151                                                     reshaped within this function to be 320x320.
   152                                                 groundtruth_boxes_list: A list of Tensors of shape [N_i, 4] with type
   153                                                     tf.float32 representing groundtruth boxes for each image in the batch.
   154                                                 groundtruth_classes_list: A list of Tensors of shape [N_i, num_classes]
   155                                                     with type tf.float32 representing groundtruth boxes for each image in
   156                                                     the batch.
   157                                         
   158                                                 Returns:
   159                                                 A scalar tensor representing the total loss for the input batch.
   160                                                 """
   161   1549.4 MiB      0.0 MiB           1           shapes = tf.constant(len(image_tensors) * [[320, 320, 3]], dtype=tf.int32)
   162   1549.4 MiB      0.0 MiB           1           model.provide_groundtruth(
   163   1549.4 MiB      0.0 MiB           1               groundtruth_boxes_list=groundtruth_boxes_list,
   164   1549.4 MiB      0.0 MiB           1               groundtruth_classes_list=groundtruth_classes_list)
   165                                                     # The images each have a pointless batch dimension of 1, so do a reshape
   166                                                     # to remove this from the result of concatenation
   167   1549.4 MiB      0.0 MiB           1           concatted = tf.reshape(tf.concat(image_tensors, axis=0), (len(image_tensors), 320, 320, 3))
   168   1737.8 MiB    188.4 MiB           1           prediction_dict = model.predict(concatted, shapes)
   169   1737.8 MiB      0.0 MiB           1           losses_dict = model.loss(prediction_dict, shapes)
   170   1737.8 MiB      0.0 MiB           1           total_loss = losses_dict['Loss/localization_loss'] + losses_dict['Loss/classification_loss']
   180   1737.8 MiB      0.0 MiB           1           return total_loss

6. System information

Dockerfile tensorflow/tensorflow:2.4.1-gpu

dansitu commented 3 years ago

Any updates on this one?

dansitu commented 3 years ago

@ymodak/@pkulzc are there any updates on this issue? Happy to help with debugging.

dansitu commented 3 years ago

I tried upgrading to TF 2.6, but it did not help with the memory leak.
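One variable worth isolating when debugging this (a sketch only, with a toy quadratic loss standing in for model.predict/model.loss, since I can't confirm it affects this particular leak) is whether compiling the step with tf.function changes the per-call memory growth, as eager and compiled execution manage intermediate tensors very differently:

```python
import tensorflow as tf

# Toy stand-in for the real train step: a quadratic "loss" replaces the
# model.predict/model.loss calls so the tf.function wrapper can be tried
# in isolation.
@tf.function
def compiled_step(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.reduce_sum(x * x)   # sum of squares as a dummy loss
    grad = tape.gradient(loss, x)     # gradient is 2*x
    return loss, grad

loss, grad = compiled_step(tf.constant([1.0, 2.0, 3.0]))
print(float(loss))  # 14.0
```

If the growth disappears when the real step is wrapped this way, that would point at eager-mode tensor retention rather than the SSDMetaArch graph itself; the profiler dump above alone can't distinguish the two.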

janjongboom commented 3 years ago

@ymodak Any updates here?