hehefan / Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning


memory leak issue #6

Open dinggd opened 6 years ago

dinggd commented 6 years ago

While running the clustering and fine-tuning stages, the GPU seems to have some kind of memory leak after a few epochs, say 15.

And the log looks like this:

2017-11-28 04:34:09.669978: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 75 Chunks of size 37632 totalling 2.69MiB
2017-11-28 04:34:09.669990: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 44800 totalling 43.8KiB
2017-11-28 04:34:09.670002: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 17 Chunks of size 60672 totalling 1007.2KiB
2017-11-28 04:34:09.670014: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 544 Chunks of size 65536 totalling 34.00MiB
2017-11-28 04:34:09.670026: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 29 Chunks of size 81920 totalling 2.27MiB
2017-11-28 04:34:09.670038: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 119040 totalling 116.2KiB
2017-11-28 04:34:09.670049: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 122368 totalling 119.5KiB
2017-11-28 04:34:09.670061: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 121 Chunks of size 131072 totalling 15.12MiB
2017-11-28 04:34:09.670073: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 226 Chunks of size 147456 totalling 31.78MiB
2017-11-28 04:34:09.670085: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 163840 totalling 160.0KiB
2017-11-28 04:34:09.670097: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 17 Chunks of size 178688 totalling 2.90MiB
2017-11-28 04:34:09.670109: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 641 Chunks of size 262144 totalling 160.25MiB
2017-11-28 04:34:09.670121: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 263424 totalling 257.2KiB
2017-11-28 04:34:09.670132: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 8 Chunks of size 307712 totalling 2.35MiB
2017-11-28 04:34:09.670144: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 327680 totalling 320.0KiB
2017-11-28 04:34:09.670155: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 212 Chunks of size 524288 totalling 106.00MiB
2017-11-28 04:34:09.670167: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 344 Chunks of size 589824 totalling 193.50MiB
2017-11-28 04:34:09.670179: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 671744 totalling 656.0KiB
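To check whether this is steady growth rather than a one-off spike, it may help to log GPU memory at the end of every epoch. A minimal sketch using nvidia-smi follows; the callback below is hypothetical, not code from this repo:

import subprocess
from keras.callbacks import Callback

class GpuMemLogger(Callback):
    # Hypothetical helper: print driver-reported GPU memory after each epoch.
    def on_epoch_end(self, epoch, logs=None):
        used = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader'])
        print('epoch %d: GPU memory used = %s' % (epoch, used.decode().strip()))

# usage: net.fit_generator(..., callbacks=[GpuMemLogger()])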

Just wondering if you have ever run into this; it reappears every time I train.

Hope you can help fix this problem.
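For reference, a minimal sketch of a session setup that at least makes the allocation pattern observable, assuming the TF 1.x Session API with the standalone Keras backend (allow_growth only defers allocation; it is not a fix for a genuine leak):

import tensorflow as tf
from keras import backend as K

# Allocate GPU memory on demand instead of reserving the whole card up front,
# so any per-epoch growth is visible from nvidia-smi outside the process.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))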

Thanks,

ShiinaMitsuki commented 6 years ago

While running PUL\semi-supervised.py, my GPU (Titan X, 12 GB) ran out of memory during the 11th training iteration.

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:702] Stats: 
Limit:                 10260823573
InUse:                  9477968384
MaxInUse:               9478289152
NumAllocs:               684278185
MaxAllocSize:           2147483648

W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:274] ****************************************************************************************************
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:275] Ran out of memory trying to allocate 1.53MiB.  See logs for memory state.
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[16,512,7,7]
Traceback (most recent call last):
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1022, in _do_call
    return fn(*args)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1004, in _run_fn
    status, run_metadata)
  File "C:\Python35\Lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,512,7,7]
     [[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
     [[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/github/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning/PUL/semi-supervised.py", line 131, in <module>
    net.fit_generator(datagen.flow(images, labels, batch_size=BATCH_SIZE), steps_per_epoch=len(images)/BATCH_SIZE+1, epochs=NUM_EPOCH)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 767, in run
    run_metadata_ptr)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,512,7,7]
     [[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
     [[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'res5c_branch2b_21/convolution', defined at:
  File "E:/github/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning/PUL/semi-supervised.py", line 122, in <module>
    init_model = load_model('checkpoint/0.ckpt')
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\models.py", line 240, in load_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\models.py", line 304, in model_from_config
    return layer_module.deserialize(config, custom_objects=custom_objects)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\layers\__init__.py", line 54, in deserialize
    printable_module_name='layer')
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\utils\generic_utils.py", line 140, in deserialize_keras_object
    list(custom_objects.items())))
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 2416, in from_config
    process_layer(layer_data)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 2411, in process_layer
    layer(input_tensors[0], **kwargs)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 585, in __call__
    output = self.call(inputs, **kwargs)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\layers\convolutional.py", line 164, in call
    dilation_rate=self.dilation_rate)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 3095, in conv2d
    data_format='NHWC')
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 639, in convolution
    op=op)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 308, in with_space_to_batch
    return op(input, num_spatial_dims, padding)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 631, in op
    name=name)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 129, in _non_atrous_convolution
    name=name)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 396, in conv2d
    data_format=data_format, name=name)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,512,7,7]
     [[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
     [[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Process finished with exit code 1
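One detail in the trace stands out: the failing op is named res5c_branch2b_21, and it was defined by load_model('checkpoint/0.ckpt') at semi-supervised.py line 122. The _21 suffix suggests the network has been rebuilt into the same default TensorFlow graph on every clustering iteration, so the graph (and its GPU memory) grows each round until allocation fails. A minimal sketch of the usual workaround, assuming Keras 2.x on the TF 1.x backend; the loop and names below are illustrative, not the repo's actual code:

from keras import backend as K
from keras.models import load_model

num_iterations = 20  # hypothetical number of clustering/fine-tuning rounds

for step in range(num_iterations):
    # Discard the previous graph and session before reloading, so layers
    # do not accumulate as res5c_branch2b_2, _3, ..., _21 in one default graph.
    K.clear_session()
    net = load_model('checkpoint/0.ckpt')
    # ... extract features, cluster into pseudo-labels, then
    # net.fit_generator(...) and save the next checkpoint ...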

XrsSjtu commented 6 years ago

Hi, I also see the problem above. Did the author fix the problem?