Open dinggd opened 6 years ago
While running PUL\semi-supervised.py, my GPU (Titan X, 12 GB) ran out of memory during the 11th training iteration.
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:702] Stats:
Limit: 10260823573
InUse: 9477968384
MaxInUse: 9478289152
NumAllocs: 684278185
MaxAllocSize: 2147483648
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:274] ****************************************************************************************************
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:275] Ran out of memory trying to allocate 1.53MiB. See logs for memory state.
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[16,512,7,7]
Traceback (most recent call last):
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1022, in _do_call
return fn(*args)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1004, in _run_fn
status, run_metadata)
File "C:\Python35\Lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,512,7,7]
[[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
[[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/github/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning/PUL/semi-supervised.py", line 131, in <module>
net.fit_generator(datagen.flow(images, labels, batch_size=BATCH_SIZE), steps_per_epoch=len(images)/BATCH_SIZE+1, epochs=NUM_EPOCH)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\training.py", line 1890, in fit_generator
class_weight=class_weight)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 767, in run
run_metadata_ptr)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\client\session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,512,7,7]
[[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
[[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'res5c_branch2b_21/convolution', defined at:
File "E:/github/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning/PUL/semi-supervised.py", line 122, in <module>
init_model = load_model('checkpoint/0.ckpt')
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\models.py", line 240, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\models.py", line 304, in model_from_config
return layer_module.deserialize(config, custom_objects=custom_objects)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\layers\__init__.py", line 54, in deserialize
printable_module_name='layer')
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\utils\generic_utils.py", line 140, in deserialize_keras_object
list(custom_objects.items())))
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 2416, in from_config
process_layer(layer_data)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 2411, in process_layer
layer(input_tensors[0], **kwargs)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\engine\topology.py", line 585, in __call__
output = self.call(inputs, **kwargs)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\layers\convolutional.py", line 164, in call
dilation_rate=self.dilation_rate)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 3095, in conv2d
data_format='NHWC')
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 639, in convolution
op=op)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 308, in with_space_to_batch
return op(input, num_spatial_dims, padding)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 631, in op
name=name)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 129, in _non_atrous_convolution
name=name)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 396, in conv2d
data_format=data_format, name=name)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 763, in apply_op
op_def=op_def)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "E:\github\Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1264, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,512,7,7]
[[Node: res5c_branch2b_21/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](activation_47_21/Relu, res5c_branch2b_21/kernel/read)]]
[[Node: bn5c_branch2b_21/moments/sufficient_statistics/Gather/_79053 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12358_bn5c_branch2b_21/moments/sufficient_statistics/Gather", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Process finished with exit code 1
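A few numbers in the log above can be checked by hand. This is a minimal sketch, assuming the failing tensor is float32 (the node reports T=DT_FLOAT, 4 bytes per element) and assuming the trailing _21 on the node name is the counter TensorFlow appends when a name is created again in the same graph. The second assumption is not confirmed anywhere in this thread. The helper names tensor_mib and split_suffix are hypothetical, made up for illustration.

```python
import re

BYTES_PER_FLOAT32 = 4  # assumption: DT_FLOAT is 32-bit


def tensor_mib(shape, bytes_per_elem=BYTES_PER_FLOAT32):
    """Size in MiB of a dense tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 2**20


def split_suffix(node_name):
    """Split 'res5c_branch2b_21/convolution' into ('res5c_branch2b', 21).

    Returns a suffix of 0 when the scope has no trailing _<digits>.
    Caveat: a name that legitimately ends in digits (e.g. 'activation_47')
    is indistinguishable from a suffixed one.
    """
    scope = node_name.split("/")[0]
    m = re.fullmatch(r"(.+)_(\d+)", scope)
    return (m.group(1), int(m.group(2))) if m else (scope, 0)


# Values copied from the log above.
limit_bytes = 10260823573
in_use_bytes = 9477968384

print(tensor_mib([16, 512, 7, 7]))                    # 1.53125
print(round(in_use_bytes / limit_bytes, 3))           # 0.924
print(split_suffix("res5c_branch2b_21/convolution"))  # ('res5c_branch2b', 21)
```

The 1.53 MiB request matches shape [16,512,7,7] exactly, so the failing allocation is one ordinary activation, not an unusually large buffer. The allocator was already at about 92% of its limit. If the suffix reading is right, the same convolution layer had been built many times in one graph by the time of the crash.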
Hi, I also see the problem above. Did the author fix the problem?
During the clustering and fine-tuning stages, the GPU seems to leak memory after a few epochs, say 15.
And the log looks like this:
Just wondering if you have ever run into this; it reappears every time I train.
Hope that you can help fix this minor problem.
Thanks,
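One common cause of memory growing from iteration to iteration with Keras on the TensorFlow backend is that every load_model call adds another copy of the network to the same default graph. keras.backend.clear_session() discards that graph. Whether this is what happens in semi-supervised.py is an assumption based only on the traceback (load_model at line 122 and the numbered node names), not something the author has confirmed. The sketch below shows the loop pattern with injected callables so it runs without TensorFlow installed. run_iterations, load, train, and clear are hypothetical names.

```python
def run_iterations(num_iterations, load, train, clear):
    """Run a load -> train loop, releasing backend state after every round.

    load():          builds or loads a fresh model
    train(model, i): trains for iteration i and returns a result
    clear():         frees whatever global state load() accumulated
    """
    results = []
    for i in range(num_iterations):
        model = load()
        try:
            results.append(train(model, i))
        finally:
            # Drop our reference first, then let the backend free its state,
            # even if training raised.
            del model
            clear()
    return results


if __name__ == "__main__":
    graph = []  # stand-in for the backend's global graph

    def fake_load():
        graph.append("model copy")
        return object()

    # With clear() each round, only one copy is ever alive during training.
    print(run_iterations(3, fake_load, lambda m, i: len(graph), graph.clear))
    # [1, 1, 1]
```

With Keras 2.x the real callables would be something like lambda: load_model('checkpoint/0.ckpt') for load and keras.backend.clear_session for clear. Anything that must survive an iteration (weights, extracted features) has to be saved to disk or copied to NumPy arrays before clearing, because models created before clear_session() are no longer usable afterwards.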