thatbrguy / Pedestrian-Detection

Pedestrian detection using the TensorFlow Object Detection API. Includes multi GPU parallel processing inference.

Which parameters to reduce to avoid ResourceExhaustedError #3

Closed MounirB closed 6 years ago

MounirB commented 6 years ago

Hello, I am trying to train the faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 model by launching the train.py script, but I get the following ResourceExhaustedError. Do you have any idea how to solve it? I tried changing many parameters in pipeline.config, but it doesn't change anything.

2018-10-10 14:54:05.313837: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.25GiB 2018-10-10 14:54:05.313845: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats: Limit: 1363345408 InUse: 1338755072 MaxInUse: 1350130944 NumAllocs: 3937 MaxAllocSize: 256131072

2018-10-10 14:54:05.313921: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **** 2018-10-10 14:54:05.313944: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc INFO:tensorflow:Error reported to Coordinator: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. Traceback (most recent call last): File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call return fn(*args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception yield File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run self.run_loop() File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop self._sv.global_step]) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run run_metadata_ptr) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run feed_dict_tensor, options, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last): File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call return fn(*args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 995, in managed_session yield sess File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train sess, train_op, global_step, train_step_kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step run_metadata=run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run run_metadata_ptr) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run feed_dict_tensor, options, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to 
RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D', defined at: File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in tf.app.run() File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 228, in train clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue]) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/deployment/model_deploy.py", line 193, in create_clones outputs = model_fn(*args, kwargs) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 165, in _create_losses prediction_dict = detection_model.predict(images) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 531, in predict image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 685, in _extract_rpn_feature_maps preprocessed_inputs, scope=self.first_stage_feature_extractor_scope) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 134, in extract_proposal_features return self._extract_proposal_features(preprocessed_inputs, scope) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/models/faster_rcnn_inception_resnet_v2_feature_extractor.py", line 112, in _extract_proposal_features align_feature_maps=True)) File 
"/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/nets/inception_resnet_v2.py", line 232, in inception_resnet_v2_base scope='Conv2d_1a_3x3') File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, *current_args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d conv_dims=2) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(args, current_args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution outputs = layer.apply(inputs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply return self.call(inputs, *args, kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call outputs = super(Layer, self).call(inputs, *args, *kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call outputs = self.call(inputs, args, kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call outputs = self._convolution_op(inputs, self.kernel) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call return self.conv_op(inp, filter) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call return self.call(inp, filter) File 
"/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call name=self.name) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d data_format=data_format, dilations=dilations, name=name) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func return func(*args, **kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op op_def=op_def) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in tf.app.run() File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 332, in train saver=saver) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train ignore_live_threads=ignore_live_threads) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/contextlib.py", line 99, in exit self.gen.throw(type, value, traceback) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1005, in managed_session self.stop(close_summary_writer=close_summary_writer) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop ignore_live_threads=ignore_live_threads) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join six.reraise(*self._exc_info_to_raise) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/six.py", line 693, in reraise raise value File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception yield File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run self.run_loop() File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop 
self._sv.global_step]) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run run_metadata_ptr) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run feed_dict_tensor, options, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1

thatbrguy commented 6 years ago

Changing parameters other than the batch size won't help much if you're using a pretrained model. faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 is a pretty large model. I would suggest using a smaller one such as FasterRCNN_ResNet50 or SSD_MobileNet.
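To put rough numbers on the first log, here is a minimal Python sketch that compares the allocator stats (`Limit` and `InUse`, copied from the bfc_allocator lines above) with the two tensors that failed to allocate. It assumes 4 bytes per element, since the log says "type float"; the helper names `tensor_bytes` and `mib` are made up for illustration and are not part of TensorFlow.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log means 32-bit floats


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


def mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20


# Values copied from the "Stats:" line of the first log.
limit = 1363345408
in_use = 1338755072
headroom = limit - in_use

# Shapes copied from the two OOM messages in the first log.
failed_shapes = [(1, 384, 72, 128), (4, 160, 42, 64)]

print(f"allocator limit: {mib(limit):.1f} MiB")
print(f"in use:          {mib(in_use):.1f} MiB")
print(f"headroom:        {mib(headroom):.1f} MiB")
for shape in failed_shapes:
    print(f"requested {shape}: {mib(tensor_bytes(shape)):.2f} MiB")
```

By these stats the allocator is capped at roughly 1.27 GiB and almost all of it is already in use. Both failed requests are nominally smaller than the remaining headroom, so one plausible reading is that the free memory is fragmented rather than available as one contiguous chunk. Either way, the device is essentially full for a model of this size.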

MounirB commented 6 years ago

The same problem occurs again, even with SSD_MobileNet :/ I have a P400 GPU.
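For reference, a small Python sketch of how the tensor that fails in the log below scales with batch size. It assumes that the leading 24 in shape[24,128,75,75] is the batch dimension (the op reports data_format="NCHW") and that each float takes 4 bytes. The candidate batch sizes are arbitrary examples, and `activation_bytes` is an illustrative helper, not part of the Object Detection API.

```python
def activation_bytes(batch_size, channels=128, height=75, width=75,
                     bytes_per_element=4):
    """Bytes needed for one NCHW float activation of the reported shape."""
    return batch_size * channels * height * width * bytes_per_element


# Compare the batch size seen in the log (24) with a few smaller ones.
for batch_size in (24, 8, 4, 1):
    size = activation_bytes(batch_size)
    print(f"batch_size={batch_size:2d}: {size / 2**20:6.2f} MiB for this one activation")
```

This covers a single activation only. Training keeps many such tensors alive for the backward pass, so total usage should grow roughly in proportion to the batch size, which is why lowering `batch_size` under `train_config` in pipeline.config is the first setting to try.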

2018-10-18 16:09:49.781181: W tensorflow/core/common_runtime/bfc_allocator.cc:279] *____*****__**xxxxx
2018-10-18 16:09:49.781198: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at: File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in tf.app.run() File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main worker_job_name, is_chief, FLAGS.train_dir) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue]) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones outputs = model_fn(*args, kwargs) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses prediction_dict = detection_model.predict(images) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict preprocessed_inputs) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features scope=scope) File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base scope=end_point) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, *current_args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d conv_dims=2) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(args, current_args) File 
"/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution outputs = layer.apply(inputs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply return self.call(inputs, *args, kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call outputs = super(Layer, self).call(inputs, *args, *kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call outputs = self.call(inputs, args, kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call outputs = self._convolution_op(inputs, self.kernel) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call return self.conv_op(inp, filter) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call return self.call(inp, filter) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call name=self.name) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d data_format=data_format, dilations=dilations, name=name) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func return func(*args, **kwargs) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in 
create_op op_def=op_def) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last): File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call return fn(*args) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 332, in train saver=saver)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train sess, train_op, global_step, train_step_kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step run_metadata=run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features scope=scope)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base scope=end_point)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply return self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

thatbrguy commented 6 years ago

Oh, alright. You can try training them on Google Colab instead. Faster R-CNN + ResNet-50 (and other models with a similar parameter count) will fit over there.
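To see how the tensor in the error above relates to the question in the title, here is a rough sketch of the arithmetic. It assumes that "type float" means 32-bit floats (4 bytes each) and that the leading 24 in shape[24,128,75,75] is the batch dimension, which matches the data_format="NCHW" reported for that node. The helper name `tensor_bytes` is made up for this illustration and is not part of TensorFlow or the Object Detection API.

```python
from math import prod

# Assumption: "type float" in the OOM message is float32, i.e. 4 bytes per element.
BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory needed for one dense tensor of the given shape (hypothetical helper)."""
    return prod(shape) * bytes_per_element


# Shape copied from the error message: shape[24,128,75,75], NCHW layout.
oom_shape = (24, 128, 75, 75)
size = tensor_bytes(oom_shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")

# Assumption: the leading dimension is the batch size, so the cost of this
# activation scales linearly when the batch size is reduced.
for batch_size in (24, 8, 4, 1):
    scaled = tensor_bytes((batch_size,) + oom_shape[1:])
    print(f"batch_size={batch_size:>2}: {scaled / 2**20:.1f} MiB")
```

That single activation comes to 69,120,000 bytes, about 65.9 MiB, and it is only one of many tensors alive during a training step. For comparison, the allocator in the first log of this thread reported Limit: 1363345408 bytes (roughly 1.27 GiB), if the same GPU was used for this run. In the Object Detection API the batch size is normally set by `batch_size` under `train_config` in pipeline.config, so lowering it is likely the first thing to try before moving to a larger GPU.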