google / automl

Google Brain AutoML
Apache License 2.0

Train error #39

Closed WonTaeYeon closed 4 years ago

WonTaeYeon commented 4 years ago

Hi, thank you for your hard work and open sourcing the code! I tried training, but the following error occurred.

Command: python main.py --training_file_pattern=tmp/train/train* --model_name=efficientdet-d0 --model_dir=train_model --hparams="use_bfloat16=false" --use_tpu=False

2020-03-25 11:18:29.528543: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ****
2020-03-25 11:18:29.528632: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cwise_ops_common.h:263 : Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:tensorflow:Reraising captured error
W0325 11:18:30.869304 140049157998336 error_handling.py:142] Reraising captured error
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 385, in <module>
    tf.app.run(main)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 246, in main
    FLAGS.train_batch_size))
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
    saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1497, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 778, in run
    run_metadata=run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1283, in run
    run_metadata=run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in run
    raise six.reraise(*original_exc_info)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1369, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1442, in run
    run_metadata=run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1200, in run
    return self._sess.run(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Original stack trace for 'efficientnet-b0/model/blocks_13/Sigmoid':
  File "main.py", line 385, in <module>
    tf.app.run(main)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 246, in main
    FLAGS.train_batch_size))
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 567, in efficientdet_model_fn
    model=efficientdet_arch.efficientdet)
  File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 399, in _model_fn
    cls_outputs, box_outputs = _model_outputs()
  File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 389, in _model_outputs
    return model(features, config=hparams_config.Config(params))
  File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 552, in efficientdet
    features = build_backbone(features, config)
  File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 328, in build_backbone
    override_params=override_params)
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_builder.py", line 324, in build_model_base
    features = model(images, training=training, features_only=True)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 778, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 643, in call
    for idx, block in enumerate(self._blocks):
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 339, in for_stmt
    return _py_for_stmt(iter_, extra_test, body, get_state, set_state, init_vars)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 350, in _py_for_stmt
    state = body(target, state)
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 662, in call
    outputs = block.call(
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 363, in call
    if self._block_args.fused_conv:
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
    return body() if cond else orelse()
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 369, in call
    if self._block_args.expand_ratio != 1:
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
    return _py_if_stmt(cond, body, orelse)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
    return body() if cond else orelse()
  File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 370, in call
    x = self._relu_fn(self._bn0(expand_conv_fn(x), training=training))
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 256, in __call__
    return self._d(self._f, a, k)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 212, in decorated
    return _graph_mode_decorator(wrapped, args, kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 316, in _graph_mode_decorator
    result, grad_fn = f(*args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py", line 534, in swish
    return features * math_ops.sigmoid(features), grad
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 3154, in sigmoid
    return gen_math_ops.sigmoid(x, name=name)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 8750, in sigmoid
    "Sigmoid", x=x, name=name)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
    op_def=op_def)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
    self._traceback = tf_stack.extract_stack()

CraigWang1 commented 4 years ago

Maybe try reducing the batch size with --train_batch_size, using a value such as 32, 16, 8, 4, or 2, e.g. --train_batch_size=32
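To see why a smaller batch helps, here is a rough sketch that reads the tensor shape out of the OOM message above and works out its size. It assumes the leading dimension (64) is the batch size and that "type float" means 4 bytes per element; `tensor_bytes`, `OOM_PATTERN`, and `DTYPE_BYTES` are names made up for this example and are not part of the repo. The tensor in the log comes to 72 MiB, and it is only one of many activations held during training.

```python
import re

# Bytes per element for dtype names as they appear in the OOM message.
# Only a few common ones are listed; this table is an assumption of the sketch.
DTYPE_BYTES = {"float": 4, "half": 2, "bfloat16": 2, "double": 8}

# Matches e.g. "shape[64,1152,16,16] and type float"
OOM_PATTERN = re.compile(r"shape\[([\d,]+)\] and type (\w+)")


def tensor_bytes(oom_message):
    """Return (shape, size in bytes) of the tensor named in an OOM message."""
    match = OOM_PATTERN.search(oom_message)
    if match is None:
        raise ValueError("no tensor shape found in message")
    shape = [int(dim) for dim in match.group(1).split(",")]
    count = 1
    for dim in shape:
        count *= dim
    return shape, count * DTYPE_BYTES[match.group(2)]


message = ("OOM when allocating tensor with shape[64,1152,16,16] "
           "and type float on /job:localhost/replica:0/task:0/device:GPU:0")
shape, size = tensor_bytes(message)
print(f"batch {shape[0]}: {size / 2**20:.2f} MiB")

# Assuming the first dimension is the batch size, this tensor shrinks
# linearly as the batch size is reduced.
per_example = size // shape[0]
for batch_size in (32, 16, 8, 4, 2):
    print(f"batch {batch_size}: {batch_size * per_example / 2**20:.2f} MiB")
```

Halving --train_batch_size roughly halves each activation of this kind, which is why stepping down through 32, 16, 8 and so on usually gets past the error on a single GPU.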

WonTaeYeon commented 4 years ago

Solved, Thx

qtw1998 commented 4 years ago

Maybe try reducing the batch size with --train_batch_size 32, 16, 8, 4, or 2 eg. --train_batch_size 32

That helped, thanks!

qtw1998 commented 4 years ago

Maybe try reducing the batch size with --train_batch_size 32, 16, 8, 4, or 2 eg. --train_batch_size 32

But I am using 8 × 2080 Ti GPUs with a batch size of 4 and still get the same OOM problem.

mingxingtan commented 4 years ago

This issue is similar to https://github.com/google/automl/issues/85. I am going to close this one and keep that open. Feel free to add your comments to #85 if you still have problems.