rishizek / tensorflow-deeplab-v3-plus

DeepLabv3+ built in TensorFlow
MIT License
833 stars 307 forks

I can't run train.py #39

Open sori0528 opened 5 years ago

sori0528 commented 5 years ago

The following problems occur when I run the train.py file.

  File "train.py", line 285, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 267, in main
    hooks=train_hooks,
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/algolab/HDD/LHHS/DeepLab_v3_plus/tensorflow-deeplab-v3-plus-master/deeplab_model.py", line 172, in deeplabv3_plus_model_fn
    logits = network(features, mode == tf.estimator.ModeKeys.TRAIN)
  File "/home/algolab/HDD/LHHS/DeepLab_v3_plus/tensorflow-deeplab-v3-plus-master/deeplab_model.py", line 129, in model
    {v.name.split(':')[0]: v for v in variables_to_restore})
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 187, in init_from_checkpoint
    _init_from_checkpoint, ckpt_dir_or_file, assignment_map)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/distribute.py", line 1053, in merge_call
    return self._merge_call(merge_fn, *args, **kwargs)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/distribute.py", line 1061, in _merge_call
    return merge_fn(self._distribution_strategy, *args, **kwargs)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 194, in _init_from_checkpoint
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 326, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/home/algolab/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for PRE_TRAINED_MODEL

I have resnet_v2_101.ckpt in the following path. Why does this happen?

parser.add_argument('--pre_trained_model', type=str, default='./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt', help='Path to the pre-trained model checkpoint.')
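The error message names PRE_TRAINED_MODEL rather than the default path, which suggests that a literal placeholder was passed on the command line instead of a real path. A minimal sketch of how argparse treats the default, reusing the add_argument call above (the argument lists given to parse_known_args are illustrative assumptions, not the actual command that was run):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pre_trained_model', type=str,
                    default='./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt',
                    help='Path to the pre-trained model checkpoint.')

# No flag on the command line: the default path is used.
default_flags, _ = parser.parse_known_args([])
print(default_flags.pre_trained_model)

# Assumed scenario: a placeholder copied verbatim from a usage example.
# The default is overridden, so TensorFlow would later look for a file
# that is literally named PRE_TRAINED_MODEL.
placeholder_flags, _ = parser.parse_known_args(
    ['--pre_trained_model', 'PRE_TRAINED_MODEL'])
print(placeholder_flags.pre_trained_model)
```

The default only applies when the flag is omitted entirely, so having the checkpoint at the default location does not help if another value is passed explicitly.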

juanmanuelrq commented 5 years ago

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 285, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 267, in main
    hooks=train_hooks,
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimatorspec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,1024,33,33] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[{{node resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v2_101/block3/unit_1/bottleneck_v2/conv2/Relu, resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[{{node gradients/DynamicPartition_grad/range/_7773}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_35384_gradients/DynamicPartition_grad/range", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/Conv2D', defined at:
  File "train.py", line 285, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 267, in main
    hooks=train_hooks,
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/media/inn/Files/2019/20190215_ea/dgLand/tensorflow-deeplab-v3-plus/deeplab_model.py", line 172, in deeplabv3_plus_model_fn
    logits = network(features, mode == tf.estimator.ModeKeys.TRAIN)
  File "/media/inn/Files/2019/20190215_ea/dgLand/tensorflow-deeplab-v3-plus/deeplab_model.py", line 123, in model
    output_stride=output_stride)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_v2.py", line 313, in resnet_v2_101
    scope=scope)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_v2.py", line 216, in resnet_v2
    net = resnet_utils.stack_blocks_dense(net, blocks, output_stride)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_utils.py", line 211, in stack_blocks_dense
    net = block.unit_fn(net, rate=rate, **dict(unit, stride=1))
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_v2.py", line 123, in bottleneck
    scope='conv3')
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
    conv_dims=2)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
    outputs = layer.apply(inputs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 828, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 364, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 769, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 869, in __call__
    return self.conv_op(inp, filter)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 521, in __call__
    return self.call(inp, filter)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 205, in __call__
    name=self.name)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/home/inn/anaconda3/envs/dgland/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10,1024,33,33] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[{{node resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v2_101/block3/unit_1/bottleneck_v2/conv2/Relu, resnet_v2_101/block3/unit_1/bottleneck_v2/conv3/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[{{node gradients/DynamicPartition_grad/range/_7773}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_35384_gradients/DynamicPartition_grad/range", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

northeastsquare commented 5 years ago

@juanmanuelrq Decrease the batch size; try 1.
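The tensor in the OOM message has shape [10,1024,33,33] in NCHW layout, so its first dimension is the batch size and its memory cost scales linearly with it. A rough back-of-the-envelope sketch (tensor_bytes is a hypothetical helper; it sizes one float32 tensor only, not the full training footprint, which holds many such activations plus gradients):

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (4 bytes per float32 element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the OOM message above (batch size 10).
failing = tensor_bytes([10, 1024, 33, 33])
# Same activation with the suggested batch size of 1.
single = tensor_bytes([1, 1024, 33, 33])

print('batch 10: %.1f MiB' % (failing / 2 ** 20))
print('batch  1: %.1f MiB' % (single / 2 ** 20))
```

Assuming train.py exposes a --batch_size flag, the change would be along the lines of python train.py --batch_size 1; a value between 1 and 10 may also fit, depending on the GPU.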

northeastsquare commented 5 years ago

@sori0528 Specify pre_trained_model explicitly on the command line:

python train.py --pre_trained_model ./ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt

Also add tf.logging.info(pre_trained_model) before the call to tf.train.init_from_checkpoint, so you can see which path is actually being used.
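The same check can be made before TensorFlow is involved at all. A minimal sketch using only the standard library, assuming the checkpoint is addressed either as a single file or as a filename prefix; describe_checkpoint_path and check_pre_trained_model are hypothetical helpers, not functions from this repository:

```python
import glob
import os


def describe_checkpoint_path(path):
    """Return (absolute_path, files_matching_that_prefix)."""
    abs_path = os.path.abspath(path)
    # A checkpoint may be one file or several files sharing a prefix,
    # so match everything that starts with the given path.
    matches = sorted(glob.glob(glob.escape(abs_path) + '*'))
    return abs_path, matches


def check_pre_trained_model(path):
    """Raise early, with the resolved path, if nothing matches."""
    abs_path, matches = describe_checkpoint_path(path)
    if not matches:
        raise FileNotFoundError(
            'No checkpoint files match %r (resolved from %r, cwd=%r)'
            % (abs_path, path, os.getcwd()))
    return matches


try:
    print(check_pre_trained_model(
        './ini_checkpoints/resnet_v2_101/resnet_v2_101.ckpt'))
except FileNotFoundError as err:
    print(err)
```

Printing the resolved absolute path and the working directory makes two common causes visible: a placeholder such as PRE_TRAINED_MODEL passed as the path, or a relative path resolved from a directory other than the repository root.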

prativadas commented 4 years ago

@northeastsquare Hi, I am trying to run this code and I get a SystemExit error while running create_pascal_tf_record.py:

---------------------------------------------------------------------------
SystemExit                                Traceback (most recent call last)
in <module>
     38 tf.logging.set_verbosity(tf.logging.INFO)
     39 FLAGS, unparsed = parser.parse_known_args()
---> 40 tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

I guess the error occurs at that same line, in the call to the main function, when running train.py as well. Can you please tell me what can be done?

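The first traceback in this thread shows tf.app.run ending in _sys.exit(main(argv)), so it raises SystemExit even when main returns normally, and a notebook reports that as an exception. A minimal sketch of this behaviour, in which run is a hypothetical stand-in for tf.app.run rather than TensorFlow's actual code:

```python
import sys


def run(main, argv):
    """Stand-in for tf.app.run: exit the interpreter with main's result."""
    sys.exit(main(argv))


def main(argv):
    # Pretend the real work (for example writing TFRecord files) succeeded.
    return None


try:
    run(main, ['create_pascal_tf_record.py'])
except SystemExit as exc:
    # In a plain script this would simply end the process; in a notebook
    # the exception surfaces in the cell output instead.
    exit_code = exc.code
    print('SystemExit raised with code:', exit_code)
```

If the exit code is None or 0, the script itself has likely completed; a non-zero code, or another error printed above the SystemExit, would point to a real failure.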
Curry-Christopher commented 4 years ago

decrease batchsize, try 1 @juanmanuelrq

Great! I had the same problem. Solved!