smallcorgi / Faster-RCNN_TF

Faster-RCNN in Tensorflow
MIT License

Ran out of memory #51

Open zuowang opened 7 years ago

zuowang commented 7 years ago

W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 392.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:965] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:390] Executor failed to create kernel. Internal: Dst tensor is not initialized.
[[Node: zeros_24 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [25088,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
  File "./tools/train_net.py", line 96, in <module>
    max_iters=args.max_iters)
  File "/home/deepinsight/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 222, in train_net
    sw.train_model(sess, max_iters)
  File "/home/deepinsight/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 134, in train_model
    sess.run(tf.initialize_all_variables())
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: zeros_24 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [25088,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'zeros_24', defined at:
  File "./tools/train_net.py", line 96, in <module>
    max_iters=args.max_iters)
  File "/home/deepinsight/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 222, in train_net
    sw.train_model(sess, max_iters)
  File "/home/deepinsight/Faster-RCNN_TF/tools/../lib/fast_rcnn/train.py", line 131, in train_model
    train_op = tf.train.MomentumOptimizer(lr, momentum).minimize(loss, global_step=global_step)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 279, in minimize
    name=name)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 393, in apply_gradients
    self._create_slots(var_list)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/momentum.py", line 51, in _create_slots
    self._zeros_slot(v, "momentum", self._name)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 593, in _zeros_slot
    named_slots[var] = slot_creator.create_zeros_slot(var, op_name)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 106, in create_zeros_slot
    val = array_ops.zeros(primary.get_shape().as_list(), dtype=dtype)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1362, in zeros
    output = constant(zero, shape=shape, dtype=dtype, name=name)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 169, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/deepinsight/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized. [[Node: zeros_24 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [25088,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
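
As a rough sanity check on the numbers in this log (a sketch, not part of the repository): the failing node is a DT_FLOAT tensor of shape [25088, 4096], and the traceback shows it being created as the momentum slot for a variable of that same shape. Assuming 4 bytes per DT_FLOAT element, its size works out to the 392 MiB the allocator reports. The helper name `tensor_mib` is hypothetical.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Return the size in MiB of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / float(1024 ** 2)


# Shape taken from the failing node zeros_24 in the log above.
slot_mib = tensor_mib([25088, 4096])
print("momentum slot: %.2f MiB" % slot_mib)

# The momentum slot mirrors the shape of its variable, so the pair
# takes twice that amount before any activations or gradients.
print("variable + momentum slot: %.2f MiB" % (2 * slot_mib))
```

So the failed allocation is not an unusually large request in itself; it is one more 392 MiB block on a GPU that was presumably already close to full.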

lilhope commented 7 years ago

I'm not sure, but it may be a bug in lib/fast_rcnn/train.py, lines 95 and 96, where tf.group was used; see this answer. I think using tf.dynamic_partition will work, but you would need to modify anchor_target_layer to generate a mask to feed to tf.dynamic_partition; see the TensorFlow API documentation for more detail.
PS: when running the code, it gives a warning: "UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory." That usually happens when using tf.group().
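
To illustrate the masking idea, here is a pure-Python sketch of what tf.dynamic_partition(data, partitions, num_partitions) computes: row i of data is routed to output list partitions[i]. It does not use TensorFlow, and the label convention (-1 meaning "ignore this anchor") and the sample values are assumptions made for illustration, not taken from anchor_target_layer.

```python
def dynamic_partition(data, partitions, num_partitions):
    """Pure-Python stand-in: route data[i] to output list partitions[i]."""
    outputs = [[] for _ in range(num_partitions)]
    for row, index in zip(data, partitions):
        outputs[index].append(row)
    return outputs


# Hypothetical anchor labels: 1 = object, 0 = background, -1 = ignore.
labels = [1, -1, 0, -1, 1]
scores = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]

# Mask is 1 for anchors that should contribute to the loss, 0 otherwise.
mask = [0 if label == -1 else 1 for label in labels]

# Partition 0 holds the ignored rows, partition 1 the rows to keep.
ignored, kept = dynamic_partition(scores, mask, 2)
kept_labels = dynamic_partition(labels, mask, 2)[1]

print(kept)         # [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
print(kept_labels)  # [1, 0, 1]
```

In the real graph the mask would be a tensor produced alongside the labels, and only the "kept" partition would be passed on to the loss.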

odedyec commented 7 years ago

I had the same issue and managed to fix it by adding a GPU flag in demo.py. Change:
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
To:
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

wmmxk commented 7 years ago

When I train a model from scratch by running "./experiments/scripts/faster_rcnn_end2end.sh gpu 0 VGG16 pascal_voc", I run into a similar error. Does anyone know how to fix the train.py file?

odedyec commented 7 years ago

I too could not get the training to work without running out of resources.