hanzhanggit / StackGAN

MIT License

I have an ERROR with STAGE II training. #15

Open seongkyun opened 7 years ago

seongkyun commented 7 years ago

I ran `$ python run_exp_stage1.py --cfg stageI/cfg/birds.yml --gpu 1`

and it printed the following:

```
.............................
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 117254912 totalling 111.82MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 10 Chunks of size 134217728 totalling 1.25GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 214433792 totalling 204.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 268435456 totalling 768.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 276824064 totalling 264.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 318767104 totalling 304.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 5.20GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:        5632950272
InUse:        5579675904
MaxInUse:     5631752192
NumAllocs:    3795
MaxAllocSize: 1478306560
```

```
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 32.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[64,32,32,128]
Traceback (most recent call last):
  File "run_exp_stage2.py", line 71, in <module>
    algo.train()
  File "/home/han/StackGAN/stageII/trainer.py", line 506, in train
    log_vars, sess)
  File "/home/han/StackGAN/stageII/trainer.py", line 447, in train_one_step
    ret_list = sess.run(feed_out_d, feed_dict)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,1024,4,4]
	 [[Node: custom_conv2d_5_3/custom_conv2d/custom_conv2d_5/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](apply_5_3/apply/Maximum, hr_d_net/custom_conv2d_5/custom_conv2d_5/w/read)]]
	 [[Node: Adam_2/update/_184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_19689_Adam_2/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```

```
Caused by op u'custom_conv2d_5_3/custom_conv2d/custom_conv2d_5/Conv2D', defined at:
  File "run_exp_stage2.py", line 71, in <module>
    algo.train()
  File "/home/han/StackGAN/stageII/trainer.py", line 463, in train
    counter = self.build_model(sess)
  File "/home/han/StackGAN/stageII/trainer.py", line 380, in build_model
    self.init_opt()
  File "/home/han/StackGAN/stageII/trainer.py", line 142, in init_opt
    flag='hr')
  File "/home/han/StackGAN/stageII/trainer.py", line 178, in compute_losses
    self.model.hr_get_discriminator(images, embeddings)
  File "/home/han/StackGAN/stageII/model.py", line 314, in hr_get_discriminator
    x_code = self.hr_d_image_template.construct(input=x_var)  # s16 s16 df_dim8
  File "/home/han/anaconda2/lib/python2.7/site-packages/prettytensor/pretty_tensor_class.py", line 1248, in construct
    return self._construct(context)
  File "/home/han/anaconda2/lib/python2.7/site-packages/prettytensor/scopes.py", line 158, in __call__
    return self._call_func(*args, **kwargs)
  File "/home/han/anaconda2/lib/python2.7/site-packages/prettytensor/scopes.py", line 131, in _call_func
    return self._func(*args, **kwargs)
  File "/home/han/anaconda2/lib/python2.7/site-packages/prettytensor/pretty_tensor_class.py", line 1924, in _with_method_complete
    return input_layer._method_complete(func(*args, **kwargs))
  File "/home/han/StackGAN/misc/custom_ops.py", line 82, in __call__
    conv = tf.nn.conv2d(input_layer.tensor, w, strides=[1, d_h, d_w, 1], padding=padding)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
    data_format=data_format, name=name)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/han/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
```

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,1024,4,4]
	 [[Node: custom_conv2d_5_3/custom_conv2d/custom_conv2d_5/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](apply_5_3/apply/Maximum, hr_d_net/custom_conv2d_5/custom_conv2d_5/w/read)]]
	 [[Node: Adam_2/update/_184 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_19689_Adam_2/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
```

I have never seen an error like this before. Please help me :(

This was run on Ubuntu 16.04 with TensorFlow r1.0.1, CUDA 8.0, and cuDNN 5.1.
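For what it's worth, the allocation sizes in the log line up with a quick float32 footprint calculation (a sketch; it assumes 4 bytes per element and ignores allocator padding):

```python
def tensor_mib(shape, bytes_per_elem=4):
    """Approximate memory footprint of a float32 tensor, in MiB."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / float(2 ** 20)

print(tensor_mib([64, 32, 32, 128]))   # -> 32.0, the "32.00MiB" tensor in the warning
print(tensor_mib([2048, 1024, 4, 4]))  # -> 128.0, the allocation that finally failed
```

Since the log already shows ~5.2 GiB of a ~5.6 GiB limit in use, even a modest extra tensor like these pushes the allocator over the edge.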

SpadesQ commented 6 years ago

@seongkyun Your GPU may not have enough memory. Make sure you have at least 12GB.
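If a bigger GPU isn't available, one common workaround is to lower the training batch size in the stage config. A sketch, assuming a StackGAN-style YAML layout (the exact key names and defaults may differ in your checkout of `stageII/cfg/birds.yml`, so check the file first):

```yaml
# stageII/cfg/birds.yml (hypothetical excerpt)
TRAIN:
  BATCH_SIZE: 16   # try halving the default until the OOM goes away
```

Smaller batches reduce activation memory roughly proportionally, at the cost of noisier gradients and longer wall-clock training.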