tadax / srgan

SRGAN implementation with TensorFlow

Running train.py fails to allocate memory. #6

Open shensq0814 opened 7 years ago

shensq0814 commented 7 years ago

I tried to run train.py to train SRGAN, but the program terminates because there is not enough memory. My GPU is a GTX 860M with 2 GB of memory.

How much memory exactly does the program need? Is there any way to reduce the memory required? I tried changing the batch size, but it had no effect.

Thank you.

tadax commented 7 years ago

I think 2 GB is enough. I tried to limit memory usage with tf.ConfigProto and it ran (batch size = 8, memory consumption = 1833 MB).
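
For reference, a minimal sketch of limiting GPU memory through tf.ConfigProto in TF1 (the exact options used here are not shown in this thread):

import tensorflow as tf

# Sketch only: either let the allocator grow on demand or cap the usable fraction.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

x = tf.constant(1.0)
with tf.Session(config=config) as sess:
    print(sess.run(x))  # the training ops would be run with this session instead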

Are you using cuDNN v5.1?

shensq0814 commented 7 years ago

Yes, CUDA 8.0 with cuDNN 5.1. The available memory on my computer is about 1 3GB.

I notice that you use all of the VGG feature maps, which differs from the original paper. Could it be the reason why the model needs that much memory?

tadax commented 7 years ago

> The available memory on my computer is about 1 3GB.

1.3 GB?

> Could it be the reason why the model needs that much memory?

I think SRGAN needs a lot of memory because it builds the Generator (ResNet), the Discriminator, and VGG19.

As you said, it might help reduce memory usage. Modify inference_content_loss as follows:

def inference_content_loss(self, x, imitation):
    # Feed both the ground truth and the generated image through VGG19.
    _, x_phi = self.vgg.build_model(
        x, tf.constant(False), False)
    _, imitation_phi = self.vgg.build_model(
        imitation, tf.constant(False), True)  # reuse the VGG variables
    # Use only the last feature map (phi_{5,4}), as in the paper.
    content_loss = tf.nn.l2_loss(x_phi[4] - imitation_phi[4])
    return tf.reduce_mean(content_loss)
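
Since the original loss uses all of the VGG feature maps (as noted above), restricting it to phi_{5,4} shrinks the loss graph; presumably that is where any memory saving would come from.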
shensq0814 commented 7 years ago

I've set up the required environment on another computer with enough memory. However, I get another error when the first epoch finishes.

Caused by op 'generator/deconv1/conv2d_transpose', defined at:
  File "train.py", line 95, in <module>
    train()
  File "train.py", line 18, in train
    model = SRGAN(x, is_training, batch_size)
  File "/home/min/ssq/srgan/src/srgan.py", line 14, in __init__
    self.imitation = self.generator(self.downscaled, is_training, False)
  File "/home/min/ssq/srgan/src/srgan.py", line 25, in generator
    x, [3, 3, 64, 3], [self.batch_size, 24, 24, 64], 1)
  File "../utils/layer.py", line 43, in deconv_layer
    strides=[1, stride, stride, 1])
  File "/home/min/anaconda/envs/shen/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1104, in conv2d_transpose
    name=name)
  File "/home/min/anaconda/envs/shen/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 496, in conv2d_backprop_input
    data_format=data_format, name=name)
  File "/home/min/anaconda/envs/shen/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/min/anaconda/envs/shen/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/min/anaconda/envs/shen/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Conv2DSlowBackpropInput: input and out_backprop must have the same batch size

tadax commented 7 years ago

The error means the final batch is smaller than batch_size: the generator's deconvolution layers build their output shape from a fixed batch_size, so an incomplete last batch no longer matches it. Fix line 45 of src/train.py so that the partial batch is dropped:

Correct: n_iter = int(len(x_train) / batch_size)

Wrong: n_iter = int(np.ceil(len(x_train) / batch_size))
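
A hypothetical sketch of why the ceil version fails (names follow the thread; the actual src/train.py may differ):

import numpy as np

batch_size = 8
x_train = np.zeros((100, 96, 96, 3), dtype=np.float32)  # hypothetical 100-image training set

# ceil would give 13 iterations, with only 4 images in the last batch,
# which no longer matches the graph's fixed batch size of 8;
# floor gives 12 iterations, each a full batch.
n_iter = int(len(x_train) / batch_size)

for i in range(n_iter):
    batch = x_train[i * batch_size:(i + 1) * batch_size]
    assert batch.shape[0] == batch_size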

shensq0814 commented 7 years ago

The implementation of your generator seems different from the paper, where only the last two layers are upsampling layers (they recently changed them into sub-pixel CNN layers). You used deconv_layer in all of the residual blocks. Is that a mistake, or was it intentional?
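
For context, the sub-pixel (pixel-shuffle) upsampling the paper describes is typically written with tf.depth_to_space; a minimal TF1 sketch, not code from this repo:

import tensorflow as tf

def subpixel_upsample(x, scale=2):
    # Pixel shuffle: expand channels by scale**2, then rearrange depth into space.
    channels = x.get_shape().as_list()[-1]
    x = tf.layers.conv2d(x, channels * scale ** 2, 3, padding='same')
    return tf.depth_to_space(x, scale)

# e.g. a [batch, 24, 24, 64] tensor becomes [batch, 48, 48, 64]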

jzrita commented 6 years ago

Hi Tadax, yes, I have the same concern as @Doodleyard. Although the final generator network published in the CVPR paper differs from the arXiv version, your code matches neither of them. Would you mind giving us some hints? Thank you.