manumathewthomas / ImageDenoisingGAN

Image Denoising with Generative Adversarial Network
362 stars 104 forks

How to train the model? #4

Open TrinhQuocNguyen opened 6 years ago

TrinhQuocNguyen commented 6 years ago

Hello manumathewthomas, thank you for your code. I am trying to train the model from scratch but ran into the error below. Could you show me how to solve it?

Traceback (most recent call last):
  File "train.py", line 91, in <module>
    train()
  File "train.py", line 23, in train
    Dg = discriminator(Gz, reuse=True)
  File "/home/ubuntu/trinh/Edited_ImageDenoisingGAN /model.py", line 32, in discriminator
    conv1, conv1_weights = conv_layer(input, 4, 3, 48, 2, "d_conv1", reuse=reuse)
  File "/home/ubuntu/trinh/Edited_ImageDenoisingGAN /conv_helper.py", line 10, in conv_layer
    output = slim.batch_norm(output)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 643, in batch_norm
    outputs = layer.apply(inputs, training=is_training)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 671, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 559, in __call__
    self.build(input_shapes[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/normalization.py", line 201, in build
    trainable=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 458, in add_variable
    trainable=trainable and self.trainable)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 417, in get_variable
    return custom_getter(**custom_getter_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1539, in layer_variable_getter
    return _model_variable_getter(getter, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1531, in _model_variable_getter
    custom_getter=getter, use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 262, in model_variable
    use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 217, in variable
    use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 742, in _get_single_variable
    name, "".join(traceback.format_list(tb))))
ValueError: Variable d_conv1/BatchNorm/beta already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 217, in variable
    use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 262, in model_variable
    use_resource=use_resource)

manumathewthomas commented 6 years ago

Are you running it on TensorFlow 1.0?

Toni-Chan commented 6 years ago

I have hit the same problem on TensorFlow 1.8.0.

manumathewthomas commented 6 years ago

Can you try with TensorFlow 1.0?

phaniavi commented 5 years ago

@Toni-Chan did you solve the above issue?

Toni-Chan commented 5 years ago

It seemed to work, but it would always crash after some fifteen cycles of training. It doesn't quite fit my task, so I dropped it.

coco1549134149 commented 5 years ago

I hit the same error with TensorFlow 1.10.

coco1549134149 commented 5 years ago

@TrinhQuocNguyen how did you solve the problem? I am new to this.

wish829 commented 5 years ago

I hit the same error with TensorFlow 1.10.

Hi, I have the same problem with TensorFlow 1.10. Have you solved it yet?

coco1549134149 commented 5 years ago

I hit the same error with TensorFlow 1.10.

Hi, I have the same problem with TensorFlow 1.10. Have you solved it yet?

Hi, I added some code to conv_layer in conv_helper.py:

def conv_layer(input_image, ksize, in_channels, out_channels, stride, scope_name, activation_function=lrelu, reuse=False):
    with tf.variable_scope(scope_name):
        if reuse:
            tf.get_variable_scope().reuse_variables()  # (Generally this is the case)
        filter = tf.Variable(tf.random_normal([ksize, ksize, in_channels, out_channels], stddev=0.03))
        output = tf.nn.conv2d(input_image, filter, strides=[1, stride, stride, 1], padding='SAME')
        output = slim.batch_norm(output)
        if activation_function:
            output = activation_function(output)
        return output, filter

But when I train with this code, the loss becomes very large, like -46595465644.

kaushiksk commented 5 years ago

Facing the same issue on tensorflow v1.1.

firdameng commented 5 years ago

@kaushiksk @wish829 You can try adding reuse=reuse to the variable scope in the conv_layer function: with tf.variable_scope(scope_name, reuse=reuse):
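To illustrate why forwarding reuse to the scope matters, here is a toy model of the create-or-reuse check behind the "already exists, disallowed" error. It is plain Python, not TensorFlow: ToyVariableStore and toy_conv_layer are hypothetical names made up for this sketch. It also assumes that the original conv_layer did not forward reuse to its scope, which is inferred from the traceback and the suggested fix rather than confirmed in the thread.

```python
class ToyVariableStore:
    """Hypothetical stand-in for the variable store behind tf.get_variable."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, name, reuse):
        if name in self._vars:
            if not reuse:
                # Same situation as the error in the traceback above.
                raise ValueError("Variable %s already exists, disallowed." % name)
            return self._vars[name]
        if reuse:
            raise ValueError("Variable %s does not exist." % name)
        self._vars[name] = object()  # stand-in for a real tensor
        return self._vars[name]


def toy_conv_layer(store, scope_name, reuse, pass_reuse_to_scope):
    """Models only the batch-norm variable lookup inside conv_layer."""
    # Assumption: if reuse is not forwarded, the scope acts as reuse=False.
    scope_reuse = reuse if pass_reuse_to_scope else False
    return store.get_variable(scope_name + "/BatchNorm/beta", scope_reuse)


store = ToyVariableStore()
# First discriminator call creates the variable, second call reuses it.
first = toy_conv_layer(store, "d_conv1", reuse=False, pass_reuse_to_scope=True)
second = toy_conv_layer(store, "d_conv1", reuse=True, pass_reuse_to_scope=True)
print(first is second)  # True
```

With pass_reuse_to_scope=False, the second call raises ValueError, which mirrors the failure at Dg = discriminator(Gz, reuse=True).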

wish829 commented 5 years ago

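@kaushiksk @wish829 You can try adding reuse=reuse to the variable scope in the conv_layer function: with tf.variable_scope(scope_name, reuse=reuse):

Hello, thank you very much for the reply. I have solved this problem, but I have run into another one: after running for a while, the error "GraphDef cannot be larger than 2GB" appears. Have you encountered it, and is there any way to fix it?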

firdameng commented 5 years ago

@wish829 A problem that may be similar to yours is troubling me, but I have no idea how to solve it yet.

Step 12000/100000 Gen Loss: 12350353000.0 Disc Loss: 1.4004402 PSNR: 26.14135902868828 SSIM: 0.8665272159942402
Step 12010/100000 Gen Loss: 12292836000.0 Disc Loss: 1.400274 PSNR: 25.94992507552601 SSIM: 0.8655586891626611
Step 12020/100000 Gen Loss: 15851811000.0 Disc Loss: 1.4003873 PSNR: 26.15803991559578 SSIM: 0.8670629243999279
Step 12030/100000 Gen Loss: 17971567000.0 Disc Loss: 1.4054713 PSNR: 25.995083258334443 SSIM: 0.8650341726544077
Step 12040/100000 Gen Loss: 11211838000.0 Disc Loss: 1.4014628 PSNR: 26.222578627884307 SSIM: 0.8671878400058987
Step 12050/100000 Gen Loss: 11266576000.0 Disc Loss: 1.4024365 PSNR: 26.241693138333112 SSIM: 0.8671688884932088
Step 12060/100000 Gen Loss: 17548194000.0 Disc Loss: 1.4003773 PSNR: 26.00901044193707 SSIM: 0.865215676049251
Step 12070/100000 Gen Loss: 23370770000.0 Disc Loss: 1.400408 PSNR: 26.120438055008826 SSIM: 0.8658342743148607
Step 12080/100000 Gen Loss: 10717686000.0 Disc Loss: 1.400349 PSNR: 26.106458721349533 SSIM: 0.8668624241359252
Step 12090/100000 Gen Loss: 11456956000.0 Disc Loss: 1.4003404 PSNR: 26.14799384579858 SSIM: 0.8670219117032115
Step 12100/100000 Gen Loss: 16212880000.0 Disc Loss: 1.4002614 PSNR: 26.113362434651755 SSIM: 0.8664612569021332
Step 12110/100000 Gen Loss: 17638543000.0 Disc Loss: 1.4002542 PSNR: 26.076150132750907 SSIM: 0.8671447135213092
Step 12120/100000 Gen Loss: 11208785000.0 Disc Loss: 1.4002402 PSNR: 26.08032439059292 SSIM: 0.8651132589262946
Step 12130/100000 Gen Loss: 10916414000.0 Disc Loss: 1.4002037 PSNR: 25.984372670853546 SSIM: 0.8638963023554623
Step 12140/100000 Gen Loss: 18496344000.0 Disc Loss: 1.4002202 PSNR: 26.088072032465114 SSIM: 0.8653338786023229
Step 12150/100000 Gen Loss: 16589135000.0 Disc Loss: 1.4002035 PSNR: 25.83919717160348 SSIM: 0.8617282185640999
Step 12160/100000 Gen Loss: 12576207000.0 Disc Loss: 1.4002212 PSNR: 26.20429954486186 SSIM: 0.8678690929255833
Step 12170/100000 Gen Loss: 10253853000.0 Disc Loss: 1.4001697 PSNR: 26.09443495684801 SSIM: 0.8662818556931983
Step 12180/100000 Gen Loss: 15371867000.0 Disc Loss: 1.4002001 PSNR: 26.039226803571175 SSIM: 0.8658360602984164
Step 12190/100000 Gen Loss: 23898941000.0 Disc Loss: 1.4022608 PSNR: 25.946668566224393 SSIM: 0.8640443029233819
Step 12200/100000 Gen Loss: 30478232000.0 Disc Loss: 1.4523464 PSNR: 25.13610710166375 SSIM: 0.8386938824245166
Step 12210/100000 Gen Loss: 28219340000.0 Disc Loss: 1.4221447 PSNR: 25.567860541426697 SSIM: 0.8482851503651547
Step 12220/100000 Gen Loss: 22672560000.0 Disc Loss: 1.4141589 PSNR: 25.98554835956014 SSIM: 0.8506765904464207
Step 12230/100000 Gen Loss: 24188324000.0 Disc Loss: 1.4071617 PSNR: 26.35134289574428 SSIM: 0.8587045622759665
Step 12240/100000 Gen Loss: 13836016000.0 Disc Loss: 1.4070854 PSNR: 26.488961297255237 SSIM: 0.8696061696106584

[libprotobuf FATAL external/protobuf_archive/src/google/protobuf/message_lite.cc:68] CHECK failed: (byte_size_before_serialization) == (byte_size_after_serialization): tensorflow.GraphDef was modified concurrently during serialization.
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: (byte_size_before_serialization) == (byte_size_after_serialization): tensorflow.GraphDef was modified concurrently during serialization.

firdameng commented 5 years ago

@wish829 I solved my problem by deleting the Graphs directory, which was in root mode, but the training process often collapses.

2019-02-26 16:55:50.330234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2019-02-26 16:55:50.473725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-26 16:55:50.473751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2019-02-26 16:55:50.473756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2019-02-26 16:55:50.473909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7311 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Step 10/100000 Gen Loss: 37582434000.0 Disc Loss: 1.5633842 PSNR: 20.424394266955158 SSIM: 0.7810406358624654
Step 20/100000 Gen Loss: 35300233000.0 Disc Loss: 1.5486042 PSNR: 20.080686554357992 SSIM: 0.7819879789024556
Step 30/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627
Step 40/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627
Step 50/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627
Step 60/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627
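When comparing runs like the one above, it can help to find the first logged step at which a loss stops being finite. The sketch below does that for the "Step N/M Gen Loss: ... Disc Loss: ..." line format quoted in this thread. The function name first_non_finite_step is made up for this example, and the sketch only locates the step; it says nothing about why the loss diverged.

```python
import math
import re

# Matches the progress lines quoted in this thread; only the step number
# and the two losses are captured.
STEP_RE = re.compile(
    r"Step\s+(\d+)/\d+\s+Gen Loss:\s+(\S+)\s+Disc Loss:\s+(\S+)"
)


def first_non_finite_step(log_text):
    """Return the first step whose Gen or Disc loss is nan/inf, else None."""
    for match in STEP_RE.finditer(log_text):
        step = int(match.group(1))
        gen_loss = float(match.group(2))   # float("nan") parses fine
        disc_loss = float(match.group(3))
        if not (math.isfinite(gen_loss) and math.isfinite(disc_loss)):
            return step
    return None


sample = (
    "Step 20/100000 Gen Loss: 35300233000.0 Disc Loss: 1.5486042 "
    "PSNR: 20.080686554357992 SSIM: 0.7819879789024556\n"
    "Step 30/100000 Gen Loss: nan Disc Loss: nan "
    "PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627\n"
)
print(first_non_finite_step(sample))  # 30
```

For the log above this narrows the divergence to somewhere between steps 20 and 30, since only every tenth step is printed.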

fourteen14fourteen commented 5 years ago

@kaushiksk @wish829 You can try adding reuse=reuse to the variable scope in the conv_layer function: with tf.variable_scope(scope_name, reuse=reuse):

I downloaded vgg16.tfmodel from another source, and when I run python train.py I get this error:

Traceback (most recent call last):
  File "/home/usst/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 523, in import_graph_def
    ret.append(name_to_op[operation_name].outputs[output_index])
KeyError: 'conv2_2/conv2_2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 91, in <module>
    train()
  File "train.py", line 33, in train

qiongshuai commented 5 years ago

@firdameng Why are Gen Loss and Disc Loss nan? Thanks.

Step 90/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627
Step 100/100000 Gen Loss: nan Disc Loss: nan PSNR: 5.7679735050889995 SSIM: 0.00017517162219731627

stefenmax commented 4 years ago

Hi guys, the CKPT file and dataset are no longer available. Could you send them to me if possible? Thanks!

Tian14267 commented 4 years ago

Do you have the training dataset file? Could you share a copy with me? Thank you very much. @TrinhQuocNguyen @manumathewthomas @Toni-Chan @phaniavi @coco1549134149 @stefenmax @qiongshuai @fourteen14fourteen @firdameng @wish829

hNeji commented 4 years ago

Did you solve this problem, please?

SolicTous commented 4 years ago

Do you have the training dataset file? Could you share a copy with me? Thank you very much if you can. @TrinhQuocNguyen @manumathewthomas @Toni-Chan @phaniavi @coco1549134149 @stefenmax @qiongshuai @fourteen14fourteen @firdameng @wish829 @Tian14267

SolicTous commented 4 years ago

I can make my own dataset, but I don't know what the metric images are, or other details about preparing the dataset for training. How do you do it, colleagues?

Susan3333 commented 4 years ago

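I can create my own dataset, but I don't know what the metric images are, or other details about preparing the dataset for training. How do you do it, colleagues?

What is the image size of the dataset you created?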

SolicTous commented 4 years ago

You can try adding reuse=reuse to the variable scope in the conv_layer function: with tf.variable_scope(scope_name, reuse=reuse):

@firdameng I had the same problem with reuse. Thank you, I solved it by adding this.

I can create my own dataset, but I don't know what the metric images are, or other details about preparing the dataset for training. How do you do it, colleagues?

What is the image size of the dataset you created?

@Susan3333 The dataset is reshaped to 256x512. Now I am having trouble understanding how to prepare the dataset. What names must the training and validation images have? And for validation, are those the original images? I have tried 'gauss + 2092 + A', but I'm not sure. Can anybody who has trained this tell me the structure of the dataset? Where must the ground-truth images be?

Now I have another error. Why is there padding in the code? What is it for?

npad = ((0, 0), (56, 56), (0, 0), (0, 0))
validation = np.pad(validation, pad_width=npad, mode='constant', constant_values=0)

And this?

image = np.resize(image[7][56:, :, :], [144, 256, 3])

ValueError: Cannot feed value of shape (1, 368, 512, 3) for Tensor 'generated_image:0', which has shape '(?, 256, 256, 3)'
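The shape in that ValueError can be reproduced with plain arithmetic on the quoted npad value, which may help explain what the padding is for. The sketch below computes the shape np.pad would return without needing NumPy; padded_shape is a hypothetical helper name. The 144x256 frame size is an assumption inferred from the np.resize target [144, 256, 3] and the (?, 256, 256, 3) placeholder, not something the thread confirms.

```python
def padded_shape(shape, pad_width):
    """Shape that np.pad would return for an array of `shape`."""
    return tuple(dim + before + after
                 for dim, (before, after) in zip(shape, pad_width))


# Padding spec quoted in the thread: 56 rows on each side of axis 1 (height).
NPAD = ((0, 0), (56, 56), (0, 0), (0, 0))

# A 256x512 batch grows to the shape reported in the ValueError.
print(padded_shape((1, 256, 512, 3), NPAD))  # (1, 368, 512, 3)

# A 144x256 batch (assumed original frame size) grows to 256x256.
print(padded_shape((1, 144, 256, 3), NPAD))  # (1, 256, 256, 3)
```

Under that assumption, the padding turns 144x256 frames into the 256x256 input the placeholder expects, and the image[7][56:, :, :] slice followed by np.resize to [144, 256, 3] appears to strip the padded rows again. A 256x512 dataset would therefore seem to need either resizing to the assumed frame size or different padding and placeholder shapes.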

Susan3333 commented 4 years ago

Thank you for your reply.

At 2020-09-17 13:33:59, "Marsel Iamaev" notifications@github.com wrote:

You can try adding reuse=reuse to the variable scope in the conv_layer function: with tf.variable_scope(scope_name, reuse=reuse):

@firdameng I had the same problem with reuse. Thank you, I solved it by adding this.

I can create my own dataset, but I don't know what the metric images are, or other details about preparing the dataset for training. How do you do it, colleagues?

What is the image size of the dataset you created?

@Susan3333 Now I am having trouble understanding how to prepare the dataset. What names must the training and validation images have? And for validation, are those the original images? I have tried 'gauss + 2092 + A', but I'm not sure. Can anybody who has trained this tell me the structure of the dataset? Where must the ground-truth images be?


SolicTous commented 4 years ago

So I got this running, but now I have two problems.

1) GraphDef cannot be larger than 2GB (training breaks after several iterations):

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2693993471bytes) would be larger than the limit (2147483647 bytes)

2) Gen Loss: nan (training breaks after several iterations)
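As a quick check on the numbers in the first error, the quoted limit is exactly 2**31 - 1 bytes, the largest signed 32-bit integer, so the graph has to shrink rather than the limit grow. The sketch below only does that arithmetic; the constant names are made up for this example.

```python
LIMIT_BYTES = 2147483647   # limit quoted in the error message
GRAPH_BYTES = 2693993471   # serialized size quoted in the error message

# The limit is the largest signed 32-bit integer.
print(LIMIT_BYTES == 2**31 - 1)          # True

# How far the serialized graph exceeds the limit, in bytes.
print(GRAPH_BYTES - LIMIT_BYTES)         # 546509824

# Serialized graph size in GiB.
print(round(GRAPH_BYTES / 2**30, 2))     # 2.51
```

In TensorFlow 1.x, a graph that only exceeds the limit after training has run for a while often means new ops are being added inside the training loop, although the thread does not establish whether that is the cause here. One way to test that guess is to call tf.get_default_graph().finalize() after building the model, which makes any later attempt to add an op raise an error at the offending line.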