google / compare_gan

Compare GAN code.
Apache License 2.0

Error "Retval[7] does not have value" when training SSGAN #43

Open hankook opened 4 years ago

hankook commented 4 years ago

My TensorFlow, CUDA, and cuDNN versions are 1.13.2, 10.0, and 7.6.5, respectively. I also tried TensorFlow 1.14 and 1.15, but got the same error message. Details are below.

When training SSGAN, I used the following gin configuration, which is slightly modified from examples/resnet_cifar10.gin:

dataset.name = "cifar10"
options.architecture = "resnet_cifar_arch"
options.batch_size = 64
options.gan_class = @SSGAN
options.lamba = 1
options.training_steps = 40000
options.z_dim = 128

# Generator
G.batch_norm_fn = @batch_norm
standardize_batch.decay = 0.9
standardize_batch.epsilon = 1e-5

# Discriminator
options.disc_iters = 5
D.spectral_norm = True

# Loss and optimizer
loss.fn = @non_saturating
penalty.fn = @no_penalty
SSGAN.g_lr = 0.0002
SSGAN.g_optimizer_fn = @tf.train.AdamOptimizer
SSGAN.rotated_batch_size = 64
tf.train.AdamOptimizer.beta1 = 0.5
tf.train.AdamOptimizer.beta2 = 0.999

Training then failed with the following error:

Traceback (most recent call last):
  File "main.py", line 133, in <module>
    app.run(main)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 127, in main
    eval_every_steps=FLAGS.eval_every_steps)
  File "/home/hankook/Codes/compare_gan/compare_gan/runner_lib.py", line 337, in run_with_schedule
    hooks=train_hooks)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
    rendezvous.raise_errors()
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
    saving_listeners=saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/hankook/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[7] does not have value

With the unmodified examples/resnet_cifar10.gin, training runs successfully. How can I fix this issue? Are there any example gin configurations for SSGAN?

zengsn commented 4 years ago

Yes, I got the same error. I found that it occurs whenever options.disc_iters > 1, regardless of which GAN architecture is used.

I tried to debug it but have not found a workaround so far. Could you please help with this?

@Marvin182

zengsn commented 4 years ago

After debugging, I found that one more setting is needed when training on a GPU instead of a TPU:

ModularGAN.experimental_force_graph_unroll=True
options.disc_iters = 2  # if > 1

But as the code suggests, make sure your GPU has enough memory.

Feel free to discuss.
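For anyone who wants to check a config before launching a GPU run, here is a minimal sketch of the rule described above: flag a gin file that sets options.disc_iters > 1 without ModularGAN.experimental_force_graph_unroll = True. The helper name needs_graph_unroll is hypothetical and not part of compare_gan. It parses simple `key = value` lines by hand rather than using the gin library, and it assumes a missing disc_iters binding behaves like 1.

```python
def needs_graph_unroll(gin_text: str) -> bool:
    """Return True if the config sets options.disc_iters > 1 but does not
    set ModularGAN.experimental_force_graph_unroll = True.

    Hypothetical helper for illustration; handles plain `key = value`
    lines only (no includes, macros, or multi-line values).
    """
    bindings = {}
    for raw in gin_text.splitlines():
        # Drop trailing comments; naive, assumes no '#' inside values.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        bindings[key] = value

    # Assumption: an unset disc_iters is treated as 1.
    disc_iters = int(bindings.get("options.disc_iters", "1"))
    unroll = bindings.get("ModularGAN.experimental_force_graph_unroll", "False")
    return disc_iters > 1 and unroll != "True"


if __name__ == "__main__":
    broken = "options.gan_class = @SSGAN\noptions.disc_iters = 5\n"
    fixed = broken + "ModularGAN.experimental_force_graph_unroll=True\n"
    print(needs_graph_unroll(broken))  # True
    print(needs_graph_unroll(fixed))   # False
```

This only checks the text of the config. It does not verify that the GPU has enough memory for the unrolled graph.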

czzerone commented 3 years ago

@zengsn @hankook Hi, I got the same error. Have you solved this problem?