hubert0527 / COCO-GAN

COCO-GAN: Generation by Parts via Conditional Coordinating (ICCV 2019 oral)
https://hubert0527.github.io/COCO-GAN/
MIT License

Infinite activation while training on 64x64 CelebA #13

Closed · FrugoFruit90 closed this issue 4 years ago

FrugoFruit90 commented 4 years ago

Using Python 3.6.9 and the packages advised in the README, I tried to train from scratch with:

python ./scripts/compute_tfrecord.py --dataset celeba --resolution 64
python ./fid_utils/precalc_fid_stats.py --dataset celeba --data_path "./data/CelebA/*" --resolution=64
python main.py --config="./configs/CelebA_64x64_N2M2S32.yaml"

I didn't change the number of epochs from what the yaml advises. Training started and ran for quite some time, and the TensorBoard curves look fine to me (TensorBoard screenshot attached).

However, after exactly 130,000 steps an error occurred; the traceback is posted below. Any idea why this happened?

[CelebA_64x64_N2M2S32] [Epoch: 54; 1261/2384; global_step:129997] elapsed: 64138.1083, d: -1490.8943, g: -139648.8750, q: 0.0000
[CelebA_64x64_N2M2S32] [Epoch: 54; 1262/2384; global_step:129998] elapsed: 64138.4170, d: 151.7455, g: -136700.0781, q: 0.0000
[CelebA_64x64_N2M2S32] [Epoch: 54; 1263/2384; global_step:129999] elapsed: 64138.7237, d: 741.5792, g: -140252.2188, q: 0.0000
[CelebA_64x64_N2M2S32] [Epoch: 54; 1264/2384; global_step:130000] elapsed: 64139.0329, d: -1249.9255, g: -138572.9062, q: 0.0000
 62%|█████████████████████████████████████████████████████████████████▉                                         | 482/782 [09:07<05:41,  1.14s/it]2020-07-27 08:41:02.094112: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f19ed960100 = {1, 0} activation input is not finite.
 62%|█████████████████████████████████████████████████████████████████▉                                         | 482/782 [09:08<05:41,  1.14s/it]
2020-07-27 08:41:02.107534: W tensorflow/core/kernels/queue_base.cc:277] _1_shuffle_batch/random_shuffle_queue: Skipping cancelled enqueue attempt with queue not closed
2020-07-27 08:41:02.107891: W tensorflow/core/kernels/queue_base.cc:277] _0_input_producer: Skipping cancelled enqueue attempt with queue not closed
(the _1_shuffle_batch/random_shuffle_queue warning above is repeated another 15 times with consecutive timestamps)
Traceback (most recent call last):
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: activation input is not finite. : Tensor had NaN values
     [[{{node FID_Inception_Net/mixed_4/tower/conv_2/CheckNumerics}}]]
     [[{{node FID_Inception_Net/pool_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 131, in <module>
    trainer.train(logger, evaluator, global_step)
  File "/home/janek/Documents/COCO-GAN/trainer.py", line 419, in train
    z_iter, z_fixed, feed_dict_iter, feed_dict_fixed)
  File "/home/janek/Documents/COCO-GAN/logger.py", line 207, in log_iter
    cur_fid = evaluator.evaluate(trainer)
  File "/home/janek/Documents/COCO-GAN/evaluator.py", line 71, in evaluate
    batch_features = fid.get_activations(gen_full_images, self.sess, self.batch_size)
  File "/home/janek/Documents/COCO-GAN/fid_utils/fid.py", line 125, in get_activations
    pred = sess.run(inception_layer, {'FID_Inception_Net/ExpandDims:0': batch})
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: activation input is not finite. : Tensor had NaN values
     [[node FID_Inception_Net/mixed_4/tower/conv_2/CheckNumerics (defined at /home/janek/Documents/COCO-GAN/fid_utils/fid.py:45) ]]
     [[node FID_Inception_Net/pool_3 (defined at /home/janek/Documents/COCO-GAN/fid_utils/fid.py:45) ]]

Caused by op 'FID_Inception_Net/mixed_4/tower/conv_2/CheckNumerics', defined at:
  File "main.py", line 115, in <module>
    evaluator.build_graph()
  File "/home/janek/Documents/COCO-GAN/evaluator.py", line 45, in build_graph
    fid.create_inception_graph(inception_path)
  File "/home/janek/Documents/COCO-GAN/fid_utils/fid.py", line 45, in create_inception_graph
    _ = tf.import_graph_def( graph_def, name='FID_Inception_Net')
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 235, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3433, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3433, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3325, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/home/janek/Documents/COCO-GAN/coco_venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): activation input is not finite. : Tensor had NaN values
     [[node FID_Inception_Net/mixed_4/tower/conv_2/CheckNumerics (defined at /home/janek/Documents/COCO-GAN/fid_utils/fid.py:45) ]]
     [[node FID_Inception_Net/pool_3 (defined at /home/janek/Documents/COCO-GAN/fid_utils/fid.py:45) ]]
hubert0527 commented 4 years ago

I've never seen this before, and it seems odd to me. The error is raised while computing the FID score. In particular, the generator loss is still finite, which indicates the problem is caused neither by gradient explosion nor by numerical issues during training.
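
If it does happen again, a quick sanity check would be to guard the generated batch right before it is fed to the Inception network, along the lines of the sketch below. The names gen_full_images, fid.get_activations, and evaluator.evaluate come from your traceback; the helper itself is purely illustrative and does not exist in the repo:

import numpy as np

def sanitize_for_fid(images):
    """Hypothetical helper: report and zero out non-finite pixel values in the
    NumPy batch that evaluator.evaluate() passes to fid.get_activations()."""
    bad = ~np.isfinite(images)
    if bad.any():
        print("[evaluator] Warning: %d non-finite values in generated images; "
              "zeroing them before the FID forward pass." % int(bad.sum()))
        images = np.where(bad, 0.0, images)
    return images

Calling this on gen_full_images just before fid.get_activations(...) would at least tell you whether the generator briefly produced NaNs at that evaluation step, or whether something is off inside the FID graph itself.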

Could you run the experiment again to see whether it happens a second time? Note that our code should automatically recover from the latest checkpoint of your experiment.

FrugoFruit90 commented 4 years ago

Thank you for your answer. How do I enable resuming from a checkpoint, or is it done automatically?

hubert0527 commented 4 years ago

It recovers automatically; just run the same training command. The console messages will tell you whether it finds a checkpoint and recovers from it.
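
For reference, the auto-resume follows the standard TF1 checkpoint pattern, roughly like the sketch below; the checkpoint directory path and the dummy variable are placeholders for illustration, not the exact code in trainer.py:

import tensorflow as tf

# Minimal sketch of the usual TF1 resume logic (assumed, not verbatim trainer.py):
# find the newest checkpoint under the experiment's log directory and restore it
# into the session before training continues.
step = tf.Variable(0, name="global_step", trainable=False)
saver = tf.train.Saver()
ckpt_path = tf.train.latest_checkpoint("./logs/CelebA_64x64_N2M2S32")
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    if ckpt_path is not None:
        # Restores all saved variables (weights, optimizer state, global step).
        saver.restore(sess, ckpt_path)
        print("Recovered from checkpoint:", ckpt_path)
    else:
        print("No checkpoint found; training from scratch.")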

FrugoFruit90 commented 4 years ago

Seems to have worked; I'm on epoch 55 and past global step 132,000. I guess it was just an anomaly.
