FrugoFruit90 closed this issue 4 years ago.
I've never seen this before, and it seems odd to me. The error is raised while computing the FID score. Notably, the generator loss is still finite, which indicates the problem is caused by neither gradient explosion nor numerical issues during training.
Could you run the experiment again to see whether it happens again? Note that our code should automatically recover from the latest checkpoint of your experiment.
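For context, FID compares the statistics of Inception activations for real and generated images, so a single non-finite activation can crash the metric even while the training losses look healthy. Below is a minimal sketch of the standard FID formula, not this repository's exact code; the array names `act1`/`act2` and the helper `fid` are assumptions for illustration:

```python
# Minimal sketch of the standard FID computation, assuming `act1` and `act2`
# are (N, D) arrays of Inception activations for real and generated images.
import numpy as np
from scipy import linalg

def fid(act1, act2):
    # A non-finite activation (NaN/Inf) would poison the statistics below,
    # which is one way FID can fail while the generator loss stays finite.
    if not (np.isfinite(act1).all() and np.isfinite(act2).all()):
        raise ValueError("Non-finite activations passed to FID")
    mu1, sigma1 = act1.mean(axis=0), np.cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), np.cov(act2, rowvar=False)
    # Matrix square root of the covariance product; may come back complex
    # with tiny imaginary parts due to floating-point error.
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```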
Thank you for your answer. How can I enable resuming from a checkpoint, or is it done automatically?
It happens automatically; just run the same training command. The console messages will tell you whether it found and recovered from a checkpoint.
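For reference, here is a minimal sketch of how TF1-style automatic recovery typically works; the repository's training loop does the equivalent internally, and `log_dir` is a hypothetical path rather than the repo's actual variable:

```python
# Minimal sketch of TF1-style checkpoint recovery; `log_dir` is hypothetical.
import tensorflow as tf

log_dir = "./logs/my_experiment"          # hypothetical checkpoint directory
global_step = tf.Variable(0, name="global_step", trainable=False)
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(log_dir)
    if ckpt is not None:
        saver.restore(sess, ckpt)        # resume from the latest checkpoint
        print("Recovered from checkpoint:", ckpt)
    else:
        sess.run(tf.global_variables_initializer())
        print("No checkpoint found; training from scratch")
```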
Seems to have worked: I'm on epoch 55 and past global step 132,000. I guess it was just an anomaly.
Using Python 3.6.9 and the packages advised in the README, I tried to train from scratch with:
I didn't change the number of epochs advised in the yaml. The training started and ran for quite some time, and the TensorBoard curves look fine, I think.
However, after exactly 130,000 steps there was an error; I paste the traceback below. Any idea why this happened?