rosinality / style-based-gan-pytorch

Implementation of A Style-Based Generator Architecture for Generative Adversarial Networks in PyTorch

KeyError: 'generator' when continuing from checkpoint #56

Open Ghostbeach opened 5 years ago

Ghostbeach commented 5 years ago

Hey there, first, thank you for your amazing work on this PyTorch StyleGAN, I got it to work quite flawlessly. I trained on a fairly small custom dataset on the free GPU on Google Colab for several hours. Now I have two .model checkpoints saved in the mounted Google Drive. However, when I try to continue training from the checkpoint "020000.model" I get a KeyError: 'generator'. Help would be really appreciated since I have been searching for a fix for quite some time now.

See output:

!python ./train.py --ckpt ./checkpoint/020000.model ./datasets/custom

Traceback (most recent call last):
  File "./train.py", line 316, in <module>
    generator.module.load_state_dict(ckpt['generator'])
KeyError: 'generator'

Thank you in advance!

rosinality commented 5 years ago

Sorry, the [DIGITS].model files only save the running average of the generator. The generator and discriminator are saved in the train-step-X.model files when the train phase changes. Could you use those checkpoints?
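
If you want to check which kind of file you have, you can inspect the saved keys (just a quick sketch, assuming the checkpoint path from the command above):

import torch

ckpt = torch.load('./checkpoint/020000.model', map_location='cpu')

if isinstance(ckpt, dict) and 'generator' in ckpt:
    # Full training state (train-step-X.model style): usable with --ckpt.
    print('full checkpoint, keys:', list(ckpt.keys()))
else:
    # Plain state_dict of the running-average generator only, which is why
    # ckpt['generator'] raises the KeyError in train.py.
    print('running-average generator state_dict,', len(ckpt), 'entries')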

Ghostbeach commented 5 years ago

Thank you for your quick answer. So the problem probably is that there is no train-step-X.model yet, because training did not get far enough to make the jump to 16x16 resolution, where it would be saved, as far as I understand. Is there an easy way to modify the code so that the training state gets saved more often than only on resolution jumps?

Ghostbeach commented 5 years ago

Or maybe I'll just need to get better hardware ( :

rosinality commented 5 years ago

Hmm, I thought 20,000 iterations were enough to change phases. You can copy these lines https://github.com/rosinality/style-based-gan-pytorch/blob/master/train.py#L100 to here https://github.com/rosinality/style-based-gan-pytorch/blob/master/train.py#L241 to save the full state for resuming training. Sorry for the confusion.
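
For reference, the copied save would look roughly like this (a sketch only; the exact variable names and the filename pattern are assumptions, so check the code at the links above):

# Around train.py#L241, next to the existing save of the running-average
# generator, also dump the full training state so a run can be resumed
# before the first resolution jump. Variable names assumed from train.py.
torch.save(
    {
        'generator': generator.module.state_dict(),
        'discriminator': discriminator.module.state_dict(),
        'g_optimizer': g_optimizer.state_dict(),
        'd_optimizer': d_optimizer.state_dict(),
        'g_running': g_running.state_dict(),
    },
    f'checkpoint/resume_{str(i + 1).zfill(6)}.model',
)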

MohammedAlghamdi commented 4 years ago

Thank you for sharing your code.

I think you also need to set --init_size to the resolution where training stopped when you resume; otherwise it restarts from 8x8. An example command is below.
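
For example, if training had reached 32x32 and the latest phase checkpoint is train-step-3.model (both values are placeholders, use whatever your run actually produced), the resume command would look something like:

!python ./train.py --ckpt ./checkpoint/train-step-3.model --init_size 32 ./datasets/custom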