open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0

Too big loss after few iterations #435

Closed cs20162004 closed 3 years ago

cs20162004 commented 3 years ago

Hello. Thank you for your work!

I am training GLEAN for a 256x256 input size and 1024x1024 output size using the same configuration (i.e. glean_ffhq_x16.py) that you provide. I only changed the input size, scaling factor, and data path. However, after a few iterations my loss becomes very large and then decreases again, like this: (screenshot attached: Screenshot from 2021-07-16 09-26-10)

My questions are:

  1. Is it because of the Adam optimizer, i.e. does it oscillate a bit before converging? And can decreasing the learning rate solve this issue?
  2. loss_d_real and loss_d_fake are 0.0000 at some iterations; what could this mean?

Thank you for your time!

ckkelvinchan commented 3 years ago

The training is corrupted.

  1. Yes, you can try decreasing the learning rate. You may also need to adjust the loss weights; the current settings may not be optimal for 4x SR (see the config sketch after this list).
  2. When the training is corrupted, the generator produces meaningless outputs and the discriminator can easily distinguish "real" from "fake", hence the losses of 0 here.
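
For reference, here is a hedged sketch of what those overrides might look like in an MMEditing-style GLEAN config. The keys follow the usual glean_ffhq_x16.py layout, but the exact names and defaults depend on your MMEditing version, and every number below is an illustrative placeholder, not a tuned recommendation:

```python
# Sketch only: lower learning rates and rebalanced loss weights for 4x SR.
# Verify all keys against your own glean_ffhq_x16.py before use.
optimizers = dict(
    generator=dict(type='Adam', lr=1e-5, betas=(0.9, 0.99)),      # e.g. 10x lower
    discriminator=dict(type='Adam', lr=1e-5, betas=(0.9, 0.99)))  # e.g. 10x lower

model = dict(
    # ... other fields unchanged ...
    pixel_loss=dict(type='MSELoss', loss_weight=1.0, reduction='mean'),
    perceptual_loss=dict(
        type='PerceptualLoss',
        vgg_type='vgg16',
        layer_weights={'21': 1.0},
        perceptual_weight=1e-2,  # candidate knob to retune for 4x SR
        style_weight=0),
    gan_loss=dict(
        type='GANLoss',
        gan_type='vanilla',
        loss_weight=1e-2))       # candidate knob to retune for 4x SR
```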
cs20162004 commented 3 years ago

Hello. Thank you for your reply! I have trained GLEAN on the FFHQ dataset with upscale factor = 4 for 250,000 iterations. However, the output results still don't look natural. My output looks like the following: (two output samples attached)

I decreased the learning rates for the generator and discriminator by a factor of 10, after which the training became stable and the validation PSNR started increasing. But I didn't change the loss weights, because I am not sure how to choose them correctly. Do you have any suggestions based on this output? My config file, in case you need it: glean_ffhq_4x.txt

ckkelvinchan commented 3 years ago

May I ask whether you are using the same downsampling method for training and test images?

cs20162004 commented 3 years ago

Thank you for your quick reply!

I use torch.nn.functional.interpolate() for downsampling; by default it uses nearest mode.

EDIT: Yes, I use the same downsampling method for training and validation images.
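
For concreteness, a minimal sketch of that call (nearest really is the default mode of F.interpolate; the tensor shapes below are just illustrative for the 4x setting discussed here):

```python
import torch
import torch.nn.functional as F

# Dummy HR batch: (N, C, H, W) = (1, 3, 1024, 1024), values in [0, 1].
hr = torch.rand(1, 3, 1024, 1024)

# Default behaviour: mode='nearest', no anti-aliasing.
lr_nearest = F.interpolate(hr, scale_factor=1 / 4)

# Explicit bicubic downsampling for comparison (still no anti-aliasing).
lr_bicubic = F.interpolate(hr, scale_factor=1 / 4, mode='bicubic',
                           align_corners=False)

print(lr_nearest.shape, lr_bicubic.shape)  # both: torch.Size([1, 3, 256, 256])
```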

ckkelvinchan commented 3 years ago

Since your learning rate is smaller, you may want to train for longer, say 600k iterations. You can observe the loss curve to see whether it has converged.

cs20162004 commented 3 years ago

Hello @ckkelvinchan . Thank you for your reply!

I am now doing that. I looked at the loss curve of my training and the log that you provided for the FFHQ training dataset, and so far they look similar. I was wondering how you chose 300k iterations. Did you choose it based on your training loss curve? If yes, then which loss (pixel, perceptual, or GAN)? I ask because the validation PSNR doesn't seem to improve much. Thank you!

ckkelvinchan commented 3 years ago

> Hello @ckkelvinchan . Thank you for your reply!
>
> I am now doing that. I looked at the loss curve of my training and the log that you provided for the FFHQ training dataset, and so far they look similar. I was wondering how you chose 300k iterations. Did you choose it based on your training loss curve? If yes, then which loss (pixel, perceptual, or GAN)? I ask because the validation PSNR doesn't seem to improve much. Thank you!

The loss weights and training schemes were not carefully tuned, so I think there is a chance that the model has not converged. You can continue observing the PSNR value to see whether the performance eventually gets better.

cs20162004 commented 3 years ago

I think you are already aware that the torch.nn.functional.interpolate() bicubic mode produces results very similar to nearest mode (neither applies anti-aliasing). After changing the LR dataset from the PyTorch bicubic method to the MATLAB bicubic method, the validation PSNR increased by ~3 dB from the first iteration. And after training for 60,000 iterations with a 5e-5 learning rate for both the generator and discriminator, I got SR images similar to those in the paper. Thank you for your replies!
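
For anyone hitting the same mismatch: recent PyTorch versions (1.11+, if I remember correctly) expose an antialias flag on F.interpolate that approximates MATLAB's imresize much more closely than the default call; a minimal sketch:

```python
import torch
import torch.nn.functional as F

hr = torch.rand(1, 3, 1024, 1024)

# Plain bicubic: no anti-aliasing, so the LR image is aliased and does not
# match LR data produced by MATLAB's imresize.
lr_plain = F.interpolate(hr, scale_factor=1 / 4, mode='bicubic',
                         align_corners=False)

# antialias=True (PyTorch >= 1.11) low-pass filters before resampling,
# which closely matches MATLAB's default imresize behaviour.
lr_aa = F.interpolate(hr, scale_factor=1 / 4, mode='bicubic',
                      align_corners=False, antialias=True)
```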

ckkelvinchan commented 3 years ago

The bicubic mode in MATLAB has anti-aliasing, so the LR image has better quality when downsampled by the MATLAB bicubic method. It is normal to get a lower PSNR when your LR images are produced by nearest downsampling.

Anyway, it is good that similar PSNR can be achieved :)