lucidrains / lightweight-gan

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in PyTorch. High-resolution image generation that can be trained within a day or two
MIT License

Discriminator loss converges to 0 while generator loss stays high #133

Open demiahmed opened 2 years ago

demiahmed commented 2 years ago

I am trying to train on a custom image dataset for about 600,000 epochs. At about the halfway point, my D_loss converges to 0 while my G_loss stays put at 2.5.

My evaluation outputs are slowly starting to fade out to either black or white.

Is there anything I could do to tweak my model, either by increasing the threshold for the Discriminator or by training the Generator only?

iScriptLex commented 2 years ago

This is a kind of vanishing-gradient problem in GANs. It means the generator has reached its limit on your dataset and is starting to rearrange its capacity by dropping rare modes, so with each iteration the generator's output loses more and more diversity. Like this:

[screenshot: a grid of generated samples that look nearly identical]

Technically the output images are not identical, but they look too similar and contain only a few dataset features.
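One quick way to confirm this numerically (a hypothetical probe, not part of lightweight-gan; the `generator(z)` call signature and `latent_dim = 256` are assumptions): sample a batch and measure the per-pixel standard deviation across it. A value near zero means the samples have collapsed to near-identical images.

```python
import torch

# Hypothetical diversity probe, not part of lightweight-gan's API.
# Assumes the generator maps latents of shape (n, latent_dim) to
# images of shape (n, C, H, W).

@torch.no_grad()
def sample_diversity(generator, latent_dim = 256, n = 64, device = 'cuda'):
    z = torch.randn(n, latent_dim, device = device)
    imgs = generator(z)                      # (n, C, H, W)
    # Std across the batch dimension, averaged over all pixels:
    # near 0 means the samples are nearly identical (collapsed).
    return imgs.std(dim = 0).mean().item()
```

Tracking this value during training can tell you whether it is trending toward zero at the same time as your D_loss drops to 0, which is exactly the collapse described above.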

It could mean that your dataset is too complicated, unbalanced or just too small.

There are several ways to deal with it.

  1. Improve your dataset: add more images, remove outliers that differ too much from most of the other pictures, etc.
  2. Reduce the learning rate: --learning-rate 1e-4 or even --learning-rate 1e-5 (of course, it should be reduced not from the start of training, but only once your discriminator loss drops too low).
  3. Continue your training with an increased batch size: --batch-size 64. If you don't have enough VRAM for that, use gradient accumulation with your original batch size: --gradient-accumulate-every 2
  4. Use TTUR. This GAN contains code for it, but for some reason it is not exposed in the list of input parameters, so you should modify cli.py for that (see the sketch after this list).

In cli.py, after the line def train_from_folder(, add ttur_mult = 1.0, to the parameter list, and after model_args = dict(, add ttur_mult = ttur_mult, to that dict.

Then, use it like this: --ttur-mult 2.0

  5. Add more augmentation: --aug-prob 0.6 or even --aug-prob 0.8
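A sketch of the cli.py edit from step 4 (in the Trainer, ttur_mult scales the discriminator's learning rate). Only ttur_mult is new; the surrounding parameters are abbreviated here, and the ordering and defaults in your copy of cli.py may differ:

```python
# Sketch of the cli.py change from step 4, not a literal diff.
# Adjust to match the parameter list in your version of lightweight-gan.

def train_from_folder(
    data = './data',
    results_dir = './results',
    name = 'default',
    batch_size = 10,
    gradient_accumulate_every = 4,
    ttur_mult = 1.0,    # add this: LR multiplier for the discriminator
    # ... the rest of the existing parameters stay unchanged ...
):
    model_args = dict(
        name = name,
        batch_size = batch_size,
        gradient_accumulate_every = gradient_accumulate_every,
        ttur_mult = ttur_mult,    # add this: forwards the flag to the Trainer
        # ... the rest of the existing entries stay unchanged ...
    )
```

The new parameter then becomes available as the --ttur-mult flag shown above.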

Other methods depend greatly on your dataset and require code modifications (such as certain kinds of regularization during the training process).

demiahmed commented 2 years ago

Thanks for all the suggestions. I am trying out a combination of all measures.

My default --gradient-accumulate-every is 4. Does higher gradient accumulation imitate a larger batch size?

I'm using an RTX 3080 with 10 GB of VRAM and a dataset of 4.3k images, so I can't push my batch size beyond 8.

iScriptLex commented 2 years ago

> Does higher gradient accumulation imitate a larger batch size?

Yes, it does: the effective batch size is batch_size × gradient_accumulate_every. You can set --gradient-accumulate-every 8 or even more.
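A minimal sketch of why this works (toy model and loss, not lightweight-gan's actual training loop): gradients from several micro-batches are accumulated before a single optimizer step, so --batch-size 8 with --gradient-accumulate-every 8 updates with gradients averaged over 64 samples while only ever holding 8 in VRAM.

```python
import torch

# Toy gradient-accumulation sketch (assumed model/loss, not lightweight-gan's
# training loop). Effective batch = micro_batch * accumulate_every = 8 * 8 = 64.

model = torch.nn.Linear(784, 1)
opt = torch.optim.Adam(model.parameters(), lr = 2e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()

micro_batch, accumulate_every = 8, 8

opt.zero_grad()
for _ in range(accumulate_every):
    x = torch.randn(micro_batch, 784)        # one micro-batch in VRAM at a time
    y = torch.ones(micro_batch, 1)
    loss = loss_fn(model(x), y)
    (loss / accumulate_every).backward()     # scale so gradients average, not sum
opt.step()                                   # one update over 64 effective samples
```

The only cost relative to a real batch of 64 is more wall-clock time per update and slightly different batch-norm statistics, since each micro-batch is normalized separately.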