knazeri / edge-connect

EdgeConnect: Structure Guided Image Inpainting using Edge Prediction, ICCV 2019 https://arxiv.org/abs/1901.00212
http://openaccess.thecvf.com/content_ICCVW_2019/html/AIM/Nazeri_EdgeConnect_Structure_Guided_Image_Inpainting_using_Edge_Prediction_ICCVW_2019_paper.html

The Problem of training Edge model #33

Closed Haoyanlong closed 5 years ago

Haoyanlong commented 5 years ago

Hello, I ran into some trouble when training the edge model with the default parameters (LR: 0.0001, D2G_LR: 0.1), and I don't understand the difference between LR and D2G_LR. [training loss curves] The gen_loss has been oscillating.

[edge model outputs vs. ground truth] In the edge images above, the top row is the output of the edge model and the bottom row is the ground truth. I think D2G_LR is too large; could you help me?

knazeri commented 5 years ago

As for the oscillation in the loss, this is expected behavior with GAN models, but your generator loss is way too high! A few notes to consider:

The D2G_LR flag sets the ratio of the discriminator's learning rate to the generator's. For example, if your base LR is 0.0001, then your discriminator's LR becomes 0.00001.
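For reference, here is a minimal PyTorch sketch of how such a ratio is typically wired into the optimizers (the `generator` and `discriminator` modules below are stand-ins, not the actual edge-connect networks):

```python
import torch.nn as nn
import torch.optim as optim

LR = 0.0001    # base learning rate (used by the generator)
D2G_LR = 0.1   # discriminator-to-generator learning rate ratio

# Stand-in modules; the real models are full convolutional networks.
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 3, padding=1)

gen_optimizer = optim.Adam(generator.parameters(), lr=LR, betas=(0.0, 0.9))
# The discriminator trains at LR * D2G_LR = 0.0001 * 0.1 = 0.00001:
dis_optimizer = optim.Adam(discriminator.parameters(), lr=LR * D2G_LR, betas=(0.0, 0.9))
```

Training the discriminator at a lower learning rate is a common way to keep it from overpowering the generator early on.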

Haoyanlong commented 5 years ago

@knazeri OK, I see! Thank you very much!

Haoyanlong commented 5 years ago

@knazeri I have loaded the model pretrained on Places2. I kept the parameters in config.yml unchanged, except that I set MASK: 3. The loss visualization is as follows:

[training loss curves] The gen_loss has been oscillating and slowly increasing. Could you help me? Thank you very much!

cmyyy commented 5 years ago

@Haoyanlong Hello, could you tell me how you did the visualization? Thanks a lot!

Haoyanlong commented 5 years ago

@cmyyy I use tensorboardX for the visualization. You can install it and learn more at https://github.com/lanpa/tensorboardX.
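Logging the losses takes only a few lines (a self-contained sketch; the dummy loss values below stand in for whatever your training loop computes):

```python
from tensorboardX import SummaryWriter

writer = SummaryWriter('./runs/edge_model')

# In a real run these values would come from the training loop; dummy
# numbers are used here so the snippet runs on its own.
for iteration in range(100):
    gen_loss, dis_loss = 1.0 / (iteration + 1), 0.5
    writer.add_scalar('loss/gen_loss', gen_loss, iteration)
    writer.add_scalar('loss/dis_loss', dis_loss, iteration)

writer.close()
```

Then run `tensorboard --logdir ./runs` and open the printed URL to view the curves.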

knazeri commented 5 years ago

@Haoyanlong Your generator loss is still diverging, which could be because the learning rate is too large. During training we scaled the learning rate down; the final learning rate we trained the model with was 1e-6, and any value larger than that can make the trained model diverge! Also, please note that there was an error in the default config.yml regarding the style loss weight, which was fixed here: https://github.com/knazeri/edge-connect/issues/36
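Dropping the learning rate mid-training amounts to overwriting the optimizer's parameter groups. A minimal sketch (the `set_lr` helper and the decay milestones are illustrative, not the repository's actual schedule; the thread only states that the final LR was 1e-6):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the edge generator
optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.0, 0.9))

def set_lr(optimizer, lr):
    """Overwrite the learning rate of every parameter group."""
    for group in optimizer.param_groups:
        group['lr'] = lr

# Drop the LR by 10x whenever the losses plateau, ending at 1e-6:
set_lr(optimizer, 1e-5)
# ... continue training ...
set_lr(optimizer, 1e-6)
# ... fine-tune; per the comment above, resuming above 1e-6 risks divergence
```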

I'm reopening the issue; I was not being notified of the comments.

LuckyHeart commented 5 years ago

@knazeri Thanks for sharing your code with us. Excellent work! However, like @Haoyanlong, I ran into some trouble when training the edge model: the generator loss slowly increases. Is that result right? I trained the model with learning rates of 0.000001 and 0.0001. [loss curves for both learning rates]

I only changed INPUT_SIZE: 128 and MASK: 3. The dataset I use is CelebA.

# config.yml
MODE: 1                       # 1: train, 2: test, 3: eval
MODEL: 1                      # 1: edge model, 2: inpaint model, 3: edge-inpaint model, 4: joint model
MASK: 3                       # 1: random block, 2: half, 3: external, 4: (external, random block), 5: (external, random block, half)
EDGE: 1                       # 1: canny, 2: external
NMS: 1                        # 0: no non-max-suppression, 1: applies non-max-suppression on the external edges by multiplying by Canny
SEED: 10                      # random seed
GPU: [0]                      # list of gpu ids
DEBUG: 0                      # turns on debugging mode
VERBOSE: 0                    # turns on verbose mode in the output console

TRAIN_FLIST: ./datasets/celeba_train.flist
VAL_FLIST: ./datasets/celeba_val.flist
TEST_FLIST: ./datasets/celeba_test.flist

TRAIN_EDGE_FLIST:
VAL_EDGE_FLIST:
TEST_EDGE_FLIST:

TRAIN_MASK_FLIST: ./datasets/masks_train.flist
VAL_MASK_FLIST: ./datasets/masks_val.flist
TEST_MASK_FLIST: ./datasets/masks_test.flist

LR: 0.0001                    # learning rate
D2G_LR: 0.1                   # discriminator/generator learning rate ratio
BETA1: 0.0                    # adam optimizer beta1
BETA2: 0.9                    # adam optimizer beta2
BATCH_SIZE: 8                 # input batch size for training
INPUT_SIZE: 128               # input image size for training, 0 for original size
SIGMA: 2                      # standard deviation of the Gaussian filter used in Canny edge detector (0: random, -1: no edge)
MAX_ITERS: 2e6                # maximum number of iterations to train the model

EDGE_THRESHOLD: 0.5           # edge detection threshold
L1_LOSS_WEIGHT: 1             # l1 loss weight
FM_LOSS_WEIGHT: 10            # feature-matching loss weight
STYLE_LOSS_WEIGHT: 250        # style loss weight
CONTENT_LOSS_WEIGHT: 0.1      # perceptual loss weight
INPAINT_ADV_LOSS_WEIGHT: 0.1  # adversarial loss weight

GAN_LOSS: nsgan               # nsgan | lsgan | hinge
GAN_POOL_SIZE: 0              # fake images pool size

SAVE_INTERVAL: 1000           # how many iterations to wait before saving model (0: never)
SAMPLE_INTERVAL: 1000         # how many iterations to wait before sampling (0: never)
SAMPLE_SIZE: 12               # number of images to sample
EVAL_INTERVAL: 0              # how many iterations to wait before model evaluation (0: never)
LOG_INTERVAL: 10              # how many iterations to wait before logging training status (0: never)

knazeri commented 5 years ago

@LuckyHeart Thank you for your interest and attention to detail. This is expected behavior of an adversarial loss. Normally, when training a neural network, we expect the loss to decrease monotonically. Of course, that is true when we have a fixed, well-defined loss term. In the case of an adversarial loss, the loss itself is a neural network, and the optimization is performed as a zero-sum game between the generator and the discriminator. In an ideal world, we would prefer this loss to remain constant, meaning that the generator and the discriminator are learning at the same pace.

However, in practice these networks are high-dimensional, non-convex, non-cooperative functions, and the balance between the two players cannot be guaranteed. That being said, a very mild increase in the generator loss near the end of training is acceptable: the generator is still learning, just not as fast as the discriminator. The increase in the loss essentially means that either the discriminator is learning faster, or the generator has reached its limit.

This is part of the reason training GANs is difficult. Also, keep in mind that monitoring the loss alone does not always work for GAN models; you should always look at the samples to gauge the qualitative performance of the model.
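To make the zero-sum dynamic concrete, here is a minimal sketch of the non-saturating adversarial objectives ("nsgan" in the config above), using stand-in discriminator probabilities rather than the actual networks:

```python
import torch
import torch.nn.functional as F

def nsgan_losses(d_real, d_fake):
    """d_real / d_fake: discriminator probabilities for real and generated batches."""
    # Discriminator: push D(real) -> 1 and D(fake) -> 0.
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    # Generator: push D(fake) -> 1, the exact opposite of the discriminator's
    # goal, so g_loss rises whenever the discriminator pulls ahead.
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return d_loss, g_loss

# A confident discriminator (d_fake near 0) makes the generator loss large:
d_real = torch.full((8, 1), 0.9)
d_fake = torch.full((8, 1), 0.2)
print(nsgan_losses(d_real, d_fake))
```

Because the two objectives pull in opposite directions, a rising g_loss by itself does not prove the model is broken; the samples are the real measure of quality.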

LuckyHeart commented 5 years ago

@knazeri Wow! Thanks for your reply. I learned a lot!