samleoqh / DDCM-Semantic-Segmentation-PyTorch

Dense Dilated Convolutions Merging Network for Semantic Segmentation

Questions about training #2

Open SorourMo opened 3 years ago

SorourMo commented 3 years ago

Hi, thank you for sharing your code on GitHub, and congratulations on your TGRS 2020 paper (it's a great piece of work). I have two questions about the training phase of the DDCM model, and I would really appreciate your help with them:

1- Can anyone please tell me when the training process ends in train.py? I have read both the paper and the code and have not found a hint on this; I am referring specifically to training on the Vaihingen dataset. The lr appears to be reduced continuously, but I am unsure when this reduction stops, since no maximum number of epochs, minimum lr, etc. is defined in the configurations. The only restriction defined is a maximum number of iterations of 10e8. Does this mean that training should continue for 10e8 iterations? Since each iteration takes about 1 minute on my computer (batch size = 5), 10e8 iterations would take forever!

2- I ran your code with the Vaihingen dataset for approximately 16 epochs (16k iterations) and the following are the training and validation loss trends.

[Figure: training (main_loss) and validation loss curves over ~16k iterations]

As can be seen in the figure, main_loss decreases (with a few abrupt steps) from 1.5 to 0.157 (the black box in the middle of the figure shows main_loss at iteration 16k). However, val_loss keeps fluctuating between 0.40 and 0.47 instead of decreasing; for example, at epoch 16, main_loss is 0.157 while val_loss is 0.47. So, according to the figure, the training loss decreases during training, but the validation loss does not, which means there is a large gap between the training and validation losses (overfitting). Have you observed this behavior in your trainings? Do you have any specific suggestions, solutions, or comments to fix the overfitting problem?

samleoqh commented 3 years ago

Hi,

1) You can manually stop the training (Ctrl+C); I did so when I observed no further improvement in val-acc / m-F1 (e.g., for 20 epochs), or you can simply change the code to stop the training. Normally I train the model for at most about 300 epochs and then stop it manually.

2) It often happens (val-loss goes higher and higher), but you can still continue the training until you see the val-acc/mIoU/F1-score stop increasing for a long time, and then stop the training. Val-loss is not so important in this case, because the validation samples change every time (they are randomly cropped from a few big validation images).
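A minimal sketch of how this stopping rule could be automated with an epoch cap plus a patience counter on the validation score; train_one_epoch and validate are hypothetical placeholders for the repo's own training and validation steps, and the default numbers simply mirror the ones mentioned above:

import torch

def fit(net, train_one_epoch, validate, max_epochs=300, patience=20):
    # train_one_epoch(net) runs one pass over the training crops;
    # validate(net) is assumed to return the mean F1 on the validation crops.
    best_f1, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):              # hard cap instead of 10e8 iterations
        train_one_epoch(net)
        val_f1 = validate(net)
        if val_f1 > best_f1:                     # keep the best checkpoint so far
            best_f1, bad_epochs = val_f1, 0
            torch.save(net.state_dict(), 'best_model.pth')
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # no improvement for `patience` epochs
                print(f'early stop at epoch {epoch}, best F1 = {best_f1:.4f}')
                break
    return best_f1

Saving the best checkpoint along the way also means the reported model does not have to come from the final epoch, which matters when the validation metrics fluctuate.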

best.

SorourMo commented 3 years ago

Thank you for your explanation.

You can manually stop the training (Ctrl+C); I did so when I observed no further improvement in val-acc / m-F1 (e.g., for 20 epochs), or you can simply change the code to stop the training. Normally I train the model for at most about 300 epochs and then stop it manually.

Yeah, it'd be easier to add a maximum number of epochs to the training code.

Val-loss is not so important in this case, because the validation samples change every time (they are randomly cropped from a few big validation images).

The validation patches are indeed selected randomly. However, all the random number generators are initialized with a fixed seed at the beginning of training, as shown below:

def main():
    random_seed(train_args.seeds)   # fix all RNG seeds before anything else
    net = load_model(name=train_args.model, classes=train_args.nb_classes,
                     load_weights=False, skipp_layer=None).cuda()

    net, start_epoch = train_args.resume_train(net)   # resume from a checkpoint if one exists
    net.train()
    # ...

Since exactly the same patches are cropped for the validation-loss calculation, I believe the validation loss values computed at the end of each epoch are comparable. And even if the patches did change during the validation phase, they are still cropped from two known tiles (7 and 28), so the average loss over these two tiles should still decline during training.
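For reference, a seeding helper of this kind typically fixes the Python, NumPy, and PyTorch (CPU and CUDA) generators. A minimal sketch of what random_seed might look like, assuming a standard setup (the repo's actual implementation may differ):

import random
import numpy as np
import torch

def random_seed(seed):
    # Fix every RNG used during training so runs are reproducible,
    # including the random cropping of training/validation patches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optionally trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False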

Anyway, I think that if the overfitting problem is solved, you will be able to get even better results than the ones reported in the paper!

samleoqh commented 3 years ago

Hi, thank you for your valuable comments. I also agree with your point on the val loss and the overfitting issue; I did not pay much attention to it, although I often observed it before :-) There is a lot of room for improvement in this work. I'll rethink it, and I welcome any suggestions/discussions.