SorourMo opened this issue 3 years ago
Hi,
1) You can manually stop the training (Ctrl+C); I did so when I observed no further improvement in val acc / mean F1 for a while (e.g., 20 epochs), or you can simply change the code to stop the training (a rough sketch of such a check is included after this list). Normally I train the model for a maximum of about 300 epochs and then stop it manually.
2) This often happens (the val loss keeps going higher), but you can still continue training until the val acc / mIoU / F1-score has stopped increasing for a long time, and then stop. The val loss in this case is not so important, because the validation samples change every time (they are randomly cropped from a few big validation images).
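For what it's worth, here is a minimal sketch of that kind of patience-based stopping check; the names (`train_one_epoch`, `evaluate_f1`, `patience`, etc.) are illustrative and not the actual names used in train.py:

```python
def train_with_patience(train_one_epoch, evaluate_f1, max_epochs=300, patience=20):
    """Illustrative early-stopping loop: stop once the validation mean F1 has not
    improved for `patience` consecutive epochs, or after `max_epochs` at the latest.

    `train_one_epoch(epoch)` and `evaluate_f1(epoch)` stand in for the repo's own
    training and validation routines.
    """
    best_f1 = 0.0
    epochs_no_improve = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        val_f1 = evaluate_f1(epoch)
        if val_f1 > best_f1:
            best_f1, epochs_no_improve = val_f1, 0   # new best validation score
        else:
            epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print(f"No val F1 improvement for {patience} epochs; stopping at epoch {epoch}.")
            break
    return best_f1
```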
best.
Thank you for your explanation.
> You can manually stop the training (Ctrl+C); I did so when I observed no further improvement in val acc / mean F1 for a while (e.g., 20 epochs), or you can simply change the code to stop the training. Normally I train the model for a maximum of about 300 epochs and then stop it manually.
Yeah, it'd be easier to add a maximum number of epochs to the training code.
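Something as simple as this would do it (a sketch only; `run_one_epoch` and `MAX_EPOCHS` are hypothetical names, not configuration keys from the repo):

```python
MAX_EPOCHS = 300  # hard cap on epochs, instead of relying on the 10e8-iteration limit

def train(run_one_epoch, start_epoch=1, max_epochs=MAX_EPOCHS):
    """Stop unconditionally after `max_epochs`, whatever the iteration counter says."""
    for epoch in range(start_epoch, max_epochs + 1):
        run_one_epoch(epoch)
```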
> The val loss in this case is not so important, because the validation samples change every time (they are randomly cropped from a few big validation images).
The validation patches are selected randomly. However, we initialize all the random number generators with a fixed seed at the beginning of training, like below:
```python
def main():
    random_seed(train_args.seeds)   # fix the RNG seeds at the very start of training
    net = load_model(name=train_args.model, classes=train_args.nb_classes,
                     load_weights=False, skipp_layer=None).cuda()
    net, start_epoch = train_args.resume_train(net)
    net.train()
    ...
```
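The body of `random_seed` isn't shown above; I'm assuming it does something along these lines (seeding the Python, NumPy, and PyTorch generators), so treat this as my guess rather than the repo's actual implementation:

```python
import random
import numpy as np
import torch

def random_seed(seed):
    # Assumed behaviour: fix every RNG that could influence patch cropping,
    # so runs (and the validation crops) are reproducible for a given seed.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```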
Since exactly the same patches are cropped for the validation-loss calculation, I believe the validation-loss values computed at the end of each epoch are comparable. Even if the patches did change during the validation phase, they would still be cropped from two known tiles (7 and 28), and the average loss over these two tiles should decline during training.
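Just as an idea (not something the repo does): one could also crop a fixed set of validation patches once, before training starts, and reuse the same list every epoch, so the validation loss is computed on literally identical inputs. A rough sketch, where `big_tiles` stands for the (image, label) arrays of tiles 7 and 28:

```python
import random

def fix_val_patches(big_tiles, patch_size, n_patches, seed=0):
    """Crop `n_patches` random windows from the large validation tiles once and
    reuse them for every validation pass. `big_tiles` is a list of (image, label)
    NumPy-array pairs; this is a sketch, not the repo's code."""
    rng = random.Random(seed)
    patches = []
    for _ in range(n_patches):
        img, lbl = rng.choice(big_tiles)
        h, w = img.shape[:2]
        y = rng.randint(0, h - patch_size)
        x = rng.randint(0, w - patch_size)
        patches.append((img[y:y + patch_size, x:x + patch_size],
                        lbl[y:y + patch_size, x:x + patch_size]))
    return patches
```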
Anyway, I think that if the overfitting problem is solved, you will be able to get even better results than the ones reported in the paper!
Hi, thank you for your valuable comments. I also agree with your point on the val loss and the overfitting issue; I did not pay much attention to it, though I have often observed it before :-) There is a lot of room for improvement in this work. I'll rethink it and welcome any suggestions/discussions.
Hi, thank you for sharing your code on GitHub, and congratulations on your TGRS 2020 paper (it's a great piece of work). I have two questions about the training phase of the DDCM model. I would really appreciate it if you could help me find their answers:
1- Can anyone please tell me when the training process ends in `train.py`? I have read both the paper and the code and have not been able to find a hint on this matter. I am referring specifically to training on the Vaihingen dataset. There appears to be a continuous reduction in the `lr`, but I am unsure when this reduction stops, since there is no maximum number of `epoch`s, minimum `lr`, etc. defined in the configurations. There is only one restriction defined: the maximum number of iterations is `10e8`. Does this mean that training should continue for `10e8` iterations? Since each iteration takes about 1 minute on my computer (`batch size = 5`), `10e8` iterations would take forever!

2- I ran your code with the Vaihingen dataset for approximately 16 epochs (16k iterations), and the following are the training and validation loss trends.
As can be seen in the figure, `main_loss` decreases (with a few abrupt steps) from 1.5 to 0.157 (the black box in the middle of the figure belongs to `main_loss` at 16k iterations). However, `val_loss` keeps fluctuating between 0.40 and 0.47 instead of decreasing; e.g., at `epoch=16`, `main_loss` equals `0.157`, while `val_loss` equals `0.47`. So, according to the figure, the training loss decreases during training, but the validation loss does not, which means there is a large gap between the training and validation losses (overfitting). Have you observed this behavior in your training runs? Do you have any specific suggestions, solutions, or comments to fix the overfitting problem?
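For completeness, the gap I'm describing can be tracked numerically as well as visually; here is a small sketch (a hypothetical helper, fed with whatever per-epoch losses the training script logs):

```python
def report_generalization_gap(train_losses, val_losses):
    """Print the per-epoch gap between validation and training loss.

    `train_losses` and `val_losses` are lists of per-epoch averages taken from
    the training logs; a gap that keeps widening while the training loss keeps
    falling is the overfitting pattern described above.
    """
    for epoch, (tr, va) in enumerate(zip(train_losses, val_losses), start=1):
        print(f"epoch {epoch:3d}: train={tr:.3f}  val={va:.3f}  gap={va - tr:.3f}")
```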