ramdrop / stun

Implementation for the paper: STUN: Self-Teaching Uncertainty Estimation for Place Recognition
BSD 3-Clause "New" or "Revised" License

A question about what's happening in training #8

Closed: shenyehui closed this issue 12 months ago

shenyehui commented 1 year ago

Great work! I have been using your code, and I noticed that the best results are achieved around the third or fourth epoch during training. Is this because you have some special settings for training?

shenyehui commented 1 year ago

How do I set a larger number of epochs when 60 epochs are not enough? Thank you for your answer.

shenyehui commented 1 year ago

I changed the line self.parser.add_argument('--nEpochs', type=int, default=100, help='number of epochs to train for'), but training still stops at 60 epochs.

ramdrop commented 1 year ago

Thanks.

I have been using your code, and I noticed that the best results are achieved around the third or fourth epoch during training. Is this because you have some special settings for training?

No, it depends on the dataset, so I use the validation set to locate the best epoch.

How do I set a larger number of epochs when 60 epochs are not enough? Thank you for your answer.

Use --nEpochs to pass a custom number of epochs.

I changed the line self.parser.add_argument('--nEpochs', type=int, default=100, help='number of epochs to train for'), but training still stops at 60 epochs.

When the performance has not improved for --patience epochs, the training will terminate.
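
For reference, a minimal sketch of this patience rule (`validate` and the variable names are placeholders, not the repo's exact code):

```python
# Patience-based early stopping: stop once the validation metric has not
# improved for `patience` consecutive epochs.
def train_with_patience(n_epochs, patience, validate):
    best_recall = float('-inf')
    stale = 0                              # epochs since the last improvement
    for epoch in range(n_epochs):
        recall = validate(epoch)           # e.g. recall@1 on the val set
        if recall > best_recall:
            best_recall, stale = recall, 0
        else:
            stale += 1
            if stale >= patience:          # no improvement for `patience` epochs
                break                      # training terminates early
    return best_recall
```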

shenyehui commented 1 year ago

Thank you. I changed the line self.parser.add_argument('--nEpochs', type=int, default=100, help='number of epochs to train for'), and performance improves at the 57th epoch, but training still stops at the 60th epoch.

ramdrop commented 1 year ago

What is your command?

shenyehui commented 1 year ago

What is your command?

My command: python main.py --phase=train_stu --resume=logs/tri_train_tea_0714_005535/ckpt_best.pth.tar

ramdrop commented 1 year ago

That is because --nEpochs is overridden by the checkpoint. You may want to do a hack by adding options.nEpochs = 100 after https://github.com/ramdrop/stun/blob/bda3537fcd3562c8f9f3e9a8e104789d960b48bd/main.py#L13
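
Something along these lines (a sketch using plain argparse in place of the repo's own option class; the key line is the reassignment after parsing and checkpoint restore):

```python
import argparse

# Sketch of the suggested hack: build the options namespace, then reassign
# nEpochs after any checkpoint restore may have overwritten it.
parser = argparse.ArgumentParser()
parser.add_argument('--nEpochs', type=int, default=100,
                    help='number of epochs to train for')
options, _ = parser.parse_known_args()

# ... checkpoint restore happens around here in the real main.py ...

options.nEpochs = 100   # hack: reassert the epoch budget so the ckpt value no longer wins
```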

shenyehui commented 1 year ago

That is because --nEpochs is overridden by the checkpoint. You may want to do a hack by adding options.nEpochs = 100 after https://github.com/ramdrop/stun/blob/bda3537fcd3562c8f9f3e9a8e104789d960b48bd/main.py#L13

Thanks for your patience. I'll try the method you suggested later on.

shenyehui commented 1 year ago

I'm sorry, I've encountered another issue. While training the student network, I noticed that the loss is printed as tensor(nan, device='cuda:0', grad_fn=...). Could this be because I changed self.whole_training_data_loader to self.training_data_loader?

ramdrop commented 1 year ago

Yes, self.whole_training_data_loader and self.training_data_loader cannot be used interchangeably.

shenyehui commented 1 year ago

Yes, self.whole_training_data_loader and self.training_data_loader cannot be used interchangeably.

Thank you. I see that you used the output of the teacher model as the target when training the student model, but I think jointly using the data from training the teacher network could improve the results, which is why this question arose.
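
For context, here is a hedged sketch of the self-teaching objective being discussed: the frozen teacher embedding serves as the regression target, and the student also predicts a per-dimension log-variance as its uncertainty (variable names are illustrative, not the repo's exact code):

```python
import torch

# Self-teaching loss sketch: penalize the squared distance to the teacher
# embedding, scaled by the student's predicted variance, plus a log-variance
# term so the student cannot trivially inflate its uncertainty.
def self_teaching_loss(mu_s, log_var_s, mu_t):
    sq_err = (mu_t.detach() - mu_s).pow(2)                  # (B, D) residuals
    nll = 0.5 * torch.exp(-log_var_s) * sq_err + 0.5 * log_var_s
    return nll.mean()

# Usage sketch with random stand-ins for student/teacher outputs.
mu_s = torch.randn(8, 256, requires_grad=True)
log_var_s = torch.zeros(8, 256, requires_grad=True)
mu_t = torch.randn(8, 256)
self_teaching_loss(mu_s, log_var_s, mu_t).backward()
```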