NAN in training process

taohan10200 / IIM

PyTorch implementations of the paper: "Learning Independent Instance Maps for Crowd Localization"

MIT License


NAN in training process #7

Open Nikumata opened 3 years ago

Nikumata commented 3 years ago

Hi, when I train the network on the NWPU dataset, the results become NaN in all subsequent iterations. I set the training batch size to 6 to avoid running out of memory.

taohan10200 commented 3 years ago

You can lower the learning rate of the threshold encoder in config.py, for example to 1e-7.

if __C.OPT == 'Adam':
    __C.LR_BASE_NET = 1e-5  # learning rate
    __C.LR_BM_NET = 1e-7  # learning rate (previously 1e-6)

Thanks for your attention!
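For readers wondering how two separate learning rates are typically consumed, here is a minimal sketch. It assumes that LR_BASE_NET and LR_BM_NET are meant for two different sub-networks; how the repository actually wires them up is not shown in this thread. The helper `build_param_groups` and the placeholder parameter names are hypothetical. The list-of-dicts layout is the per-group format that PyTorch optimizers such as `torch.optim.Adam` accept.

```python
from types import SimpleNamespace


def build_param_groups(cfg, base_params, bm_params):
    """Return one optimizer parameter group per sub-network.

    Each group carries its own learning rate, so the second group can be
    trained far more slowly than the first.
    """
    return [
        {"params": list(base_params), "lr": cfg.LR_BASE_NET},
        {"params": list(bm_params), "lr": cfg.LR_BM_NET},
    ]


# Stand-in for the config object; the values are the ones suggested above.
cfg = SimpleNamespace(OPT="Adam", LR_BASE_NET=1e-5, LR_BM_NET=1e-7)

# Placeholder strings stand in for real parameter tensors (assumption).
groups = build_param_groups(cfg, base_params=["w_base"], bm_params=["w_bm"])
for group in groups:
    print(group["params"], group["lr"])
```

With real modules, the two parameter lists would come from each sub-network's `parameters()` and the resulting list would be passed to the optimizer constructor.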

Nikumata commented 3 years ago

Adjusting the learning rate does work! Thanks for your reply.

Nikumata commented 3 years ago

Hi @taohan10200, after lowering the learning rate, NaN still appeared after 87 iterations. I saved the model and weights every 20 iterations, and I was surprised to find that, when resuming from the checkpoint saved at iteration 80, the model trains normally without NaN. Do you have any suggestions?
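The save-and-resume workflow described above can be automated. The following is a minimal sketch, not code from the repository: `train_with_rollback`, `step_fn`, and the toy step function are hypothetical stand-ins, and the state is a plain dict rather than real model weights. It checkpoints every 20 iterations and, when the loss is not finite, restores the most recent checkpoint and continues from there.

```python
import copy
import math


def train_with_rollback(step_fn, state, num_iters, save_every=20, max_rollbacks=3):
    """Run step_fn for num_iters iterations, rolling back on a non-finite loss.

    step_fn(state, it) -> (new_state, loss) stands in for one training step.
    Returns the final state and the iteration indices where a rollback occurred.
    """
    ckpt_iter, ckpt_state = 0, copy.deepcopy(state)
    rollbacks = []
    it = 0
    while it < num_iters:
        new_state, loss = step_fn(state, it)
        if not math.isfinite(loss):
            if len(rollbacks) >= max_rollbacks:
                raise RuntimeError("loss is still non-finite after repeated rollbacks")
            rollbacks.append(it)
            # Discard the bad step and resume from the last saved checkpoint.
            it, state = ckpt_iter, copy.deepcopy(ckpt_state)
            continue
        state = new_state
        it += 1
        if it % save_every == 0:
            ckpt_iter, ckpt_state = it, copy.deepcopy(state)
    return state, rollbacks


def make_toy_step(fail_at=87):
    """Toy step that returns NaN once, after fail_at completed iterations."""
    seen = {"failed": False}

    def step(state, it):
        if it == fail_at and not seen["failed"]:
            seen["failed"] = True
            return state, float("nan")
        return {"w": state["w"] + 1}, 1.0 / (it + 1)

    return step


final_state, rollbacks = train_with_rollback(make_toy_step(), {"w": 0}, num_iters=100)
print(final_state, rollbacks)
```

The `max_rollbacks` limit is a design choice for the sketch: if the NaN is deterministic rather than dependent on the data order, rolling back alone will not help and the loop should stop instead of retrying forever.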

By the way, there is no read_pred_and_gt module in misc.utils.py, so vis4val.py cannot run properly. Could you please commit that code? Thanks.

taohan10200 commented 3 years ago

In our training, NaN would sometimes appear even after we lowered the threshold. When that happens, we usually lower the threshold again to avoid the problem. We recommend using the experimental configuration we provide in the saved_exp_results folder. In general, it may be the inverse gradients that make the module's training unstable. We have tried to solve this problem by optimizing the threshold learner, but that work is still being tested, and we will release the new solution in the future.
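The "lower it again when NaN reappears" advice can be expressed as a simple retry loop. This is a sketch under a stated assumption: that the quantity being lowered is the learning rate of the threshold module, as in the earlier config suggestion. The function `run_with_lr_backoff` and the toy `toy_train` are hypothetical and only illustrate the control flow.

```python
import math


def run_with_lr_backoff(train_fn, lr, factor=0.1, max_retries=3):
    """Call train_fn(lr); while the loss is non-finite, retry with lr * factor.

    Returns the learning rate that worked, its loss, and every rate tried.
    """
    tried = []
    for _ in range(max_retries + 1):
        tried.append(lr)
        loss = train_fn(lr)
        if math.isfinite(loss):
            return lr, loss, tried
        lr *= factor  # lower the learning rate and try again
    raise RuntimeError("loss stayed non-finite for every learning rate tried")


def toy_train(lr):
    """Toy stand-in for a training run: diverges above an arbitrary limit."""
    return float("nan") if lr > 5e-7 else 0.5


good_lr, loss, tried = run_with_lr_backoff(toy_train, lr=1e-6)
print(good_lr, loss, tried)
```

In this toy run the first rate produces NaN and the second, ten times smaller, succeeds. In real training each attempt would restart from the same initial weights or checkpoint.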

We have updated the read_pred_and_gt module in misc.utils.py.

Thanks~

henbucuoshanghai commented 3 years ago

What is the path of the saved model?