loss = nan..what's the problem?

peteryuX / arcface-tf2

ArcFace unofficial Implemented in Tensorflow 2.0+ (ResNet50, MobileNetV2). "ArcFace: Additive Angular Margin Loss for Deep Face Recognition" Published in CVPR 2019. With Colab.

MIT License


loss = nan..what's the problem? #9

Closed shiney5213 closed 4 years ago

shiney5213 commented 4 years ago

I am training a model with the ms1m dataset and an Asian celeb dataset, but loss = nan... The model is not trained at all. mode = 'fit' -> loss = nan; mode = 'eager_ft' -> loss = nan; mode = 'eager_fit' -> out-of-memory error. What's the problem? Please help me, and thank you... have a nice day.

peteryuX commented 4 years ago

Hi, @shiney5213. How did you prepare these two datasets?

shiney5213 commented 4 years ago

Hi @peteryuX

Downloaded the ms1m_align_112 dataset.
Downloaded the Asian celeb dataset from http://trillionpairs.deepglint.com/overview and cropped it with MTCNN (112 × 112).
Put the ms1m_align_112 and asian_align_112 datasets in one directory.
Made the tfrecord file with convert_train_binary_tfrecord.py.
Now I want to train with train.py...

Previously, I combined a small dataset with the ms1m dataset and trained in the same way, and training was successful. This time, however, training does not work. I don't know what the problem is...

Sorry for bothering you, and thank you.

peteryuX commented 4 years ago

That sounds weird~ In my experience, a nan loss generally shows up in two situations: 1. the input data contain some unexpected values; 2. the loss is divided by a near-zero value, which can make the gradients too large. You can trace which loss becomes nan first (the L2 regularization loss, the ArcFace loss, ...?) and crashes the training. I will try to figure it out when I am free. Please let me know if you find the problem before then. Thanks!
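The tracing idea above can be sketched roughly as follows. This is only an illustration in plain Python, not code from this repository: `first_non_finite`, `bad_labels`, and the loss names `reg_loss` and `arcface_loss` are hypothetical, and the numbers are made up. In a real training loop the values would come from converting each loss tensor to a float at every step.

```python
import math


def first_non_finite(named_losses):
    """Return the name of the first loss term that is NaN or Inf, or None."""
    for name, value in named_losses.items():
        if not math.isfinite(float(value)):
            return name
    return None


def bad_labels(labels, num_classes):
    """Return labels outside [0, num_classes), one kind of unexpected input."""
    return [y for y in labels if not 0 <= y < num_classes]


# Hypothetical per-step loss values (assumption: the total loss is a sum of
# a regularization term and an ArcFace term that can be logged separately).
steps = [
    {"reg_loss": 0.12, "arcface_loss": 23.5},
    {"reg_loss": 0.12, "arcface_loss": float("nan")},
]
for step, losses in enumerate(steps):
    culprit = first_non_finite(losses)
    if culprit is not None:
        print(f"step {step}: {culprit} became non-finite first")
        break

# Hypothetical input check: with 10 classes, a label of 10 is out of range.
print(bad_labels([0, 4, 7, 10], num_classes=10))
```

Logging each term separately shows whether the regularization loss or the classification loss breaks first, and the label check covers one possible form of "unexpected values" after merging two datasets. TensorFlow also provides `tf.debugging.check_numerics` and `tf.debugging.enable_check_numerics`, which raise an error at the first op that produces NaN or Inf, so they may be a more direct way to do the same tracing.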

shiney5213 commented 4 years ago

Thank you for your answer. I'll look into it myself first, and I'll ask for help if I have any further questions later. Have a nice day.