shamangary / FSA-Net

[CVPR19] FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image
Apache License 2.0

When training the model, the loss is nan. #44

Open CrossEntropy opened 4 years ago

CrossEntropy commented 4 years ago

Hi @shamangary! I got the following error while training the FSA_net_Var_Capsules model:

Epoch 1/90
6120/6120 [==============================] - 459s 75ms/step - loss: 10.4547 - val_loss: 7.6488

Epoch 00001: val_loss improved from inf to 7.64882, saving model to 300W_LP_checkpoints/weights.01-7.65.hdf5
Epoch 2/90
6120/6120 [==============================] - 425s 69ms/step - loss: 7.2023 - val_loss: 5.7376

Epoch 00002: val_loss improved from 7.64882 to 5.73757, saving model to 300W_LP_checkpoints/weights.02-5.74.hdf5
Epoch 3/90
6120/6120 [==============================] - 442s 72ms/step - loss: 6.0585 - val_loss: 5.1815

Epoch 00003: val_loss improved from 5.73757 to 5.18146, saving model to 300W_LP_checkpoints/weights.03-5.18.hdf5
Epoch 4/90
6120/6120 [==============================] - 431s 70ms/step - loss: nan - val_loss: nan

Epoch 00004: val_loss did not improve from 5.18146
Epoch 5/90
6120/6120 [==============================] - 425s 69ms/step - loss: nan - val_loss: nan

Epoch 00005: val_loss did not improve from 5.18146
Epoch 6/90
6120/6120 [==============================] - 424s 69ms/step - loss: nan - val_loss: nan

Epoch 00006: val_loss did not improve from 5.18146
Epoch 7/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

Epoch 00007: val_loss did not improve from 5.18146
Epoch 8/90
6120/6120 [==============================] - 421s 69ms/step - loss: nan - val_loss: nan

Epoch 00008: val_loss did not improve from 5.18146
Epoch 9/90
6120/6120 [==============================] - 423s 69ms/step - loss: nan - val_loss: nan

The same phenomenon also appears in a model I built myself; my model only replaces the ssr_G_model_build part. Thanks for your help!

CrossEntropy commented 4 years ago

When I use TensorFlow 2.0 and set the batch size to 128, the NaN still appears, but the model still works on faces, which is really surprising. ToT I suspect it may be a problem with how the score function is computed. As you described in your paper, there are three methods:

(1) variance (2) 1x1 convolution (3) uniform.

I think the variance method can reduce the number of parameters, so I chose it. Looking forward to your reply!
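
For reference, a minimal sketch of a parameter-free, variance-based scoring map in TensorFlow; the function name, shapes, and reduction axis are illustrative assumptions, not the repo's actual implementation:

import tensorflow as tf

def variance_score(feature_maps):
    # feature_maps: (batch, h, w, c) feature volume from the backbone.
    # Score each spatial location by the variance across channels;
    # unlike the 1x1-convolution variant, this adds no trainable parameters.
    mean = tf.reduce_mean(feature_maps, axis=-1, keepdims=True)
    var = tf.reduce_mean(tf.square(feature_maps - mean), axis=-1, keepdims=True)
    return var  # (batch, h, w, 1) score/attention map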

shamangary commented 4 years ago

Hello @CrossEntropy,

It's been a long time since I ran this repo. My suggestion is to use a smaller batch size, like 32 or 16, and to use lower versions of TensorFlow and Keras, since they have been updated recently.
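
A minimal sketch of the batch-size change, assuming a plain Keras fit call; the repo's training script wires the data and arguments up differently, so adapt as needed:

# Illustrative only: drop the batch size from 128 to 16 or 32 at the fit call.
model.fit(x_train, y_train,
          batch_size=16,
          epochs=90,
          validation_data=(x_val, y_val),
          callbacks=callbacks)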

shamangary commented 4 years ago

https://github.com/tensorflow/tensorflow/issues/3290 https://github.com/tensorflow/tensorflow/issues/8101 It seems that tf.nn.moments can return NaN. You could pick out the NaNs from the variance and put zeros back in; I assume this would solve the issue.
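
A minimal sketch of that workaround, assuming the variance comes from tf.nn.moments inside the scoring layer; the wrapper name is illustrative, not the repo's code:

import tensorflow as tf

def safe_moments(x, axes):
    # Compute mean and variance as usual
    # (in older TF 1.x the keyword is keep_dims instead of keepdims).
    mean, var = tf.nn.moments(x, axes=axes, keepdims=True)
    # Replace any NaN in the variance with zero before it is used as a score.
    var = tf.where(tf.math.is_nan(var), tf.zeros_like(var), var)
    return mean, var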