jiang-du / BlazePose-tensorflow

A third-party TensorFlow implementation of the paper "BlazePose: On-device Real-time Body Pose tracking".
Apache License 2.0

train not converge #7

Open chuck0518 opened 3 years ago

chuck0518 commented 3 years ago

Hello, I trained with train.py but it does not converge; the results are wrong. My environment is Ubuntu 16.04 with Python 3.6.2 and TensorFlow 2.4. Could you give me some advice?

Epoch 000: Train Loss: 1.061, Accuracy: 30.17884%
2020-11-19 17:06:08
Epoch 001: Train Loss: 0.715, Accuracy: 23.60993%
2020-11-19 17:07:15
Epoch 002: Train Loss: 0.621, Accuracy: 20.01843%
2020-11-19 17:08:42
Epoch 003: Train Loss: 0.481, Accuracy: 13.48590%
2020-11-19 17:09:47
Epoch 004: Train Loss: 0.266, Accuracy: 5.80015%
Epoch 004, Validation accuracy: 2.96213%
2020-11-19 17:11:15
Epoch 005: Train Loss: 0.164, Accuracy: 1.45889%
2020-11-19 17:12:21
Epoch 006: Train Loss: 0.123, Accuracy: 1.06765%
2020-11-19 17:13:27
Epoch 007: Train Loss: 0.091, Accuracy: 1.22532%
2020-11-19 17:14:33
Epoch 008: Train Loss: 0.089, Accuracy: 1.19700%
2020-11-19 17:15:39
Epoch 009: Train Loss: 0.078, Accuracy: 0.89985%
Epoch 009, Validation accuracy: 0.81636%
2020-11-19 17:17:06
Epoch 010: Train Loss: 0.073, Accuracy: 0.73148%
2020-11-19 17:18:12
Epoch 011: Train Loss: 0.071, Accuracy: 0.69256%
2020-11-19 17:19:18
Epoch 012: Train Loss: 0.068, Accuracy: 0.71409%
2020-11-19 17:20:24
Epoch 013: Train Loss: 0.065, Accuracy: 0.70186%
2020-11-19 17:21:30
Epoch 014: Train Loss: 0.064, Accuracy: 0.67790%
Epoch 014, Validation accuracy: 0.67563%

jiang-du commented 3 years ago

From my point of view, BlazePose is a tiny network, so it costs little to train for more epochs. Since the initialization is randomized, if the starting point is not ideal, you can terminate with Ctrl+C and run it again. The following is part of my result. The platform is Ubuntu 20.10, CUDA 11.1, cuDNN 8.0, Python 3.7.9, TensorFlow 2.5 nightly (compiled from source to support RTX 3090 GPUs).

Initial Validation accuracy: 24.05149%
2020-11-23 11:14:05
Epoch 000: Train Loss: 0.640, Accuracy: 20.81211%
2020-11-23 11:14:11
Epoch 001: Train Loss: 0.513, Accuracy: 15.54497%
2020-11-23 11:14:16
Epoch 002: Train Loss: 0.386, Accuracy: 11.05893%
2020-11-23 11:14:21
Epoch 003: Train Loss: 0.243, Accuracy: 5.45178%
2020-11-23 11:14:27
Epoch 004: Train Loss: 0.135, Accuracy: 1.36844%
Epoch 004, Validation accuracy: 0.92297%
2020-11-23 11:14:35
Epoch 005: Train Loss: 0.103, Accuracy: 0.85607%
2020-11-23 11:14:40
Epoch 006: Train Loss: 0.083, Accuracy: 0.86038%
2020-11-23 11:14:46
Epoch 007: Train Loss: 0.075, Accuracy: 0.77044%
2020-11-23 11:14:51
Epoch 008: Train Loss: 0.069, Accuracy: 0.63809%
2020-11-23 11:14:56
Epoch 009: Train Loss: 0.065, Accuracy: 0.57334%
Epoch 009, Validation accuracy: 0.57323%
2020-11-23 11:15:05
Epoch 010: Train Loss: 0.061, Accuracy: 0.53838%
2020-11-23 11:15:10
Epoch 011: Train Loss: 0.058, Accuracy: 0.49956%
2020-11-23 11:15:16
Epoch 012: Train Loss: 0.055, Accuracy: 0.48502%
2020-11-23 11:15:21
Epoch 013: Train Loss: 0.053, Accuracy: 0.45574%
2020-11-23 11:15:26
Epoch 014: Train Loss: 0.050, Accuracy: 0.43685%
Epoch 014, Validation accuracy: 0.41884%
2020-11-23 11:15:35
Epoch 015: Train Loss: 0.047, Accuracy: 0.40925%
2020-11-23 11:15:40
Epoch 016: Train Loss: 0.044, Accuracy: 0.37725%
......
2020-11-23 11:24:03
Epoch 098: Train Loss: 0.028, Accuracy: 0.27661%
2020-11-23 11:24:08
Epoch 099: Train Loss: 0.028, Accuracy: 0.27590%
Epoch 099, Validation accuracy: 0.27829%
......
2020-11-23 11:28:36
Epoch 143: Train Loss: 0.026, Accuracy: 0.26574%
2020-11-23 11:28:41
Epoch 144: Train Loss: 0.026, Accuracy: 0.26553%
Epoch 144, Validation accuracy: 0.26840%
......

It may take 300+ epochs to converge.
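To make that switch point easier to spot across hundreds of epochs, the validation lines of such a log can be scanned with a small helper. This is a sketch in plain Python; `best_validation_epoch` is a hypothetical name, and note that in these logs the "accuracy" behaves like an error metric (lower is better), so we track the minimum:

```python
import re

def best_validation_epoch(log_lines):
    """Return (epoch, accuracy) with the lowest validation accuracy seen.

    In this training log, "accuracy" behaves like an error metric
    (lower is better), so the minimum is the best checkpoint.
    """
    pattern = re.compile(r"Epoch (\d+), Validation accuracy: ([\d.]+)%")
    best = None
    for line in log_lines:
        m = pattern.search(line)
        if m:
            epoch, acc = int(m.group(1)), float(m.group(2))
            if best is None or acc < best[1]:
                best = (epoch, acc)
    return best

log = [
    "Epoch 004, Validation accuracy: 0.92297%",
    "Epoch 009, Validation accuracy: 0.57323%",
    "Epoch 014, Validation accuracy: 0.41884%",
]
print(best_validation_epoch(log))  # -> (14, 0.41884)
```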

Once you see that the training loss and accuracy keep dropping but the validation accuracy starts to increase, modify config.py and set train_mode to fine-tune. Then run the training code again from where you just stopped.
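For reference, the change might look like the fragment below. The option names train_mode and continue_train appear in this thread, but the concrete values and their meanings here are assumptions, not the repository's actual defaults:

```python
# config.py -- hypothetical fragment; the option names appear in this
# thread, but the values below are only placeholders.
train_mode = 1        # assumed: 0 = pre-train, 1 = fine-tune
continue_train = 434  # assumed: epoch of the pre-train checkpoint to resume from
```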

chuck0518 commented 3 years ago

Hello, as you said, when the training loss and accuracy dropped but the validation accuracy increased, I modified train_mode = 0 and continue_train = 44, then ran the training code. But the result looks wrong; the log is like this:

2020-11-23 17:38:16  Start train.
WARNING:tensorflow:Layer blaze_pose is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call tf.keras.backend.set_floatx('float64'). To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Initial Validation accuracy: 1429626.17188%
2020-11-23 17:39:42
Epoch 000: Train Loss: 6532.363, Accuracy: 352504.88281%
2020-11-23 17:40:46
Epoch 001: Train Loss: 1922.987, Accuracy: 178801.91650%
2020-11-23 17:41:50
Epoch 002: Train Loss: 1423.902, Accuracy: 139074.48730%

jiang-du commented 3 years ago

Don't worry. It works well.

Take my training case as an example.

For pre-training, the optimal validation accuracy was achieved here:

Epoch 432: Train Loss: 0.022, Accuracy: 0.22764%
2020-11-23 21:20:48
Epoch 433: Train Loss: 0.022, Accuracy: 0.22659%
2020-11-23 21:20:53
Epoch 434: Train Loss: 0.022, Accuracy: 0.22670%
Epoch 434, Validation accuracy: 0.25374%

For fine-tune, it looks like this:

2020-11-23 21:49:27  Start train.
2020-11-23 21:49:28.707345: I tensorflow/stream_executor/platform/default/dso_loader.cc:49]
Successfully opened dynamic library libcudnn.so.8
2020-11-23 21:49:30.408507: I tensorflow/stream_executor/cuda/cuda_dnn.cc:344]
Loaded cuDNN version 8005
2020-11-23 21:49:31.083267: I tensorflow/stream_executor/platform/default/dso_loader.cc:49]
Successfully opened dynamic library libcublas.so.11
2020-11-23 21:49:31.682031: I tensorflow/stream_executor/platform/default/dso_loader.cc:49]
Successfully opened dynamic library libcublasLt.so.11
2020-11-23 21:49:31.718689: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838]
TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Initial Validation accuracy: 3216005.07812%
2020-11-23 21:49:40
Epoch 000: Train Loss: 12113.500, Accuracy: 501250.29297%
2020-11-23 21:49:44
Epoch 001: Train Loss: 3398.506, Accuracy: 272764.62402%
2020-11-23 21:49:48
Epoch 002: Train Loss: 1716.706, Accuracy: 163933.25195%
2020-11-23 21:49:51
Epoch 003: Train Loss: 1555.361, Accuracy: 153729.83398%
2020-11-23 21:49:55
Epoch 004: Train Loss: 1419.106, Accuracy: 136227.24609%
Epoch 004, Validation accuracy: 132222.63184%
2020-11-23 21:50:01
Epoch 005: Train Loss: 1186.121, Accuracy: 114443.84766%
2020-11-23 21:50:05
Epoch 006: Train Loss: 1085.480, Accuracy: 108097.48535%
2020-11-23 21:50:09
Epoch 007: Train Loss: 1084.292, Accuracy: 107970.82520%
2020-11-23 21:50:13
Epoch 008: Train Loss: 1043.337, Accuracy: 102373.27271%
2020-11-23 21:50:17
Epoch 009: Train Loss: 978.664, Accuracy: 97105.95093%
Epoch 009, Validation accuracy: 104547.15576%
2020-11-23 21:50:23
Epoch 010: Train Loss: 969.211, Accuracy: 96653.14331%
2020-11-23 21:50:27
Epoch 011: Train Loss: 953.235, Accuracy: 94580.92651%
2020-11-23 21:50:31
Epoch 012: Train Loss: 927.801, Accuracy: 92139.25781%
2020-11-23 21:50:35
Epoch 013: Train Loss: 913.414, Accuracy: 90818.38379%
2020-11-23 21:50:39
Epoch 014: Train Loss: 898.986, Accuracy: 89384.46655%
Epoch 014, Validation accuracy: 98831.06079%

......

(best validation accuracy)
2020-11-23 22:03:50
Epoch 199: Train Loss: 247.112, Accuracy: 24495.69397%
Epoch 199, Validation accuracy: 84674.26147%

......

(overfitting)
Epoch 497: Train Loss: 22.877, Accuracy: 2242.05990%
2020-11-23 22:25:35
Epoch 498: Train Loss: 22.320, Accuracy: 2217.19151%
2020-11-23 22:25:39
Epoch 499: Train Loss: 19.981, Accuracy: 1947.34421%
Epoch 499, Validation accuracy: 91264.93530%

It is obvious that the optimization keeps improving, although the loss values look huge. (This may be because I did not normalize the loss by the image size.)
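If the fine-tune loss is indeed computed on raw pixel coordinates, dividing the coordinates by the image size before computing the error would bring the numbers back to a familiar scale. A minimal sketch in plain Python (hypothetical helper names, not the repository's actual loss code; a 256x256 input is assumed):

```python
def mse(pred, target):
    """Mean squared error over flat coordinate lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def normalized_mse(pred_px, target_px, image_size=256):
    """Scale pixel coordinates to [0, 1] before computing MSE,
    so the loss magnitude no longer depends on the image size."""
    pred = [p / image_size for p in pred_px]
    target = [t / image_size for t in target_px]
    return mse(pred, target)

# A 10-pixel error per coordinate on a 256x256 image:
pred = [100.0, 150.0]
target = [110.0, 140.0]
print(mse(pred, target))             # raw-pixel loss: 100.0
print(normalized_mse(pred, target))  # normalized loss: ~0.0015
```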

It is not a good idea to let the fine-tune loss become too small, since that causes overfitting.
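One common way to guard against that is simple early stopping on the validation metric: keep the checkpoint with the best validation value and stop once it has not improved for several validation rounds. A sketch (hypothetical helper, not part of this repository; "better" here means a lower value, as in the logs above):

```python
class EarlyStopping:
    """Stop training when the validation metric (lower is better)
    has not improved for `patience` consecutive checks."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def should_stop(self, val_metric):
        if val_metric < self.best:
            self.best = val_metric   # improvement: remember it, reset counter
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1     # no improvement this round
        return self.bad_rounds >= self.patience

stopper = EarlyStopping(patience=2)
for val in [0.9, 0.5, 0.6, 0.7]:
    if stopper.should_stop(val):
        break  # stops at 0.7: two rounds without beating 0.5
```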

chuck0518 commented 3 years ago

Thank you for your detailed explanation. I have two questions. First, how can we judge whether the model is overfitting? Through the value of the loss, or something else? Second, LSP is a small dataset, and the network structure is also very small. Can the training reach the best performance? Why not train a pre-trained model on a dataset like MS COCO?

jiang-du commented 3 years ago

> Thank you for your detailed explanation. I have two questions. First, how can we judge whether the model is overfitting? Through the value of the loss, or something else? Second, LSP is a small dataset, and the network structure is also very small. Can the training reach the best performance? Why not train a pre-trained model on a dataset like MS COCO?

  1. Yes. You can compare the training and validation accuracy during training.

  2. The performance of this network is of course not as high as that of large networks.

     My intention in employing BlazePose was its light weight: we can easily put this network on an embedded device and run it in real time with low power cost.

     I haven't tried this network on MPII or MS COCO. If you would like to train on such a dataset, please let me know how it works.