open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Why does training ap10k with RTMpose suddenly become less effective? #2761

Open cf2xh123 opened 11 months ago

cf2xh123 commented 11 months ago

Prerequisite

Environment

System environment:
sys.platform: win32
Python: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 21
GPU 0: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2
NVCC: Cuda compilation tools, release 12.2, V12.2.140
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.37.32822 for x64
GCC: n/a
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 21
Distributed launcher: none
Distributed training: False
GPU number: 1

Reproduces the problem - code sample

The configuration file is exactly the same as configs/animal_2d_keypoint/rtmpose/ap10k/rtmpose-m_8xb64-210e_ap10k-256x256.py

Reproduces the problem - command or script

python tools/train.py configs/animal_2d_keypoint/rtmpose/ap10k/rtmpose-m_8xb64-210e_ap10k-256x256.py

Reproduces the problem - error message

2023/10/18 00:20:48 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
2023/10/18 00:20:48 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
2023/10/18 00:20:48 - mmengine - INFO - Checkpoints will be saved to D:\PythonProject\mmpose\work_dirs\rtmpose-m_8xb64-500e_ap10k-256x256.
2023/10/18 00:21:40 - mmengine - INFO - Epoch(train) [1][ 50/143] base_lr: 1.962342e-04 lr: 1.962342e-04 eta: 20:41:20 time: 1.042419 data_time: 0.752816 memory: 5710 loss: 0.548346 loss_kpt: 0.548346 acc_pose: 0.059021
2023/10/18 00:21:52 - mmengine - INFO - Epoch(train) [1][100/143] base_lr: 3.964324e-04 lr: 3.964324e-04 eta: 12:38:08 time: 0.231760 data_time: 0.016910 memory: 5710 loss: 0.478946 loss_kpt: 0.478946 acc_pose: 0.161422
2023/10/18 00:22:02 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:22:14 - mmengine - INFO - Epoch(train) [2][ 50/143] base_lr: 7.688011e-04 lr: 7.688011e-04 eta: 8:50:37 time: 0.250789 data_time: 0.033231 memory: 5710 loss: 0.424151 loss_kpt: 0.424151 acc_pose: 0.222868
2023/10/18 00:22:26 - mmengine - INFO - Epoch(train) [2][100/143] base_lr: 9.689993e-04 lr: 9.689993e-04 eta: 7:57:49 time: 0.231980 data_time: 0.015622 memory: 5710 loss: 0.409112 loss_kpt: 0.409112 acc_pose: 0.250407
2023/10/18 00:22:36 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:22:48 - mmengine - INFO - Epoch(train) [3][ 50/143] base_lr: 1.341368e-03 lr: 1.341368e-03 eta: 7:04:25 time: 0.252031 data_time: 0.032206 memory: 5710 loss: 0.402685 loss_kpt: 0.402685 acc_pose: 0.273581
2023/10/18 00:23:00 - mmengine - INFO - Epoch(train) [3][100/143] base_lr: 1.541566e-03 lr: 1.541566e-03 eta: 6:44:42 time: 0.231379 data_time: 0.015422 memory: 5710 loss: 0.400691 loss_kpt: 0.400691 acc_pose: 0.346967
2023/10/18 00:23:10 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:23:23 - mmengine - INFO - Epoch(train) [4][ 50/143] base_lr: 1.913935e-03 lr: 1.913935e-03 eta: 6:21:34 time: 0.253953 data_time: 0.034518 memory: 5710 loss: 0.397241 loss_kpt: 0.397241 acc_pose: 0.327759
2023/10/18 00:23:34 - mmengine - INFO - Epoch(train) [4][100/143] base_lr: 2.114133e-03 lr: 2.114133e-03 eta: 6:11:20 time: 0.233193 data_time: 0.017014 memory: 5710 loss: 0.396639 loss_kpt: 0.396639 acc_pose: 0.240420
2023/10/18 00:23:44 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:23:57 - mmengine - INFO - Epoch(train) [5][ 50/143] base_lr: 2.486502e-03 lr: 2.486502e-03 eta: 5:58:11 time: 0.254400 data_time: 0.036682 memory: 5710 loss: 0.389717 loss_kpt: 0.389717 acc_pose: 0.321183
2023/10/18 00:24:08 - mmengine - INFO - Epoch(train) [5][100/143] base_lr: 2.686700e-03 lr: 2.686700e-03 eta: 5:51:39 time: 0.231737 data_time: 0.015455 memory: 5710 loss: 0.396755 loss_kpt: 0.396755 acc_pose: 0.319926
2023/10/18 00:24:18 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:24:31 - mmengine - INFO - Epoch(train) [6][ 50/143] base_lr: 3.059068e-03 lr: 3.059068e-03 eta: 5:43:17 time: 0.252813 data_time: 0.035172 memory: 5710 loss: 0.392658 loss_kpt: 0.392658 acc_pose: 0.361447
2023/10/18 00:24:43 - mmengine - INFO - Epoch(train) [6][100/143] base_lr: 3.259267e-03 lr: 3.259267e-03 eta: 5:38:50 time: 0.232870 data_time: 0.017127 memory: 5710 loss: 0.404800 loss_kpt: 0.404800 acc_pose: 0.385464
2023/10/18 00:24:52 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:25:05 - mmengine - INFO - Epoch(train) [7][ 50/143] base_lr: 3.631635e-03 lr: 3.631635e-03 eta: 5:32:35 time: 0.249766 data_time: 0.029781 memory: 5710 loss: 0.399166 loss_kpt: 0.399166 acc_pose: 0.361256
2023/10/18 00:25:16 - mmengine - INFO - Epoch(train) [7][100/143] base_lr: 3.831834e-03 lr: 3.831834e-03 eta: 5:29:14 time: 0.232104 data_time: 0.013133 memory: 5710 loss: 0.396209 loss_kpt: 0.396209 acc_pose: 0.275891
2023/10/18 00:25:26 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:25:26 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:25:39 - mmengine - INFO - Epoch(train) [8][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:25:00 time: 0.256945 data_time: 0.035614 memory: 5710 loss: 0.399316 loss_kpt: 0.399316 acc_pose: 0.360232
2023/10/18 00:25:51 - mmengine - INFO - Epoch(train) [8][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:22:30 time: 0.233960 data_time: 0.014754 memory: 5710 loss: 0.390495 loss_kpt: 0.390495 acc_pose: 0.309002
2023/10/18 00:26:01 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:26:13 - mmengine - INFO - Epoch(train) [9][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:19:01 time: 0.249891 data_time: 0.031512 memory: 5710 loss: 0.393994 loss_kpt: 0.393994 acc_pose: 0.353903
2023/10/18 00:26:25 - mmengine - INFO - Epoch(train) [9][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:16:55 time: 0.232631 data_time: 0.014777 memory: 5710 loss: 0.388632 loss_kpt: 0.388632 acc_pose: 0.315286
2023/10/18 00:26:35 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:26:47 - mmengine - INFO - Epoch(train) [10][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:13:56 time: 0.249112 data_time: 0.029336 memory: 5710 loss: 0.381478 loss_kpt: 0.381478 acc_pose: 0.381321
2023/10/18 00:26:59 - mmengine - INFO - Epoch(train) [10][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:12:08 time: 0.231064 data_time: 0.014480 memory: 5710 loss: 12.005985 loss_kpt: 12.005985 acc_pose: 0.004994
2023/10/18 00:27:08 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:27:08 - mmengine - INFO - Saving checkpoint at 10 epochs
2023/10/18 00:27:42 - mmengine - INFO - Evaluating CocoMetric...
2023/10/18 00:27:43 - mmengine - INFO - Epoch(val) [10][40/40] coco/AP: 0.000000 coco/AP .5: 0.000000 coco/AP .75: 0.000000 coco/AP (M): 0.000000 coco/AP (L): 0.000000 coco/AR: 0.000000 coco/AR .5: 0.000000 coco/AR .75: 0.000000 coco/AR (M): 0.000000 coco/AR (L): 0.000000 data_time: 0.737454 time: 0.805291
2023/10/18 00:27:43 - mmengine - INFO - The best checkpoint with 0.0000 coco/AP at 10 epoch is saved to best_coco_AP_epoch_10.pth.
2023/10/18 00:27:58 - mmengine - INFO - Epoch(train) [11][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:10:29 time: 0.270464 data_time: 0.036262 memory: 5710 loss: 0.556856 loss_kpt: 0.556856 acc_pose: 0.011304
2023/10/18 00:28:10 - mmengine - INFO - Epoch(train) [11][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:09:16 time: 0.239675 data_time: 0.016348 memory: 5710 loss: 0.560054 loss_kpt: 0.560054 acc_pose: 0.008085
2023/10/18 00:28:21 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:28:34 - mmengine - INFO - Epoch(train) [12][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:08:11 time: 0.265447 data_time: 0.034719 memory: 5710 loss: 0.544330 loss_kpt: 0.544330 acc_pose: 0.037530
2023/10/18 00:28:46 - mmengine - INFO - Epoch(train) [12][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:07:02 time: 0.237683 data_time: 0.015697 memory: 5710 loss: 0.540440 loss_kpt: 0.540440 acc_pose: 0.033283
2023/10/18 00:28:56 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:29:09 - mmengine - INFO - Epoch(train) [13][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:05:40 time: 0.260813 data_time: 0.035876 memory: 5710 loss: 0.536646 loss_kpt: 0.536646 acc_pose: 0.027879
2023/10/18 00:29:21 - mmengine - INFO - Epoch(train) [13][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:04:39 time: 0.237838 data_time: 0.015611 memory: 5710 loss: 0.534968 loss_kpt: 0.534968 acc_pose: 0.035510
2023/10/18 00:29:31 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:29:44 - mmengine - INFO - Epoch(train) [14][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:03:26 time: 0.257158 data_time: 0.034105 memory: 5710 loss: 0.533824 loss_kpt: 0.533824 acc_pose: 0.036428
2023/10/18 00:29:56 - mmengine - INFO - Epoch(train) [14][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:02:35 time: 0.239957 data_time: 0.014287 memory: 5710 loss: 0.536512 loss_kpt: 0.536512 acc_pose: 0.029576
2023/10/18 00:30:06 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:30:06 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:30:19 - mmengine - INFO - Epoch(train) [15][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:01:13 time: 0.250323 data_time: 0.033040 memory: 5710 loss: 0.527022 loss_kpt: 0.527022 acc_pose: 0.027368
2023/10/18 00:30:30 - mmengine - INFO - Epoch(train) [15][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 5:00:13 time: 0.232205 data_time: 0.015000 memory: 5710 loss: 0.539165 loss_kpt: 0.539165 acc_pose: 0.033150
2023/10/18 00:30:40 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:30:53 - mmengine - INFO - Epoch(train) [16][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:58:54 time: 0.251053 data_time: 0.033881 memory: 5710 loss: 0.530794 loss_kpt: 0.530794 acc_pose: 0.030388
2023/10/18 00:31:04 - mmengine - INFO - Epoch(train) [16][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:58:01 time: 0.232956 data_time: 0.015372 memory: 5710 loss: 0.538113 loss_kpt: 0.538113 acc_pose: 0.033432
2023/10/18 00:31:14 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:31:27 - mmengine - INFO - Epoch(train) [17][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:56:51 time: 0.251190 data_time: 0.034545 memory: 5710 loss: 0.536650 loss_kpt: 0.536650 acc_pose: 0.037894
2023/10/18 00:31:38 - mmengine - INFO - Epoch(train) [17][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:56:01 time: 0.232282 data_time: 0.015212 memory: 5710 loss: 0.535692 loss_kpt: 0.535692 acc_pose: 0.033107
2023/10/18 00:31:48 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:32:01 - mmengine - INFO - Epoch(train) [18][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:54:54 time: 0.250212 data_time: 0.033074 memory: 5710 loss: 0.537177 loss_kpt: 0.537177 acc_pose: 0.039232
2023/10/18 00:32:12 - mmengine - INFO - Epoch(train) [18][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:54:10 time: 0.233600 data_time: 0.017417 memory: 5710 loss: 0.525290 loss_kpt: 0.525290 acc_pose: 0.037828
2023/10/18 00:32:22 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:32:35 - mmengine - INFO - Epoch(train) [19][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:53:13 time: 0.251450 data_time: 0.033854 memory: 5710 loss: 0.530711 loss_kpt: 0.530711 acc_pose: 0.030636
2023/10/18 00:32:46 - mmengine - INFO - Epoch(train) [19][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:52:31 time: 0.232551 data_time: 0.015647 memory: 5710 loss: 0.532736 loss_kpt: 0.532736 acc_pose: 0.025157
2023/10/18 00:32:56 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:33:09 - mmengine - INFO - Epoch(train) [20][ 50/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:51:32 time: 0.248940 data_time: 0.032586 memory: 5710 loss: 0.533562 loss_kpt: 0.533562 acc_pose: 0.022992
2023/10/18 00:33:20 - mmengine - INFO - Epoch(train) [20][100/143] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 4:50:53 time: 0.232542 data_time: 0.013506 memory: 5710 loss: 0.529808 loss_kpt: 0.529808 acc_pose: 0.045873
2023/10/18 00:33:30 - mmengine - INFO - Exp name: rtmpose-m_8xb64-500e_ap10k-256x256_20231018_002040
2023/10/18 00:33:30 - mmengine - INFO - Saving checkpoint at 20 epochs
2023/10/18 00:33:34 - mmengine - INFO - Evaluating CocoMetric...
2023/10/18 00:33:35 - mmengine - INFO - Epoch(val) [20][40/40] coco/AP: 0.000001 coco/AP .5: 0.000006 coco/AP .75: 0.000000 coco/AP (M): 0.000000 coco/AP (L): 0.000001 coco/AR: 0.000018 coco/AR .5: 0.000182 coco/AR .75: 0.000000 coco/AR (M): 0.000000 coco/AR (L): 0.000018 data_time: 0.013320 time: 0.072213

Additional information

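  1. I expected the results to stay close to the official log, but at epoch 10 (iteration 100/143) the loss jumps from about 0.38 to 12.0 and acc_pose drops to about 0.005, as if the parameters had been reset. Training does not recover: coco/AP is still effectively 0 at epoch 20.
  2. The dataset I used for training is AP10K.

Tau-J commented 11 months ago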

The learning rate of RTMPose, 4e-3, is set for 8-GPU training (the 8xb64 in the config name means 8 GPUs with a batch size of 64 each), so you need to lower it when you train on a single GPU.
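
One common way to pick the adjusted value is the linear scaling rule: scale the learning rate by the ratio of the actual total batch size to the one the config was tuned for. The reply above does not name a specific rule, so treat the sketch below as an assumption rather than the maintainers' recommendation. The function name scale_lr is hypothetical, and the single-GPU batch size of 64 is assumed from the unchanged config.

```python
def scale_lr(base_lr: float, base_total_batch: int, actual_total_batch: int) -> float:
    """Linear scaling rule (a heuristic): lr grows in proportion to total batch size."""
    if base_total_batch <= 0 or actual_total_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * actual_total_batch / base_total_batch


# Values taken from the config name rtmpose-m_8xb64-...: 8 GPUs x 64 samples each.
base_lr = 4e-3
base_total_batch = 8 * 64

# Assumption: one GPU with the per-GPU batch size of 64 left unchanged.
actual_total_batch = 1 * 64

scaled_lr = scale_lr(base_lr, base_total_batch, actual_total_batch)
print(f"suggested single-GPU lr: {scaled_lr:.1e}")
```

Under these assumptions the suggested value is 4e-3 / 8 = 5e-4. It is only a starting point; if the loss still spikes once warmup ends, a smaller value may be needed.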