open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Cannot reimplement the results of CrowdDet. Why? #11061

Open Zzh-tju opened 1 year ago

Zzh-tju commented 1 year ago

I trained a model from the model zoo with the provided config configs/crowddet/crowddet-rcnn_r50_fpn_8xb2-30e_crowdhuman.py, but I cannot reproduce the reported 90.0 AP.

10/18 17:21:55 - mmengine - INFO - Epoch(train) [30][750/938]  lr: 2.0000e-05  eta: 0:01:22  time: 0.4372  data_time: 0.0080  memory: 4261  loss: 0.7304  loss_rpn_cls: 0.2674  loss_rpn_bbox: 0.0803  loss_rcnn_emd: 0.3827
10/18 17:22:16 - mmengine - INFO - Exp name: crowddet-rcnn_r50_fpn_8xb2-30e_20231018_122042
10/18 17:22:18 - mmengine - INFO - Epoch(train) [30][800/938]  lr: 2.0000e-05  eta: 0:01:00  time: 0.4701  data_time: 0.0410  memory: 5175  loss: 0.7124  loss_rpn_cls: 0.2569  loss_rpn_bbox: 0.0804  loss_rcnn_emd: 0.3750
10/18 17:22:42 - mmengine - INFO - Epoch(train) [30][850/938]  lr: 2.0000e-05  eta: 0:00:38  time: 0.4774  data_time: 0.0156  memory: 4261  loss: 0.7826  loss_rpn_cls: 0.2737  loss_rpn_bbox: 0.0831  loss_rcnn_emd: 0.4258
10/18 17:23:03 - mmengine - INFO - Epoch(train) [30][900/938]  lr: 2.0000e-05  eta: 0:00:16  time: 0.4203  data_time: 0.0719  memory: 4532  loss: 0.8183  loss_rpn_cls: 0.2798  loss_rpn_bbox: 0.0863  loss_rcnn_emd: 0.4522
10/18 17:23:20 - mmengine - INFO - Exp name: crowddet-rcnn_r50_fpn_8xb2-30e_20231018_122042
10/18 17:23:20 - mmengine - INFO - Saving checkpoint at 30 epochs
10/18 17:23:36 - mmengine - INFO - Epoch(val) [30][ 50/547]    eta: 0:02:19  time: 0.2817  data_time: 0.2119  memory: 5806
10/18 17:23:47 - mmengine - INFO - Epoch(val) [30][100/547]    eta: 0:01:49  time: 0.2097  data_time: 0.1370  memory: 1028
10/18 17:23:54 - mmengine - INFO - Epoch(val) [30][150/547]    eta: 0:01:23  time: 0.1372  data_time: 0.0691  memory: 1028
10/18 17:24:02 - mmengine - INFO - Epoch(val) [30][200/547]    eta: 0:01:08  time: 0.1622  data_time: 0.0757  memory: 1028
10/18 17:24:11 - mmengine - INFO - Epoch(val) [30][250/547]    eta: 0:00:57  time: 0.1792  data_time: 0.1128  memory: 1028
10/18 17:24:20 - mmengine - INFO - Epoch(val) [30][300/547]    eta: 0:00:47  time: 0.1849  data_time: 0.1096  memory: 1028
10/18 17:24:32 - mmengine - INFO - Epoch(val) [30][350/547]    eta: 0:00:39  time: 0.2397  data_time: 0.1650  memory: 1028
10/18 17:24:44 - mmengine - INFO - Epoch(val) [30][400/547]    eta: 0:00:30  time: 0.2416  data_time: 0.1562  memory: 1028
10/18 17:24:54 - mmengine - INFO - Epoch(val) [30][450/547]    eta: 0:00:19  time: 0.2067  data_time: 0.1307  memory: 1028
10/18 17:25:06 - mmengine - INFO - Epoch(val) [30][500/547]    eta: 0:00:09  time: 0.2229  data_time: 0.1422  memory: 1028
10/18 17:25:52 - mmengine - INFO - Evaluating AP...
10/18 17:25:53 - mmengine - INFO - Evaluating MR...
10/18 17:25:53 - mmengine - INFO - Evaluating JI...
10/18 17:26:31 - mmengine - INFO - Epoch(val) [30][547/547]    crowd_human/mAP: 0.8631  crowd_human/mMR: 0.4947  crowd_human/JI: 0.7547  data_time: 0.1244  time: 0.2006

The training set is exactly the CrowdHuman train set (15,000 images), and the test set is the CrowdHuman val set (4,370 images).

I don't know why my run only reaches 86.3 AP. I noticed that my loss_rcnn_emd values are lower than those in the log you provided.
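Comparing two logs field by field is easier once the numbers are pulled out of each line. The following is a minimal sketch, assuming the `key: value` layout of the mmengine lines quoted in this thread; the helper name `parse_fields` is hypothetical and not part of MMDetection.

```python
import re

# Matches "key: number" pairs such as "loss_rcnn_emd: 0.3827" or "lr: 2.0000e-05".
# The lookahead requires whitespace or end of line after the number, so values
# like "eta: 0:01:22" and timestamps are skipped.
FIELD_RE = re.compile(r"(\S+):\s+(-?\d+(?:\.\d+)?(?:e[+-]?\d+)?)(?=\s|$)")


def parse_fields(line):
    """Return a dict of the numeric fields found in one log line."""
    return {key: float(value) for key, value in FIELD_RE.findall(line)}


# Sample lines copied from the logs in this thread.
train_line = (
    "10/18 17:21:55 - mmengine - INFO - Epoch(train) [30][750/938]  "
    "lr: 2.0000e-05  eta: 0:01:22  time: 0.4372  data_time: 0.0080  "
    "memory: 4261  loss: 0.7304  loss_rpn_cls: 0.2674  "
    "loss_rpn_bbox: 0.0803  loss_rcnn_emd: 0.3827"
)
val_line = (
    "10/18 17:26:31 - mmengine - INFO - Epoch(val) [30][547/547]    "
    "crowd_human/mAP: 0.8631  crowd_human/mMR: 0.4947  "
    "crowd_human/JI: 0.7547  data_time: 0.1244  time: 0.2006"
)

train_fields = parse_fields(train_line)
val_fields = parse_fields(val_line)
print(train_fields["lr"], train_fields["loss_rcnn_emd"])
print(val_fields["crowd_human/mAP"])
```

Running the same helper over the reference log would let the `lr` and `loss_rcnn_emd` columns of both runs be compared iteration by iteration.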

Below is the end of the first epoch: I get 62.7 AP, while your log shows 75.2 AP.

10/18 12:28:00 - mmengine - INFO - Epoch(train)  [1][850/938]  lr: 2.0000e-03  eta: 3:28:40  time: 0.4152  data_time: 0.0667  memory: 4261  loss: 1.3130  loss_rpn_cls: 0.4203  loss_rpn_bbox: 0.1160  loss_rcnn_emd: 0.7767
10/18 12:28:23 - mmengine - INFO - Epoch(train)  [1][900/938]  lr: 2.0000e-03  eta: 3:28:25  time: 0.4639  data_time: 0.0433  memory: 4261  loss: 1.3516  loss_rpn_cls: 0.4069  loss_rpn_bbox: 0.1160  loss_rcnn_emd: 0.8287
10/18 12:28:39 - mmengine - INFO - Exp name: crowddet-rcnn_r50_fpn_8xb2-30e_20231018_122042
10/18 12:28:59 - mmengine - INFO - Epoch(val)  [1][ 50/547]    eta: 0:03:24  time: 0.4108  data_time: 0.2985  memory: 4482
10/18 12:29:11 - mmengine - INFO - Epoch(val)  [1][100/547]    eta: 0:02:22  time: 0.2274  data_time: 0.1060  memory: 1028
10/18 12:29:17 - mmengine - INFO - Epoch(val)  [1][150/547]    eta: 0:01:42  time: 0.1341  data_time: 0.0272  memory: 1028
10/18 12:29:26 - mmengine - INFO - Epoch(val)  [1][200/547]    eta: 0:01:22  time: 0.1739  data_time: 0.0451  memory: 1028
10/18 12:29:34 - mmengine - INFO - Epoch(val)  [1][250/547]    eta: 0:01:06  time: 0.1687  data_time: 0.0692  memory: 1028
10/18 12:29:44 - mmengine - INFO - Epoch(val)  [1][300/547]    eta: 0:00:53  time: 0.1951  data_time: 0.0789  memory: 1028
10/18 12:29:56 - mmengine - INFO - Epoch(val)  [1][350/547]    eta: 0:00:43  time: 0.2374  data_time: 0.1197  memory: 1028
10/18 12:30:08 - mmengine - INFO - Epoch(val)  [1][400/547]    eta: 0:00:32  time: 0.2320  data_time: 0.1047  memory: 1028
10/18 12:30:18 - mmengine - INFO - Epoch(val)  [1][450/547]    eta: 0:00:21  time: 0.2000  data_time: 0.0816  memory: 1028
10/18 12:30:29 - mmengine - INFO - Epoch(val)  [1][500/547]    eta: 0:00:10  time: 0.2212  data_time: 0.0933  memory: 1028
10/18 12:31:43 - mmengine - INFO - Evaluating AP...
10/18 12:31:45 - mmengine - INFO - Evaluating MR...
10/18 12:31:46 - mmengine - INFO - Evaluating JI...
10/18 12:32:27 - mmengine - INFO - Epoch(val) [1][547/547]    crowd_human/mAP: 0.6268  crowd_human/mMR: 0.8442  crowd_human/JI: 0.5288  data_time: 0.0984  time: 0.2170

hh23333 commented 1 year ago

Hi, have you solved the problem yet? I ran into the same issue.

Zzh-tju commented 1 year ago

@hh23333 No, I cannot figure it out.

hh23333 commented 1 year ago

@Zzh-tju Hi, it seems that the abnormal results are due to a different learning rate. When I changed the learning rate from 0.002 to 0.02, I got the reported results: [30][1093/1093] crowd_human/mAP: 0.8991 crowd_human/mMR: 0.4228 crowd_human/JI: 0.8009 data_time: 0.0111 time: 0.1743

bye111 commented 1 month ago

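Hi, it seems that the abnormal results are due to a different learning rate. When I changed the learning rate from 0.002 to 0.02, I got the reported results: [30][1093/1093] crowd_human/mAP: 0.8991 crowd_human/mMR: 0.4228 crowd_human/JI: 0.8009 data_time: 0.0111 time: 0.1743

Hello, may I ask whether you are using a single GPU?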
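On the GPU question, the iteration counters in the quoted logs give a rough hint about the global batch size of each run. This is a minimal sketch, assuming each iteration consumes one global batch and the last partial batch is kept, so that iterations = ceil(num_images / global_batch). It ignores sampler details, and the helper name is hypothetical.

```python
import math


def candidate_global_batches(num_images, iters_per_epoch, max_batch=256):
    """Return every global batch size B with ceil(num_images / B) == iters_per_epoch."""
    return [
        b for b in range(1, max_batch + 1)
        if math.ceil(num_images / b) == iters_per_epoch
    ]


# Numbers quoted in this thread (CrowdHuman: 15,000 train / 4,370 val images).
train_938 = candidate_global_batches(15000, 938)   # train counter [.../938]
val_547 = candidate_global_batches(4370, 547)      # val counter [.../547]
val_1093 = candidate_global_batches(4370, 1093)    # val counter [.../1093]
print(train_938, val_547, val_1093)  # prints: [16] [8] [4]
```

Under these assumptions the first run trained with a global batch of 16, consistent with the 8xb2 in the config name, and the two validation runs processed 8 and 4 images per step. The number of GPUs follows from this only if the per-GPU batch size is known, which the thread does not state.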