the training duration of relationnet ++

sure7018 commented 3 years ago

Hello, how long is RelationNet++ expected to take to train? Why does training on the COCO dataset with a single GPU take so long for me?

2021-05-08 14:40:36,476 - mmdet - INFO - workflow: [('train', 1)], max: 20 epochs
2021-05-08 14:43:38,226 - mmdet - INFO - Epoch [1][50/58633] lr: 9.890e-04, eta: 49 days, 7:57:52, time: 3.635, data_time: 0.054, memory: 9028, kpt_loss_point_cls: 1.1461, kpt_loss_point_offset: 0.0875, bbox_loss_cls: 1.2137, bbox_loss_bbox: 0.7130, loss: 3.1603
2021-05-08 14:47:02,436 - mmdet - INFO - Epoch [1][100/58633] lr: 1.988e-03, eta: 52 days, 9:05:35, time: 4.084, data_time: 0.006, memory: 9566, kpt_loss_point_cls: 1.1375, kpt_loss_point_offset: 0.0869, bbox_loss_cls: 1.2221, bbox_loss_bbox: 0.7509, loss: 3.1974
2021-05-08 14:50:23,529 - mmdet - INFO - Epoch [1][150/58633] lr: 2.987e-03, eta: 53 days, 2:39:41, time: 4.022, data_time: 0.005, memory: 9566, kpt_loss_point_cls: 1.1513, kpt_loss_point_offset: 0.0866, bbox_loss_cls: 1.2318, bbox_loss_bbox: 0.7351, loss: 3.2049
2021-05-08 14:53:27,445 - mmdet - INFO - Epoch [1][200/58633] lr: 3.986e-03, eta: 52 days, 7:26:42, time: 3.678, data_time: 0.005, memory: 9566, kpt_loss_point_cls: 1.1245, kpt_loss_point_offset: 0.0859, bbox_loss_cls: 1.1802, bbox_loss_bbox: 0.6615, loss: 3.0520
2021-05-08 14:56:36,176 - mmdet - INFO - Epoch [1][250/58633] lr: 4.985e-03, eta: 52 days, 2:10:09, time: 3.775, data_time: 0.005, memory: 9566, kpt_loss_point_cls: 1.0461, kpt_loss_point_offset: 0.0853, bbox_loss_cls: 1.2102, bbox_loss_bbox: 0.7289, loss: 3.0704
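The eta in the log follows from simple arithmetic on the numbers it prints: iterations per epoch, the epoch limit, and seconds per iteration. The sketch below reproduces the first eta value; `estimate_eta` is a hypothetical helper written for illustration, not an MMDetection function, and it uses the single logged `time` value where the real logger averages over recent iterations, so the result only matches approximately.

```python
from datetime import timedelta


def estimate_eta(iters_per_epoch, max_epochs, done_iters, sec_per_iter):
    """Rough remaining training time from per-iteration wall-clock time."""
    remaining_iters = iters_per_epoch * max_epochs - done_iters
    return timedelta(seconds=round(remaining_iters * sec_per_iter))


# Values taken from the first "Epoch [1][50/58633]" log line above.
eta = estimate_eta(iters_per_epoch=58633, max_epochs=20,
                   done_iters=50, sec_per_iter=3.635)
print(eta)  # roughly 49 days, in line with the logged eta
```

With about 1.17 million iterations in total, every second shaved off the per-iteration time saves roughly two weeks, which is why lighter settings matter so much on a single GPU.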

shinya7y commented 3 years ago

Which config did you use?

sure7018 commented 3 years ago

I used bvr_retinanet_x101_fpn_dcn_mstrain_400_1200_20e_coco.py:

python tools/train.py configs/bvr/bvr_retinanet_x101_fpn_dcn_mstrain_400_1200_20e_coco.py

shinya7y commented 3 years ago

The config has many heavy settings.

Please try the following:
- Res2Net-50 or Res2Net-101
- stage_with_dcn=(False, False, False, True),
- ../_base_/datasets/coco_detection_mstrain_480_960.py
- with_cp=False
- fp16 = dict(loss_scale='dynamic') or fp16 = dict(loss_scale=512.)
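As a rough illustration, the suggestions above could be written as MMDetection-style config fragments like the following. This is a sketch under assumptions, not a config from the repository: `scales=4`, `base_width=26`, and the `dcn` dict are typical values from MMDetection's Res2Net and DCN configs and are not stated in this thread, and the fragments would still need to be merged into a BVR RetinaNet config that pulls in the dataset file named above.

```python
# Sketch only: lighter settings expressed as plain config dicts.
model = dict(
    backbone=dict(
        type='Res2Net',          # Res2Net-50; use depth=101 for Res2Net-101
        depth=50,
        scales=4,                # assumed typical value, not from the thread
        base_width=26,           # assumed typical value, not from the thread
        # Assumed DCN settings; the thread only specifies stage_with_dcn.
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, False, False, True),  # DCN in the last stage only
        with_cp=False,           # no gradient checkpointing: faster, more memory
    ))

# Mixed-precision training; loss_scale='dynamic' is the other suggested option.
fp16 = dict(loss_scale=512.)
```

When a file like this inherits from a config with a different backbone type, the backbone dict usually also needs `_delete_=True` so the old backbone's keys are not merged in.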

shinya7y commented 3 years ago

Even only for inference, RelationNet++ is slow on T4. I may verify the paper's FPS by benchmarks on V100.

sure7018 commented 3 years ago

Thank you for your reply. I used ResNet-50 for training, and the speed improved noticeably, but the accuracy is not as high as reported in the paper. Is that because I trained for only 12 epochs?

shinya7y commented 3 years ago

If you use bvr_retinanet_r50_fpn_gn_1x_coco.py, an AP of around 38.5 (the authors' result) is expected. Using the settings I recommended and training for 20 epochs should boost accuracy.

Please don't forget to change the learning rate according to the Linear Scaling Rule.

lr=0.01    for total batch size 16 (8 GPUs * 2 samples_per_gpu)
lr=0.00125 for total batch size 2  (1 GPU  * 2 samples_per_gpu)
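The two values above follow from scaling the learning rate in proportion to the total batch size. A minimal sketch of that calculation, where `scale_lr` is a hypothetical helper name and not part of MMDetection:

```python
def scale_lr(base_lr, base_batch_size, num_gpus, samples_per_gpu):
    """Linear Scaling Rule: the learning rate scales with total batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


# Reference setting from the comment above: lr=0.01 at total batch size 16.
single_gpu_lr = scale_lr(base_lr=0.01, base_batch_size=16,
                         num_gpus=1, samples_per_gpu=2)
print(single_gpu_lr)  # approximately 0.00125
```

In an MMDetection config, the result would typically be applied by overriding the optimizer's learning rate, for example `optimizer = dict(lr=0.00125)`.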

sure7018 commented 3 years ago

Thank you very much for your reply. I will try it.

sure7018 commented 3 years ago

> The config has many heavy settings.
>
> Please try the following:
> - Res2Net-50 or Res2Net-101
> - stage_with_dcn=(False, False, False, True),
> - ../_base_/datasets/coco_detection_mstrain_480_960.py
> - with_cp=False
> - fp16 = dict(loss_scale='dynamic') or fp16 = dict(loss_scale=512.)

Hello, will these modifications affect accuracy?

shinya7y commented 3 years ago

- Res2Net-50 or Res2Net-101 affects accuracy.
- stage_with_dcn=(False, False, False, True), affects accuracy.
- ../_base_/datasets/coco_detection_mstrain_480_960.py affects accuracy.
- with_cp=False should not affect accuracy.
- fp16 = dict(loss_scale='dynamic') or fp16 = dict(loss_scale=512.) is not expected to affect accuracy.

shinya7y / UniverseNet

the training duration of relationnet ++ #21