hustvl / SparseInst

[CVPR 2022] SparseInst: Sparse Instance Activation for Real-Time Instance Segmentation
MIT License

Time consumption of your model #1

Closed Yangr116 closed 2 years ago

Yangr116 commented 2 years ago

I have a question about the training time when training for 270k iterations on the COCO dataset: batch_size (64) * 270k iterations / 118k images ≈ 146 epochs. So it may be an unfair comparison with other models.
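A quick sanity check of the arithmetic above; the train2017 image count and the detectron2-style key names in the comments are the only assumptions here:

```python
# Rough epoch count implied by an iteration-based schedule (numbers from the comment above).
ims_per_batch = 64            # SOLVER.IMS_PER_BATCH in the 270k-schedule config
max_iter = 270_000            # SOLVER.MAX_ITER
coco_train_images = 118_287   # images in COCO train2017 (approximated as 118k above)

print(ims_per_batch * max_iter / coco_train_images)  # ~146 effective epochs
```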

wondervictor commented 2 years ago

Thanks for your interest in this work! Taking sparse_inst_r50vd_giam.yaml (SparseInst-R50vd, G-IAM) as an example, it takes about two days (46 hours) to finish training with 8 NVIDIA 2080 Ti GPUs (training log), which is acceptable for a 3x schedule, e.g., Mask R-CNN. Indeed, SparseInst, as a new framework, requires more training epochs for convergence, which is a bit different from previous methods. Besides, extending the training schedule still brings significant improvements.

In comparison, we select the highest results for other real-time methods regardless of their schedules, augmentations, and backbones. Some methods adopt COCO-Detection pretrained weights or also require longer schedules, while others show limited improvements with more iterations/epochs of training. A 3x schedule is adequate for most methods (see, e.g., [1]), but not for SparseInst. Compared to other methods, SparseInst achieves faster speed with highly competitive accuracy. Moreover, we are still working on accelerating the convergence of SparseInst 😊.

If you prefer a shorter and faster schedule, you could reduce the batch size or the number of iterations (a rough sketch of such an override is given after the reference below). As far as I know, reducing the batch size from 64 to 32 brings a 0.8~1.2 AP drop.

[1] He et al. Rethinking ImageNet Pre-training. ICCV, 2019.
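As a concrete but hypothetical illustration of the last suggestion: the batch size and iteration count are ordinary detectron2 SOLVER options, so a shorter run could be sketched as below. The specific numbers and the linear LR scaling are assumptions for illustration, not the authors' recipe; in practice these values would typically be set in the project's YAML config or passed as command-line overrides to its training script.

```python
# A minimal sketch, assuming plain detectron2 SOLVER keys; SparseInst's own
# config additions and YAML are omitted here, so this is not a drop-in recipe.
from detectron2.config import get_cfg

cfg = get_cfg()

# Reference 270k schedule discussed in this thread.
cfg.SOLVER.IMS_PER_BATCH = 64
cfg.SOLVER.MAX_ITER = 270_000

# Option 1: halve the batch size (the reply above estimates a 0.8~1.2 AP drop for 64 -> 32).
cfg.SOLVER.IMS_PER_BATCH = 32
cfg.SOLVER.BASE_LR *= 32 / 64   # linear LR scaling heuristic -- an assumption, not from the authors

# Option 2: simply train for fewer iterations, trading accuracy for wall-clock time.
cfg.SOLVER.MAX_ITER = 180_000   # example value, not a recommended setting
```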

Yangr116 commented 2 years ago

Thanks for your reply, I see~

feiyuhuahuo commented 2 years ago

@wondervictor Do you know why SparseInst converges slowly? I have also found that some networks converge more slowly than others. Is there any common reason?

wondervictor commented 2 years ago

Hi @feiyuhuahuo! The main reason for the slow convergence of SparseInst is that the instance activation maps require more iterations to learn where and how to highlight objects through the bipartite matching loss; DETR shows a similar phenomenon. In addition, SparseInst does not adopt any spatial priors to facilitate convergence [1,2], although that might be a clue to accelerating it.

Moreover, we find that the Hungarian algorithm, based on optimal matching, will suppress some highly confident predictions as negative samples, or match a prediction that has a high matching score with one ground-truth object to another ground-truth object (a small sketch of this behavior follows the references below). These two cases slow down convergence during training. In addition, each ground-truth object can only be assigned one prediction/sample, which also slows convergence ([3] adopts multiple positive samples to accelerate training and improve performance).

As for the second question, could you provide more detailed cases?

[1] Meng et al. Conditional DETR for Fast Training Convergence. ICCV, 2021.
[2] Gao et al. Fast Convergence of DETR with Spatially Modulated Co-Attention. ICCV, 2021.
[3] Ge et al. YOLOX: Exceeding YOLO Series in 2021. arXiv, 2021.
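To make the matching behavior above concrete, here is a tiny, self-contained sketch (not SparseInst's loss code) using SciPy's Hungarian solver on a made-up score matrix. It shows the one-to-one assignment leaving a fairly confident prediction unmatched and pushing another onto its second-best ground truth:

```python
# Toy illustration of Hungarian (optimal one-to-one) matching; the scores are invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = predictions, columns = ground-truth objects (higher = better match).
scores = np.array([
    [0.90, 0.20],   # prediction 0: strongly prefers GT 0
    [0.85, 0.30],   # prediction 1: also prefers GT 0
    [0.60, 0.05],   # prediction 2: fairly confident about GT 0 as well
])

# The solver minimizes cost, so negate the scores to maximize the total match score.
pred_idx, gt_idx = linear_sum_assignment(-scores)
print(pred_idx, gt_idx)  # -> [0 1] [0 1]

# Result: prediction 0 -> GT 0, prediction 1 -> GT 1 (its second choice),
# and prediction 2 is left unmatched and treated as a negative sample,
# which is the kind of assignment that slows convergence.
```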

dongbo811 commented 1 year ago

Hello, why does SparseInst take about 7 days to train on 8*A100 GPUs?

It takes 14 days on 8*V100 GPUs.

Is there some mistake in my setup? The log looks like this:

[10/11 13:37:24 d2.utils.events]: eta: 7 days, 18:11:51 iter: 19 total_loss: 7.083 loss_ce: 2.253 loss_objectness: 0.7968 loss_dice: 1.853 loss_mask: 2.188 time: 2.5095 data_time: 1.4983 lr: 9.9905e-07 max_mem: 6130M
[10/11 13:38:14 d2.utils.events]: eta: 7 days, 18:39:47 iter: 39 total_loss: 5.989 loss_ce: 2.249 loss_objectness: 0.7798 loss_dice: 1.935 loss_mask: 1.007 time: 2.5097 data_time: 0.0564 lr: 1.998e-06 max_mem: 7356M
[10/11 13:39:05 d2.utils.events]: eta: 7 days, 21:21:05 iter: 59 total_loss: 5.662 loss_ce: 2.25 loss_objectness: 0.6874 loss_dice: 1.877 loss_mask: 0.8537 time: 2.5263 data_time: 0.0570 lr: 2.997e-06 max_mem: 7432M
[10/11 13:39:54 d2.utils.events]: eta: 7 days, 18:34:02 iter: 79 total_loss: 5.394 loss_ce: 2.238 loss_objectness: 0.5633 loss_dice: 1.832 loss_mask: 0.7363 time: 2.5130 data_time: 0.0586 lr: 3.9961e-06 max_mem: 7432M
[10/11 13:40:43 d2.utils.events]: eta: 7 days, 18:00:48 iter: 99 total_loss: 5.045 loss_ce: 2.207 loss_objectness: 0.3569 loss_dice: 1.76 loss_mask: 0.7324 time: 2.4961 data_time: 0.0574 lr: 4.995e-06 max_mem: 7432M
[10/11 13:41:33 d2.utils.events]: eta: 7 days, 18:16:42 iter: 119 total_loss: 4.731 loss_ce: 2.101 loss_objectness: 0.2506 loss_dice: 1.737 loss_mask: 0.6552 time: 2.4927 data_time: 0.0565 lr: 5.9941e-06 max_mem: 7432M
[10/11 13:42:23 d2.utils.events]: eta: 7 days, 18:15:53 iter: 139 total_loss: 4.399 loss_ce: 1.899 loss_objectness: 0.2546 loss_dice: 1.696 loss_mask: 0.5409 time: 2.4965 data_time: 0.0575 lr: 6.9931e-06 max_mem: 7432M
[10/11 13:43:14 d2.utils.events]: eta: 7 days, 18:42:10 iter: 159 total_loss: 4.277 loss_ce: 1.731 loss_objectness: 0.2617 loss_dice: 1.674 loss_mask: 0.5749 time: 2.5035 data_time: 0.0538 lr: 7.9921e-06 max_mem: 7432M
[10/11 13:44:06 d2.utils.events]: eta: 7 days, 19:14:03 iter: 179 total_loss: 3.99 loss_ce: 1.591 loss_objectness: 0.2695 loss_dice: 1.644 loss_mask: 0.5031 time: 2.5150 data_time: 0.0566 lr: 8.991e-06 max_mem: 7432M
[10/11 13:44:56 d2.utils.events]: eta: 7 days, 18:13:24 iter: 199 total_loss: 3.867 loss_ce: 1.482 loss_objectness: 0.2982 loss_dice: 1.596 loss_mask: 0.5051 time: 2.5147 data_time: 0.0605 lr: 9.99e-06 max_mem: 7432M
[10/11 13:45:46 d2.utils.events]: eta: 7 days, 18:00:06 iter: 219 total_loss: 3.752 loss_ce: 1.399 loss_objectness: 0.31
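For reference, a back-of-the-envelope check using only numbers already quoted in this thread suggests the 7-day ETA is simply the reported ~2.5 s/iter multiplied out, so the gap versus the 46-hour run mentioned earlier lies entirely in the per-iteration time, not in the schedule or the ETA computation:

```python
# Rough timing check using only numbers quoted in this thread.
max_iter = 270_000              # iterations in the 270k schedule

sec_per_iter = 2.51             # "time: 2.5095" from the log above
print(sec_per_iter * max_iter / 86_400)    # ~7.8 days, matching the eta in the log

reference_hours = 46            # the 8x 2080 Ti run mentioned earlier
print(reference_hours * 3600 / max_iter)   # ~0.61 s/iter for that run
```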