jiwoon-ahn / irn

Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations, CVPR 2019 (Oral)

Log files for training #13

Closed adityaarun1 closed 4 years ago

adityaarun1 commented 4 years ago

Hi, can you share the log files from your training? I am unable to reproduce the performance of IRN reported in the paper using the default hyper-parameters (also mentioned here [Link]).

For instance segmentation, instead of 37.7 mAP@0.5, I am getting the following:

step.eval_ins_seg: Wed Aug 14 09:55:44 2019
0.5iou: {'ap': array([0.0402722 , 0.        , 0.04831983, 0.02532846, 0.01264213,
       0.21497569, 0.13079764, 0.06767052, 0.00229753, 0.08129419,
       0.01570647, 0.05994737, 0.03092302, 0.26370536, 0.02019956,
       0.02099569, 0.0646912 , 0.16558015, 0.23535844, 0.1566734 ]), 'map': 0.08286894241843508}

and for semantic segmentation, instead of 66.5 mIOU, I am getting:

step.eval_sem_seg: Wed Aug 14 10:15:06 2019
0.12114407058527121 0.08625727491374735
0.2459830480445712 0.30624211370783205
{'iou': array([0.79259865, 0.43975817, 0.27018399, 0.42519734, 0.34189571,
       0.43639392, 0.57453956, 0.48851971, 0.41510347, 0.26892431,
       0.54274295, 0.37697739, 0.40495999, 0.47331797, 0.5605337 ,
       0.51401678, 0.39511615, 0.63538235, 0.40350322, 0.50775112,
       0.48067896]), 'miou': 0.4641950199739483}

Thanks.

adityaarun1 commented 4 years ago

I figured out the issue. When I use a single GPU to train IRN, it works fine; when training on multiple GPUs, the results are bad. I am closing this issue. It would also be nice if someone could figure out what's wrong with adding torch.nn.DataParallel to IRN [here].
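For reference, a minimal self-contained sketch of where a torch.nn.DataParallel wrapper would go in a training script like train_irn.py. `ToyNet` and the variable names are placeholders for illustration, not the repository's actual model or training loop:

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Stand-in for the IRN model; NOT the repository's network."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        return self.conv(x).mean(dim=(1, 2, 3))  # one scalar per sample

model = ToyNet()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # the input batch is scattered across the GPUs
if torch.cuda.is_available():
    model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model.train()

imgs = torch.randn(16, 3, 64, 64)
if torch.cuda.is_available():
    imgs = imgs.cuda()

# Outputs (and any per-sample losses computed inside forward) are gathered
# from all replicas, so reduce them before calling backward().
loss = model(imgs).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```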

zhaohui-yang commented 4 years ago

@adityaarun1 Have you solved this problem? I used 4 GPUs to run train_irn.py in parallel and encountered the same problem as you. However, my Titan with 12 GB could only train IRN with batch_size = 16, which achieved 35.7 mAP. I wonder how you solved this problem. Thanks!

adityaarun1 commented 4 years ago

@zhaohui-yang No, I haven't been able to solve this. Adding torch.nn.DataParallel works fine during training, but I am unable to replicate the results with it.

I also tried running on a larger GPU (V100) using the default hyper-parameters. Across various runs, the best accuracy I have been able to achieve is 36.7 mAP. There still seems to be a ~1% gap.

zhaohui-yang commented 4 years ago

Yes, I observed that the loss decreases as usual, and I'm not sure what the reason is. I used multiple GPUs for parallel training and a single GPU for evaluation, and the problem persists. I think several factors may be involved:

  1. The scatter and gather operations. For parallel training, the data and targets are automatically split into n_gpus chunks and computed separately. I am going to check the data shapes and the data itself.

  2. An incorrect data-target pairing. Training may still converge but with the wrong targets. (Personally, I don't think this is the reason.)

  3. An incorrect forward mode. The forward function runs resnet50.forward() in eval() mode while edge_model and dp_model run in train() mode. If you are familiar with classification tasks: when training looks correct but evaluation behaves strangely, the problem is usually the BN statistics. I am going to split resnet50_irn into three networks: resnet50 + edge_model + dp_model. (A generic sketch of freezing the backbone's BN follows this list.)
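To illustrate the concern in point 3, here is a generic sketch, not the repository's code, of keeping a backbone's BatchNorm layers frozen in eval() mode even when the training loop switches the whole model to train() mode. `FrozenBNBackbone` and the toy sub-networks are made up for illustration:

```python
import torch.nn as nn

class FrozenBNBackbone(nn.Module):
    """Keeps the backbone's BatchNorm layers in eval() mode (frozen running
    statistics) while the rest of the model trains normally."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone = backbone
        self.head = head

    def train(self, mode=True):
        super().train(mode)
        # Re-freeze BN after every .train() call so per-GPU batch statistics
        # are never used for the backbone.
        for m in self.backbone.modules():
            if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
                m.eval()
        return self

    def forward(self, x):
        return self.head(self.backbone(x))

# Tiny usage example with placeholder sub-networks.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
head = nn.Conv2d(8, 2, 1)
model = FrozenBNBackbone(backbone, head)
model.train()  # backbone BN stays in eval() mode
```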

Thank you for your advice!

adityaarun1 commented 4 years ago

I have checked for point 2 and it is fine. Point 3 can be an issue, but it seems to work fine on a single GPU, so I am not sure what goes wrong if you train on multiple GPUs but test on one.

zhaohui-yang commented 4 years ago

I looked at the loss curves. Both single-GPU and multi-GPU training converge, but the single-GPU loss converges to ~0.37 while the multi-GPU loss converges to ~0.44. I think something is incorrect in the training stage, maybe point 3 or the optimizer. Not sure.

adityaarun1 commented 4 years ago

Yes, I have observed the same. But in my opinion that still does not explain the big difference in the results.

jiwoon-ahn commented 4 years ago

The MeanShift layer is somewhat dependent on the batch size per GPU, which might cause the problem: https://github.com/jiwoon-ahn/irn/blob/master/net/resnet50_irn.py#L99
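For context, a simplified illustrative sketch of a mean-shift-style layer that tracks a running mean of its input; this is an approximation for discussion, not the exact code at resnet50_irn.py#L99. The point is that any statistic computed from the per-replica batch under nn.DataParallel sees only batch_size / n_gpus samples:

```python
import torch
import torch.nn as nn

class MeanShiftSketch(nn.Module):
    """Illustrative only: subtracts a running mean of the input features.

    Under nn.DataParallel the batch is split across GPUs, so the per-replica
    statistic is computed from batch_size / n_gpus samples. In addition, the
    module is re-replicated every forward pass and only the replica on
    device[0] shares buffers with the original, so the running mean is
    effectively updated from a single GPU's (smaller) sub-batch.
    """
    def __init__(self, num_features=2, momentum=0.01):
        super().__init__()
        self.momentum = momentum
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):  # x: (N, C, H, W)
        if self.training:
            batch_mean = x.detach().mean(dim=(0, 2, 3))
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return x - self.running_mean.view(1, -1, 1, 1)
```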

jiwoon-ahn commented 4 years ago

@zhaohui-yang @adityaarun1, I have just added data parallelism for IRNet training, and confirmed that it still reproduces the results in the paper.

step.train_cam: Tue Sep 17 14:26:39 2019
Epoch 1/5
step: 0/ 3305 loss:0.6661 imps:0.3 lr: 0.1000 etc:Thu Sep 19 09:26:40 2019
step: 100/ 3305 loss:0.1896 imps:25.8 lr: 0.0973 etc:Tue Sep 17 15:01:01 2019
step: 200/ 3305 loss:0.1140 imps:41.0 lr: 0.0945 etc:Tue Sep 17 14:48:17 2019
step: 300/ 3305 loss:0.0935 imps:51.2 lr: 0.0918 etc:Tue Sep 17 14:44:01 2019
step: 400/ 3305 loss:0.0898 imps:58.5 lr: 0.0890 etc:Tue Sep 17 14:41:53 2019
step: 500/ 3305 loss:0.0826 imps:63.9 lr: 0.0863 etc:Tue Sep 17 14:40:36 2019
step: 600/ 3305 loss:0.0831 imps:68.2 lr: 0.0835 etc:Tue Sep 17 14:39:45 2019
validating ... loss: 0.0757
Epoch 2/5
step: 700/ 3305 loss:0.0773 imps:40.1 lr: 0.0807 etc:Tue Sep 17 14:41:25 2019
step: 800/ 3305 loss:0.0720 imps:70.7 lr: 0.0779 etc:Tue Sep 17 14:40:40 2019
step: 900/ 3305 loss:0.0706 imps:81.0 lr: 0.0751 etc:Tue Sep 17 14:40:06 2019
step: 1000/ 3305 loss:0.0715 imps:86.0 lr: 0.0723 etc:Tue Sep 17 14:39:38 2019
step: 1100/ 3305 loss:0.0708 imps:89.1 lr: 0.0695 etc:Tue Sep 17 14:39:16 2019
step: 1200/ 3305 loss:0.0646 imps:91.2 lr: 0.0666 etc:Tue Sep 17 14:38:57 2019
step: 1300/ 3305 loss:0.0659 imps:92.7 lr: 0.0638 etc:Tue Sep 17 14:38:41 2019
validating ... loss: 0.0647
Epoch 3/5
step: 1400/ 3305 loss:0.0609 imps:54.5 lr: 0.0609 etc:Tue Sep 17 14:39:42 2019
step: 1500/ 3305 loss:0.0570 imps:73.4 lr: 0.0580 etc:Tue Sep 17 14:39:26 2019
step: 1600/ 3305 loss:0.0582 imps:81.5 lr: 0.0551 etc:Tue Sep 17 14:39:11 2019
step: 1700/ 3305 loss:0.0575 imps:86.1 lr: 0.0522 etc:Tue Sep 17 14:38:58 2019
step: 1800/ 3305 loss:0.0576 imps:88.9 lr: 0.0493 etc:Tue Sep 17 14:38:46 2019
step: 1900/ 3305 loss:0.0532 imps:91.0 lr: 0.0463 etc:Tue Sep 17 14:38:36 2019
validating ... loss: 0.0548
Epoch 4/5
step: 2000/ 3305 loss:0.0531 imps:22.7 lr: 0.0433 etc:Tue Sep 17 14:39:16 2019
step: 2100/ 3305 loss:0.0481 imps:66.3 lr: 0.0403 etc:Tue Sep 17 14:39:05 2019
step: 2200/ 3305 loss:0.0457 imps:79.0 lr: 0.0373 etc:Tue Sep 17 14:38:55 2019
step: 2300/ 3305 loss:0.0467 imps:85.0 lr: 0.0343 etc:Tue Sep 17 14:38:46 2019
step: 2400/ 3305 loss:0.0503 imps:88.4 lr: 0.0312 etc:Tue Sep 17 14:38:38 2019
step: 2500/ 3305 loss:0.0480 imps:90.7 lr: 0.0281 etc:Tue Sep 17 14:38:30 2019
step: 2600/ 3305 loss:0.0448 imps:92.4 lr: 0.0249 etc:Tue Sep 17 14:38:23 2019
validating ... loss: 0.0515
Epoch 5/5
step: 2700/ 3305 loss:0.0442 imps:48.7 lr: 0.0217 etc:Tue Sep 17 14:38:53 2019
step: 2800/ 3305 loss:0.0375 imps:72.9 lr: 0.0184 etc:Tue Sep 17 14:38:46 2019
step: 2900/ 3305 loss:0.0418 imps:81.9 lr: 0.0151 etc:Tue Sep 17 14:38:39 2019
step: 3000/ 3305 loss:0.0417 imps:86.7 lr: 0.0117 etc:Tue Sep 17 14:38:33 2019
step: 3100/ 3305 loss:0.0386 imps:89.6 lr: 0.0082 etc:Tue Sep 17 14:38:27 2019
step: 3200/ 3305 loss:0.0384 imps:91.5 lr: 0.0045 etc:Tue Sep 17 14:38:21 2019
step: 3300/ 3305 loss:0.0362 imps:93.0 lr: 0.0003 etc:Tue Sep 17 14:38:16 2019
validating ... loss: 0.0493
step.make_cam: Tue Sep 17 14:38:36 2019 [ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_cam: Tue Sep 17 14:43:39 2019
{'iou': array([0.79388312, 0.43113067, 0.28864309, 0.444585, 0.36172684, 0.46973761, 0.61380454, 0.54396673, 0.48715231, 0.28845281, 0.57491489, 0.40641602, 0.458491, 0.49758201, 0.61701023, 0.52529238, 0.42383762, 0.61912264, 0.44892162, 0.49405836, 0.46604461]), 'miou': 0.4883225761592359}
step.cam_to_ir_label: Tue Sep 17 14:44:05 2019 [ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 ]
step.train_irn: Tue Sep 17 14:59:55 2019
Epoch 1/5
step: 0/ 3305 loss:1.2434 0.2856 4.0134 0.1892 imps:0.7 lr: 0.1000 etc:Wed Sep 18 13:19:38 2019
step: 100/ 3305 loss:0.4934 0.4499 3.8754 0.1043 imps:17.6 lr: 0.0973 etc:Tue Sep 17 15:49:54 2019
step: 200/ 3305 loss:0.4166 0.3639 3.6527 0.1442 imps:22.1 lr: 0.0945 etc:Tue Sep 17 15:39:42 2019
step: 300/ 3305 loss:0.4041 0.3523 3.5186 0.1616 imps:24.2 lr: 0.0918 etc:Tue Sep 17 15:36:20 2019
step: 400/ 3305 loss:0.3994 0.3596 3.3185 0.2086 imps:25.4 lr: 0.0890 etc:Tue Sep 17 15:34:38 2019
step: 500/ 3305 loss:0.3839 0.3394 3.1721 0.2243 imps:26.2 lr: 0.0863 etc:Tue Sep 17 15:33:31 2019
step: 600/ 3305 loss:0.3940 0.3481 3.1202 0.2139 imps:26.8 lr: 0.0835 etc:Tue Sep 17 15:32:51 2019
Epoch 2/5
step: 700/ 3305 loss:0.3802 0.3347 3.0618 0.2176 imps:15.4 lr: 0.0807 etc:Tue Sep 17 15:33:59 2019
step: 800/ 3305 loss:0.3771 0.3315 3.0399 0.2230 imps:23.6 lr: 0.0779 etc:Tue Sep 17 15:33:23 2019
step: 900/ 3305 loss:0.3779 0.3291 3.0268 0.2232 imps:25.9 lr: 0.0751 etc:Tue Sep 17 15:32:57 2019
step: 1000/ 3305 loss:0.3765 0.3284 3.0080 0.2230 imps:27.0 lr: 0.0723 etc:Tue Sep 17 15:32:35 2019
step: 1100/ 3305 loss:0.3739 0.3317 2.9594 0.2122 imps:27.7 lr: 0.0695 etc:Tue Sep 17 15:32:15 2019
step: 1200/ 3305 loss:0.3791 0.3385 2.9742 0.2159 imps:28.1 lr: 0.0666 etc:Tue Sep 17 15:31:59 2019
step: 1300/ 3305 loss:0.3731 0.3293 2.9441 0.2143 imps:28.5 lr: 0.0638 etc:Tue Sep 17 15:31:45 2019
Epoch 3/5
step: 1400/ 3305 loss:0.3746 0.3318 2.9068 0.2132 imps:20.2 lr: 0.0609 etc:Tue Sep 17 15:32:24 2019
step: 1500/ 3305 loss:0.3722 0.3240 2.9324 0.2056 imps:24.6 lr: 0.0580 etc:Tue Sep 17 15:32:13 2019
step: 1600/ 3305 loss:0.3630 0.3207 2.9220 0.2053 imps:26.2 lr: 0.0551 etc:Tue Sep 17 15:32:03 2019
step: 1700/ 3305 loss:0.3734 0.3279 2.8887 0.2145 imps:27.2 lr: 0.0522 etc:Tue Sep 17 15:31:53 2019
step: 1800/ 3305 loss:0.3639 0.3170 2.8827 0.2084 imps:27.8 lr: 0.0493 etc:Tue Sep 17 15:31:42 2019
step: 1900/ 3305 loss:0.3662 0.3194 2.8690 0.2112 imps:28.2 lr: 0.0463 etc:Tue Sep 17 15:31:33 2019
Epoch 4/5
step: 2000/ 3305 loss:0.3596 0.3165 2.8963 0.2054 imps:9.7 lr: 0.0433 etc:Tue Sep 17 15:32:00 2019
step: 2100/ 3305 loss:0.3636 0.3154 2.8559 0.2216 imps:22.8 lr: 0.0403 etc:Tue Sep 17 15:31:52 2019
step: 2200/ 3305 loss:0.3569 0.3119 2.8304 0.2085 imps:25.5 lr: 0.0373 etc:Tue Sep 17 15:31:47 2019
step: 2300/ 3305 loss:0.3651 0.3224 2.8433 0.2046 imps:26.6 lr: 0.0343 etc:Tue Sep 17 15:31:41 2019
step: 2400/ 3305 loss:0.3563 0.3121 2.8420 0.2105 imps:27.3 lr: 0.0312 etc:Tue Sep 17 15:31:35 2019
step: 2500/ 3305 loss:0.3537 0.3078 2.8178 0.2024 imps:27.8 lr: 0.0281 etc:Tue Sep 17 15:31:30 2019
step: 2600/ 3305 loss:0.3619 0.3137 2.8092 0.2042 imps:28.1 lr: 0.0249 etc:Tue Sep 17 15:31:25 2019
Epoch 5/5
step: 2700/ 3305 loss:0.3569 0.3068 2.8103 0.1992 imps:18.2 lr: 0.0217 etc:Tue Sep 17 15:31:45 2019
step: 2800/ 3305 loss:0.3515 0.3021 2.7876 0.2031 imps:24.1 lr: 0.0184 etc:Tue Sep 17 15:31:41 2019
step: 2900/ 3305 loss:0.3555 0.3133 2.7876 0.1998 imps:26.1 lr: 0.0151 etc:Tue Sep 17 15:31:36 2019
step: 3000/ 3305 loss:0.3538 0.3070 2.7692 0.1936 imps:27.1 lr: 0.0117 etc:Tue Sep 17 15:31:31 2019
step: 3100/ 3305 loss:0.3605 0.3197 2.7654 0.1988 imps:27.6 lr: 0.0082 etc:Tue Sep 17 15:31:28 2019
step: 3200/ 3305 loss:0.3464 0.3009 2.7539 0.1878 imps:28.1 lr: 0.0045 etc:Tue Sep 17 15:31:23 2019
step: 3300/ 3305 loss:0.3555 0.3044 2.7294 0.1937 imps:28.4 lr: 0.0003 etc:Tue Sep 17 15:31:19 2019
Analyzing displacements mean ... done.
step.make_ins_seg_labels: Tue Sep 17 15:32:01 2019 [ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_ins_seg: Tue Sep 17 15:43:17 2019
0.5iou: {'ap': array([0.36661056, 0.00547154, 0.574922, 0.30045993, 0.15886656, 0.5995807, 0.37568754, 0.67267337, 0.05372927, 0.51445742, 0.17437114, 0.56820096, 0.59092809, 0.47832304, 0.2252956, 0.07792715, 0.35047058, 0.36736559, 0.4721443, 0.57766263]), 'map': 0.37525739839692324}
step.make_sem_seg_labels: Tue Sep 17 15:43:59 2019 [ 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 ]
step.eval_sem_seg: Tue Sep 17 15:53:27 2019
0.07191166029597273 0.046077651513210416
0.14463595416636954 0.20396592048661016
{'iou': array([0.88201069, 0.66970496, 0.35053706, 0.77762538, 0.60824307, 0.61367099, 0.80644066, 0.71936881, 0.75847351, 0.35081316, 0.79498655, 0.42315924, 0.7327815, 0.77665314, 0.76811858, 0.68669326, 0.53031597, 0.81550075, 0.58572881, 0.65244931, 0.60669782]), 'miou': 0.6623796759586297}
completed train process

adityaarun1 commented 4 years ago

@jiwoon-ahn Thanks. This helps. 😃

zhaohui-yang commented 4 years ago

@jiwoon-ahn Thanks! Everything's fine!