tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0
3.08k stars 486 forks

Segmentation fault:11 #336

Open dongzhenguo2016 opened 4 years ago

dongzhenguo2016 commented 4 years ago

`06-16 23:55:24 Epoch[0] Batch [3590] Iter: 3590/26046 Lr: 0.00500 Speed: 9.42 samples/sec Train-RpnAcc=0.997272, RpnL1=0.165742, RcnnAcc_1st=0.985713, RcnnL1_1st=0.604444, RcnnAcc_2nd=0.986624, RcnnL1_2nd=1.236113, RcnnAcc_3rd=0.984117, RcnnL1_3rd=1.859310,
06-16 23:55:28 Epoch[0] Batch [3600] Iter: 3600/26046 Lr: 0.00500 Speed: 9.50 samples/sec Train-RpnAcc=0.997278, RpnL1=0.165552, RcnnAcc_1st=0.985734, RcnnL1_1st=0.603507, RcnnAcc_2nd=0.986646, RcnnL1_2nd=1.234198, RcnnAcc_3rd=0.984152, RcnnL1_3rd=1.856836,

Segmentation fault: 11` I recently encountered this error while training cascade_r101v1_fpn_1x. How can I solve it? It seems strange. My platform is Ubuntu 16.04 with mxnet-cu100 1.6.0.

RogerChern commented 4 years ago

The proposal operator has some problems handling invalid input, which leads to a segmentation fault when the input contains NaN. This means your Cascade R-CNN heads or the RPN head have blown up. You can try lowering the learning rate for your task.
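To confirm that divergence is the cause, one option is to check the logged values for NaN or Inf before the crash. The following is a minimal sketch in plain Python, not part of simpledet: the helper name `first_non_finite` and the `diverged` example are hypothetical, and the healthy values are copied from the Batch [3600] log line in the original report.

```python
import math


def first_non_finite(metrics):
    """Return the name of the first metric that is NaN or Inf, else None."""
    for name, value in metrics.items():
        if not math.isfinite(value):
            return name
    return None


# Loss values copied from the Batch [3600] log line in the report.
healthy = {
    "RpnL1": 0.165552,
    "RcnnL1_1st": 0.603507,
    "RcnnL1_2nd": 1.234198,
    "RcnnL1_3rd": 1.856836,
}

# Hypothetical example of a head that has diverged (not from the log).
diverged = dict(healthy, RcnnL1_3rd=float("nan"))

print(first_non_finite(healthy))
print(first_non_finite(diverged))
```

A check like this only tells you which value went bad first. It does not prevent the crash inside the proposal operator.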

dongzhenguo2016 commented 4 years ago

Yes, reducing the learning rate does solve this problem. But after lowering the learning rate from 0.01 to 0.001, I found that mAP dropped by 1 point, which is not the result I want. I think the local optimum reached with the reduced learning rate is not as good as the one reached with the larger learning rate. Below is my config after adjusting the learning rate:

    class OptimizeParam:
        class optimizer:
            type = "sgd"
            lr = 0.001 / 8 * len(KvstoreParam.gpus) * KvstoreParam.batch_image
            momentum = 0.9
            wd = 0.0001
            clip_gradient = None
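The `lr` expression in the config above scales a base value linearly with the total number of images per step. The following is a minimal sketch of that arithmetic. The function name `effective_lr` is hypothetical, and the GPU count and per-GPU batch size are assumed values, since the thread does not state them.

```python
def effective_lr(base_lr, num_gpus, batch_image):
    """Mirror the config expression: base_lr / 8 * len(gpus) * batch_image."""
    return base_lr / 8 * num_gpus * batch_image


# Hypothetical setup: 8 GPUs with 2 images per GPU (not stated in the thread).
lr_before = effective_lr(0.01, num_gpus=8, batch_image=2)
lr_after = effective_lr(0.001, num_gpus=8, batch_image=2)

print(lr_before, lr_after)
```

Because the hardware terms cancel out, changing the base value from 0.01 to 0.001 reduces the effective learning rate tenfold on any setup.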