Open Shualite opened 3 years ago
I also changed the versions of torch and torchvision, but I still hit this problem.
My configuration is as follows:
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
Nvidia driver version: 450.66
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.17.0
[pip3] torch==1.2.0
[pip3] torchvision==0.4.0
[conda] Could not collect
Pillow (7.1.2)
2020-11-12 20:34:09,466 maskrcnn_benchmark INFO: Loaded configuration file configs/arpn_E2E/e2e_rrpn_R_50_C4_1x_train_AFPN_RT_LERB_Spotter.yaml
error:
2020-11-12 20:35:05,741 maskrcnn_benchmark.trainer INFO: eta: 3:13:40 iter: 80 loss: nan (nan) loss_classifier: nan (nan) loss_box_reg: nan (nan) loss_rec: nan (nan) loss_objectness: 0.0990 (0.0989) loss_rpn_box_reg: 0.0297 (0.0431) time: 0.1000 (0.1163) data: 0.0023 (0.0209) lr: 0.000009 max mem: 1273
INFO:maskrcnn_benchmark.trainer:eta: 3:13:40 iter: 80 loss: nan (nan) loss_classifier: nan (nan) loss_box_reg: nan (nan) loss_rec: nan (nan) loss_objectness: 0.0990 (0.0989) loss_rpn_box_reg: 0.0297 (0.0431) time: 0.1000 (0.1163) data: 0.0023 (0.0209) lr: 0.000009 max mem: 1273
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
2020-11-12 20:35:07,004 maskrcnn_benchmark.trainer INFO: eta: 3:15:29 iter: 90 loss: nan (nan) loss_classifier: nan (nan) loss_box_reg: nan (nan) loss_rec: nan (nan) loss_objectness: 0.0991 (0.0989) loss_rpn_box_reg: 0.0268 (0.0408) time: 0.1253 (0.1174) data: 0.0008 (0.0193) lr: 0.000009 max mem: 1273
INFO:maskrcnn_benchmark.trainer:eta: 3:15:29 iter: 90 loss: nan (nan) loss_classifier: nan (nan) loss_box_reg: nan (nan) loss_rec: nan (nan) loss_objectness: 0.0991 (0.0989) loss_rpn_box_reg: 0.0268 (0.0408) time: 0.1253 (0.1174) data: 0.0008 (0.0193) lr: 0.000009 max mem: 1273
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Traceback (most recent call last):
File "tools/train_net.py", line 202, in
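For anyone reading the log above: only some of the loss terms are NaN, while `loss_objectness` and `loss_rpn_box_reg` stay finite. A minimal sketch for picking out the non-finite terms is below. The helper name `non_finite_losses` is hypothetical and not part of maskrcnn_benchmark; the values are copied from the iter 80 line.

```python
import math


def non_finite_losses(loss_dict):
    """Return the names of loss terms whose value is NaN or Inf."""
    return [name for name, value in loss_dict.items() if not math.isfinite(value)]


# Values copied from the iter 80 log line above.
losses_iter_80 = {
    "loss_classifier": float("nan"),
    "loss_box_reg": float("nan"),
    "loss_rec": float("nan"),
    "loss_objectness": 0.0990,
    "loss_rpn_box_reg": 0.0297,
}

bad = non_finite_losses(losses_iter_80)
print(bad)
```

Since the two RPN losses are finite here, the NaN appears to originate after the RPN, which narrows where to look.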
I also found that many layers did not load their pre-trained parameters:
2020-11-12 22:02:01,110 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: res5_2_branch2c_bn_s mapped name: layer4.2.bn3.weight
2020-11-12 22:02:01,110 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: res5_2_branch2c_w mapped name: layer4.2.conv3.weight
2020-11-12 22:02:01,110 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: res_conv1_bn_b mapped name: bn1.bias
2020-11-12 22:02:01,110 maskrcnn_benchmark.utils.c2_model_loading INFO: C2 name: res_conv1_bn_s mapped name: bn1.weight
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn1.bias loaded from layer1.0.bn1.bias of shape (64,) torch.Size([64]) & torch.Size([64])
backbone.body.layer1.0.bn1.bias loaded from layer1.0.bn1.bias of shape (64,) torch.Size([64]) & torch.Size([64])
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn1.running_mean discard it...
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn1.running_mean discard it...
None
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn1.running_var discard it...
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn1.running_var discard it...
None
2020-11-12 22:02:01,121 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn1.weight loaded from layer1.0.bn1.weight of shape (64,) torch.Size([64]) & torch.Size([64])
backbone.body.layer1.0.bn1.weight loaded from layer1.0.bn1.weight of shape (64,) torch.Size([64]) & torch.Size([64])
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn2.bias loaded from layer1.0.bn2.bias of shape (64,) torch.Size([64]) & torch.Size([64])
backbone.body.layer1.0.bn2.bias loaded from layer1.0.bn2.bias of shape (64,) torch.Size([64]) & torch.Size([64])
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn2.running_mean discard it...
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn2.running_mean discard it...
None
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn2.running_var discard it...
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn2.running_var discard it...
None
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn2.weight loaded from layer1.0.bn2.weight of shape (64,) torch.Size([64]) & torch.Size([64])
backbone.body.layer1.0.bn2.weight loaded from layer1.0.bn2.weight of shape (64,) torch.Size([64]) & torch.Size([64])
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn3.bias loaded from layer1.0.bn3.bias of shape (256,) torch.Size([256]) & torch.Size([256])
backbone.body.layer1.0.bn3.bias loaded from layer1.0.bn3.bias of shape (256,) torch.Size([256]) & torch.Size([256])
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn3.running_mean discard it...
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn3.running_mean discard it...
None
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn3.running_var discard it...
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: We don't have key:backbone.body.layer1.0.bn3.running_var discard it...
None
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn3.weight loaded from layer1.0.bn3.weight of shape (256,) torch.Size([256]) & torch.Size([256])
backbone.body.layer1.0.bn3.weight loaded from layer1.0.bn3.weight of shape (256,) torch.Size([256]) & torch.Size([256])
2020-11-12 22:02:01,122 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.conv1.weight
......
Key backbone.body.layer3.5.bn3.running_var not loaded...
Key backbone.body.layer4.0.downsample.1.running_mean not loaded...
Key backbone.body.layer4.0.downsample.1.running_var not loaded...
Key backbone.body.layer4.0.bn1.running_mean not loaded...
Key backbone.body.layer4.0.bn1.running_var not loaded...
Key backbone.body.layer4.0.bn2.running_mean not loaded...
Key backbone.body.layer4.0.bn2.running_var not loaded...
Key backbone.body.layer4.0.bn3.running_mean not loaded...
Key backbone.body.layer4.0.bn3.running_var not loaded...
Key backbone.body.layer4.1.bn1.running_mean not loaded...
Key backbone.body.layer4.1.bn1.running_var not loaded...
Key backbone.body.layer4.1.bn2.running_mean not loaded...
Key backbone.body.layer4.1.bn2.running_var not loaded...
Key backbone.body.layer4.1.bn3.running_mean not loaded...
Key backbone.body.layer4.1.bn3.running_var not loaded...
Key backbone.body.layer4.2.bn1.running_mean not loaded...
Key backbone.body.layer4.2.bn1.running_var not loaded...
Key backbone.body.layer4.2.bn2.running_mean not loaded...
Key backbone.body.layer4.2.bn2.running_var not loaded...
Key backbone.body.layer4.2.bn3.running_mean not loaded...
Key backbone.body.layer4.2.bn3.running_var not loaded...
Key backbone.fpn.fpn_inner1.weight not loaded...
Key backbone.fpn.fpn_inner1.bias not loaded...
Key backbone.fpn.fpn_layer1.weight not loaded...
Key backbone.fpn.fpn_layer1.bias not loaded...
Key backbone.fpn.fpn_inner2.weight not loaded...
Key backbone.fpn.fpn_inner2.bias not loaded...
Key backbone.fpn.fpn_layer2.weight not loaded...
Key backbone.fpn.fpn_layer2.bias not loaded...
Key backbone.fpn.fpn_inner3.weight not loaded...
Key backbone.fpn.fpn_inner3.bias not loaded...
Key backbone.fpn.fpn_layer3.weight not loaded...
Key backbone.fpn.fpn_layer3.bias not loaded...
Key backbone.fpn.fpn_inner4.weight not loaded...
Key backbone.fpn.fpn_inner4.bias not loaded...
Key backbone.fpn.fpn_layer4.weight not loaded...
Key backbone.fpn.fpn_layer4.bias not loaded...
Key rpn.anchor_generator.cell_anchors.0 not loaded...
Key rpn.anchor_generator.cell_anchors.1 not loaded...
Key rpn.anchor_generator.cell_anchors.2 not loaded...
Key rpn.anchor_generator.cell_anchors.3 not loaded...
Key rpn.anchor_generator.cell_anchors.4 not loaded...
Key rpn.head.conv.weight not loaded...
Key rpn.head.conv.bias not loaded...
Key rpn.head.cls_logits.weight not loaded...
Key rpn.head.cls_logits.bias not loaded...
Key rpn.head.bbox_pred.weight not loaded...
Key rpn.head.bbox_pred.bias not loaded...
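A long "not loaded" list like this is easier to judge once it is grouped. Below is a rough sketch that parses the `Key ... not loaded...` lines and separates BatchNorm running statistics from everything else. The function name `group_unloaded_keys` and the grouping rule are my own assumptions, not anything from the repository. If the checkpoint is a backbone-only pre-trained model (an assumption about this setup), missing `backbone.fpn` and `rpn` entries would be expected, since those parts are trained from scratch.

```python
import re
from collections import defaultdict


def group_unloaded_keys(log_text):
    """Group keys reported as 'Key <name> not loaded...' by kind."""
    keys = re.findall(r"Key (\S+) not loaded\.\.\.", log_text)
    groups = defaultdict(list)
    for key in keys:
        if key.endswith((".running_mean", ".running_var")):
            # BatchNorm buffers, reported separately from learnable weights.
            groups["batchnorm_running_stats"].append(key)
        else:
            # Group by the first two dotted components, e.g. "backbone.fpn".
            groups[".".join(key.split(".")[:2])].append(key)
    return dict(groups)


# A few entries copied from the log above.
sample = (
    "Key backbone.body.layer4.2.bn3.running_var not loaded... "
    "Key backbone.fpn.fpn_inner1.weight not loaded... "
    "Key rpn.head.conv.weight not loaded..."
)

summary = group_unloaded_keys(sample)
for group, names in summary.items():
    print(group, len(names))
```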
@Shualite It seems these errors come from your data. Please check whether the data can be corrected.
@Shualite Also, we recommend using PyTorch 1.0.0; other versions may not work well with the code.
Yeah, I found that this problem was caused by the length of the alphabet file. Because I was using the default ALPHABET URL in the configuration file, the RNN output became T=2. After I modified the file path, the problem above was resolved.
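In case it helps others with the same symptom, here is a small sanity check to run before training, assuming the alphabet is a plain string of characters and the ground-truth transcriptions are available as a list of strings. `check_transcriptions`, the alphabet contents, and the sample words are all hypothetical; the repository's actual alphabet format may differ.

```python
def check_transcriptions(alphabet, transcriptions):
    """Map each transcription to the characters missing from the alphabet."""
    known = set(alphabet)
    problems = {}
    for text in transcriptions:
        missing = sorted(set(text) - known)
        if missing:
            problems[text] = missing
    return problems


# Hypothetical alphabet contents and ground-truth words, for illustration only.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
labels = ["exit", "Gate 12", "open"]

print("alphabet length:", len(alphabet))
problems = check_transcriptions(alphabet, labels)
print(problems)
```

An unexpectedly short alphabet, or labels containing characters outside it, would show up here on the CPU instead of as a device-side assert later.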
I found a new problem. When training the RCNN stage (the second-step regression), the returned GT is always zero.
maskrcnn_benchmark/modeling/roi_heads/rbox_head/loss.py: --line 312
ipdb> n
/media/tongji/data/fsy_scenetext/RRPN_plusplus/maskrcnn_benchmark/modeling/roi_heads/rbox_head/loss.py(316)__call__()
315 else:
316 box_loss = smooth_l1_loss(
317 box_regression_pos,
ipdb> l
311 size_average=False,
312 beta=1,
313 weight=high_dmasks_pos.float()[:, None]
314 )
315 else:
316 box_loss = smooth_l1_loss(
317 box_regression_pos,
318 regression_targets_pos,
319 size_average=False,
320 beta=1
321 )
ipdb> print(box_regression_pos)
tensor([[-0.0042, -0.0464, -0.0055, -0.0232, -0.0069],
[ 0.0325, -0.0438, -0.0127, -0.0039, 0.0378],
[ 0.0108, -0.0045, 0.0006, 0.0056, 0.0267]], device='cuda:0',
grad_fn=
This makes 'loss_box_reg' essentially zero:
2020-11-13 14:36:15,204 maskrcnn_benchmark.trainer INFO: eta: 4:32:18 iter: 10 loss: 4.5826 (4.6576) loss_classifier: 0.7105 (0.7096) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.7542 (3.8260) loss_objectness: 0.0995 (0.0994) loss_rpn_box_reg: 0.0113 (0.0226) time: 0.0939 (0.1634) data: 0.0006 (0.0700) lr: 0.000007 max mem: 1273
2020-11-13 14:36:16,071 maskrcnn_benchmark.trainer INFO: eta: 3:28:25 iter: 20 loss: 4.6975 (4.7066) loss_classifier: 0.7072 (0.7026) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.8854 (3.8822) loss_objectness: 0.0995 (0.0993) loss_rpn_box_reg: 0.0153 (0.0225) time: 0.0871 (0.1251) data: 0.0006 (0.0371) lr: 0.000007 max mem: 1274
2020-11-13 14:36:16,984 maskrcnn_benchmark.trainer INFO: eta: 3:09:37 iter: 30 loss: 4.6975 (4.6865) loss_classifier: 0.6683 (0.6874) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.8739 (3.8647) loss_objectness: 0.0993 (0.0991) loss_rpn_box_reg: 0.0172 (0.0354) time: 0.0876 (0.1138) data: 0.0007 (0.0267) lr: 0.000007 max mem: 1274
2020-11-13 14:36:17,891 maskrcnn_benchmark.trainer INFO: eta: 2:59:57 iter: 40 loss: 4.6243 (4.6463) loss_classifier: 0.6452 (0.6722) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.7979 (3.8336) loss_objectness: 0.0988 (0.0989) loss_rpn_box_reg: 0.0455 (0.0415) time: 0.0901 (0.1080) data: 0.0023 (0.0213) lr: 0.000008 max mem: 1274
2020-11-13 14:36:18,803 maskrcnn_benchmark.trainer INFO: eta: 2:54:20 iter: 50 loss: 4.4961 (4.6086) loss_classifier: 0.6044 (0.6558) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.7370 (3.8146) loss_objectness: 0.0990 (0.0990) loss_rpn_box_reg: 0.0328 (0.0392) time: 0.0900 (0.1047) data: 0.0023 (0.0181) lr: 0.000008 max mem: 1274
2020-11-13 14:36:19,715 maskrcnn_benchmark.trainer INFO: eta: 2:50:35 iter: 60 loss: 4.3902 (4.5683) loss_classifier: 0.5794 (0.6398) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.7128 (3.7914) loss_objectness: 0.0993 (0.0990) loss_rpn_box_reg: 0.0229 (0.0381) time: 0.0901 (0.1024) data: 0.0024 (0.0160) lr: 0.000008 max mem: 1274
2020-11-13 14:36:20,621 maskrcnn_benchmark.trainer INFO: eta: 2:47:46 iter: 70 loss: 4.2989 (4.5168) loss_classifier: 0.5423 (0.6246) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.6038 (3.7538) loss_objectness: 0.0990 (0.0990) loss_rpn_box_reg: 0.0259 (0.0393) time: 0.0903 (0.1007) data: 0.0026 (0.0145) lr: 0.000009 max mem: 1274
2020-11-13 14:36:21,518 maskrcnn_benchmark.trainer INFO: eta: 2:45:26 iter: 80 loss: 4.1664 (4.4592) loss_classifier: 0.5226 (0.6094) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.5213 (3.7108) loss_objectness: 0.0987 (0.0989) loss_rpn_box_reg: 0.0476 (0.0400) time: 0.0903 (0.0993) data: 0.0024 (0.0133) lr: 0.000009 max mem: 1274
2020-11-13 14:36:22,418 maskrcnn_benchmark.trainer INFO: eta: 2:43:42 iter: 90 loss: 3.9411 (4.3923) loss_classifier: 0.4838 (0.5933) loss_box_reg: 0.0000 (0.0000) loss_rec: 3.2998 (3.6614) loss_objectness: 0.0991 (0.0990) loss_rpn_box_reg: 0.0293 (0.0385) time: 0.0893 (0.0983) data: 0.0024 (0.0124) lr: 0.000009 max mem: 1274
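To get a feel for the magnitudes involved, here is a plain-Python sketch of the standard smooth L1 formula with `beta=1` and summation instead of averaging, matching the arguments visible in the debugger listing. This is a generic reimplementation for illustration, not the repository's `smooth_l1_loss`. The predictions are the last row of the printed `box_regression_pos`, and the all-zero targets are an assumption based on the description above.

```python
def smooth_l1(pred, target, beta=1.0, size_average=False):
    """Standard smooth L1: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        n = abs(p - t)
        total += 0.5 * n * n / beta if n < beta else n - 0.5 * beta
    return total / len(pred) if size_average else total


# Last row of box_regression_pos as printed in the ipdb session.
pred = [0.0108, -0.0045, 0.0006, 0.0056, 0.0267]
zero_targets = [0.0] * len(pred)  # assumed: the GT targets are all zero

loss = smooth_l1(pred, zero_targets)
print(f"{loss:.6f}")
```

With near-zero predictions and all-zero targets every term falls in the quadratic branch, so each row contributes only a very small amount to `loss_box_reg`.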
My computer configuration:
PyTorch version: 1.0.1
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
Nvidia driver version: 450.66
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Pillow (7.1.2)
2020-11-12 20:16:26,367 maskrcnn_benchmark INFO: Loaded configuration file configs/arpn_E2E/e2e_rrpn_R_50_C4_1x_train_AFPN_RT_LERB_Spotter.yaml.
error:
Database: ['IC15'] 861
2020-11-12 20:13:11,100 maskrcnn_benchmark.trainer INFO: Start training
2020-11-12 20:13:12,625 maskrcnn_benchmark.trainer INFO: eta: 4:13:50 iter: 10 loss: 0.8722 (nan) loss_classifier: 0.7357 (nan) loss_box_reg: 0.0000 (nan) loss_rec: 0.0047 (nan) loss_objectness: 0.0988 (0.0985) loss_rpn_box_reg: 0.0330 (0.0599) time: 0.0985 (0.1523) data: 0.0009 (0.0536) lr: 0.000007 max mem: 1632
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
2020-11-12 20:13:13,538 maskrcnn_benchmark.trainer INFO: eta: 3:23:03 iter: 20 loss: 0.8722 (nan) loss_classifier: 0.7357 (nan) loss_box_reg: 0.0000 (nan) loss_rec: 0.0047 (nan) loss_objectness: 0.0988 (0.0985) loss_rpn_box_reg: 0.0374 (0.0529) time: 0.0908 (0.1219) data: 0.0007 (0.0296) lr: 0.000007 max mem: 1632
INFO:maskrcnn_benchmark.trainer:eta: 3:23:03 iter: 20 loss: 0.8722 (nan) loss_classifier: 0.7357 (nan) loss_box_reg: 0.0000 (nan) loss_rec: 0.0047 (nan) loss_objectness: 0.0988 (0.0985) loss_rpn_box_reg: 0.0374 (0.0529) time: 0.0908 (0.1219) data: 0.0007 (0.0296) lr: 0.000007 max mem: 1632
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Traceback (most recent call last):
File "tools/train_net.py", line 202, in
main()
File "tools/train_net.py", line 195, in main
model = train(cfg, args.local_rank, args.distributed, args.resume, args.config_file)
File "tools/train_net.py", line 94, in train
config_file=config_file
File "/media/tongji/data/fsy_scenetext/RRPN_plusplus/maskrcnn_benchmark/engine/trainer.py", line 84, in do_train
optimizer.step()
File "/home/tongji/anaconda3/envs/rrpnpytorch/lib/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul(momentum).add_(1 - dampening, d_p)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [62,0,0], thread: [96,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [62,0,0], thread: [97,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [62,0,0], thread: [98,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
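A note on reading this traceback: CUDA kernels run asynchronously, so a device-side assert is often reported at a later, unrelated call (here `optimizer.step()`), not where the bad index was actually used. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialised makes kernel launches synchronous, so the Python traceback points at the failing operation. The sketch below also includes a CPU-side bounds check that mirrors the assertion condition; `check_indices` is a hypothetical helper, not part of PyTorch or the repository.

```python
import os

# Must be set before CUDA is initialised, e.g. at the very top of the
# training script or exported in the shell before launching it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_indices(indices, size):
    """Raise IndexError naming the first index outside [-size, size)."""
    for pos, idx in enumerate(indices):
        if not (-size <= idx < size):
            raise IndexError(
                f"index {idx} at position {pos} is out of bounds for size {size}"
            )


# Passes silently: every index satisfies -size <= index < size.
check_indices([0, 5, -1], size=6)
```

Running such a check on label or character indices before they reach the GPU gives a readable error instead of the assertion messages above.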