uber-research / UPSNet

UPSNet: A Unified Panoptic Segmentation Network

The model seems not to be training for a long while #63

Open Jacob-Jiah opened 5 years ago

Jacob-Jiah commented 5 years ago

Python 3.7, PyTorch 1.1, CUDA 9.0: I am trying to reproduce the training process on the Cityscapes dataset. When I start the training process, I receive a lot of warnings:

```
upsnet/../upsnet/models/resnet.py:285: UserWarning: unexpected key "fc.weight" in state_dict
  warnings.warn('unexpected key "{}" in state_dict'.format(name))
upsnet/../upsnet/models/resnet.py:285: UserWarning: unexpected key "fc.bias" in state_dict
  warnings.warn('unexpected key "{}" in state_dict'.format(name))
upsnet/../upsnet/models/resnet.py:299: UserWarning: missing keys in state_dict: "{'resnet_backbone.res4.layers.3.bn2.num_batches_tracked', 'resnet_backbone.res4.layers.0.bn3.num_batches_tracked', 'resnet_backbone.res5.layers.0.bn3.num_batches_tracked', 'resnet_backbone.res3.layers.1.bn1.num_batches_tracked', 'resnet_backbone.res3.layers.0.bn3.num_batches_tracked', 'fpn.fpn_p4.bias', 'mask_branch.mask_conv2.0.weight', 'resnet_backbone.res3.layers.3.bn2.num_batches_tracked', 'resnet_backbone.res2.layers.2.bn1.num_batches_tracked', 'resnet_backbone.res4.layers.0.bn1.num_batches_tracked', 'resnet_backbone.res2.layers.1.bn2.num_batches_tracked', 'resnet_backbone.res5.layers.1.bn2.num_batches_tracked', 'resnet_backbone.res2.layers.2.bn3.num_batches_tracked', 'resnet_backbone.res4.layers.2.bn3.num_batches_tracked', 'fpn.fpn_p3.bias', 'mask_branch.mask_conv4.0.weight', 'resnet_backbone.res4.layers.3.bn3.num_batches_tracked', 'resnet_backbone.res4.layers.2.bn1.num_batches_tracked', 'fpn.fpn_p5.weight', 'resnet_backbone.res2.layers.0.bn2.num_batches_tracked', 'fcn_head.fcn_subnet.conv.1.0.conv_offset.bias', 'fpn.fpn_p5_1x1.weight', 'resnet_backbone.res3.layers.1.bn3.num_batches_tracked', 'resnet_backbone.conv1.bn1.num_batches_tracked', 'resnet_backbone.res3.layers.3.bn1.num_batches_tracked', 'fpn.fpn_p5.bias', 'mask_branch.mask_deconv1.0.weight', 'resnet_backbone.res4.layers.5.bn1.num_batches_tracked', 'resnet_backbone.res5.layers.2.bn1.num_batches_tracked', 'resnet_backbone.res3.layers.2.bn3.num_batches_tracked', 'resnet_backbone.res3.layers.0.bn1.num_batches_tracked', 'fcn_head.fcn_subnet.conv.0.0.conv_offset.weight', 'fcn_head.fcn_subnet.conv.0.0.conv.weight', 'fpn.fpn_p3.weight', 'resnet_backbone.res3.layers.0.bn2.num_batches_tracked', 'resnet_backbone.res2.layers.1.bn3.num_batches_tracked', 'fpn.fpn_p3_1x1.bias', 'mask_branch.mask_deconv1.0.bias', 'fpn.fpn_p4.weight', 'fpn.fpn_p2_1x1.bias', 'fpn.fpn_p4_1x1.weight', 'resnet_backbone.res3.layers.0.downsample.1.num_batches_tracked', 'rpn.cls_score.bias', 'rcnn.fc6.0.bias', 'mask_branch.mask_conv3.0.weight', 'resnet_backbone.res2.layers.0.bn3.num_batches_tracked', 'rpn.conv_proposal.0.bias', 'resnet_backbone.res4.layers.4.bn3.num_batches_tracked', 'fcn_head.fcn_subnet.conv.0.0.conv.bias', 'resnet_backbone.res2.layers.0.bn1.num_batches_tracked', 'rcnn.fc7.0.weight', 'rcnn.bbox_pred.bias', 'resnet_backbone.res4.layers.0.downsample.1.num_batches_tracked', 'fcn_head.fcn_subnet.conv.1.0.conv.bias', 'resnet_backbone.res5.layers.0.bn1.num_batches_tracked', 'mask_branch.mask_conv2.0.bias', 'mask_branch.mask_conv1.0.bias', 'resnet_backbone.res4.layers.5.bn2.num_batches_tracked', 'fcn_head.fcn_subnet.conv.1.0.conv.weight', 'resnet_backbone.res4.layers.4.bn1.num_batches_tracked', 'rpn.bbox_pred.weight', 'mask_branch.mask_score.weight', 'resnet_backbone.res2.layers.1.bn1.num_batches_tracked', 'resnet_backbone.res4.layers.0.bn2.num_batches_tracked', 'rpn.conv_proposal.0.weight', 'rcnn.fc7.0.bias', 'resnet_backbone.res5.layers.1.bn3.num_batches_tracked', 'rcnn.fc6.0.weight', 'resnet_backbone.res3.layers.3.bn3.num_batches_tracked', 'resnet_backbone.res4.layers.3.bn1.num_batches_tracked', 'resnet_backbone.res4.layers.1.bn2.num_batches_tracked', 'mask_branch.mask_conv1.0.weight', 'fcn_head.fcn_subnet.conv.0.0.conv_offset.bias', 'resnet_backbone.res2.layers.0.downsample.1.num_batches_tracked', 'resnet_backbone.res3.layers.2.bn1.num_batches_tracked', 'mask_branch.mask_conv3.0.bias', 'fpn.fpn_p2.bias', 'mask_branch.mask_score.bias', 'resnet_backbone.res4.layers.1.bn3.num_batches_tracked', 'fpn.fpn_p2_1x1.weight', 'fpn.fpn_p4_1x1.bias', 'resnet_backbone.res3.layers.2.bn2.num_batches_tracked', 'mask_branch.mask_conv4.0.bias', 'fcn_head.score.bias', 'resnet_backbone.res5.layers.0.downsample.1.num_batches_tracked', 'resnet_backbone.res5.layers.2.bn3.num_batches_tracked', 'fpn.fpn_p2.weight', 'resnet_backbone.res3.layers.1.bn2.num_batches_tracked', 'resnet_backbone.res4.layers.1.bn1.num_batches_tracked', 'resnet_backbone.res5.layers.0.bn2.num_batches_tracked', 'resnet_backbone.res4.layers.5.bn3.num_batches_tracked', 'resnet_backbone.res4.layers.2.bn2.num_batches_tracked', 'resnet_backbone.res2.layers.2.bn2.num_batches_tracked', 'resnet_backbone.res4.layers.4.bn2.num_batches_tracked', 'rcnn.bbox_pred.weight', 'fcn_head.score.weight', 'fcn_head.fcn_subnet.conv.1.0.conv_offset.weight', 'rpn.bbox_pred.bias', 'fpn.fpn_p3_1x1.weight', 'rcnn.cls_score.weight', 'resnet_backbone.res5.layers.1.bn1.num_batches_tracked', 'rcnn.cls_score.bias', 'fpn.fpn_p5_1x1.bias', 'resnet_backbone.res5.layers.2.bn2.num_batches_tracked', 'rpn.cls_score.weight'}"
  warnings.warn('missing keys in state_dict: "{}"'.format(missing))
/usr/local/lib/python3.7/site-packages/torch/nn/functional.py:1386: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/usr/local/lib/python3.7/site-packages/torch/nn/functional.py:2457: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
```

And the training process does not move at all (the TensorBoard event file stays empty the whole time). What can I do? Thanks so much!
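
A side note on the warnings themselves (an observation added here, not from the thread): they usually just reflect initializing from an ImageNet classification checkpoint. The checkpoint's classifier ('fc.weight', 'fc.bias') goes unused, the newly added heads (fpn, rpn, rcnn, mask_branch, fcn_head) and the num_batches_tracked buffers have nothing to load from, and the sigmoid/upsample messages are plain deprecation notices, so they do not by themselves explain a hang. A minimal sketch of the same kind of non-strict load, using torchvision models as stand-ins rather than UPSNet's own loading code:

```python
import torch.nn as nn
import torchvision

# Hedged sketch, not UPSNet's actual loading code: a model that keeps the
# ResNet-50 layer names but drops the ImageNet classifier and adds a new,
# differently named head. A non-strict load of an ImageNet-style state_dict
# then reports 'fc.weight'/'fc.bias' as unexpected (present in the checkpoint
# but unused) and the new head's parameters as missing (nothing pretrained
# exists for them): the same harmless bookkeeping as the warnings above.

class ResNetWithSegHead(torchvision.models.ResNet):
    def __init__(self, num_classes=19):
        super().__init__(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3])
        del self.fc                                       # classifier is not needed
        self.seg_score = nn.Conv2d(2048, num_classes, 1)  # hypothetical new head
        # forward() is not exercised here; only state_dict bookkeeping is shown


model = ResNetWithSegHead()
# stand-in for an ImageNet checkpoint: a plain resnet50 state_dict has the same key layout
checkpoint = torchvision.models.resnet50().state_dict()

result = model.load_state_dict(checkpoint, strict=False)
print("missing keys:", result.missing_keys)        # ['seg_score.weight', 'seg_score.bias']
print("unexpected keys:", result.unexpected_keys)  # ['fc.weight', 'fc.bias']
```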

Jacob-Jiah commented 5 years ago

With only one GPU the code works perfectly, but with more than one GPU it always gets stuck. I have no clue why this happens, because other code trains fine with multiple GPUs on my server.

Jacob-Jiah commented 5 years ago

The problem seems to be at line 284, loss.backward(). As soon as this line runs, all GPUs immediately jump to 100% utilization and get stuck.
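
One way to narrow this down (a suggestion added here, not something tried in the thread): run a tiny DistributedDataParallel job on the same machine, outside UPSNet. If the sketch below, with arbitrary model and tensor sizes, also stalls at loss.backward() with every GPU at 100%, the problem is in the NCCL / multi-GPU setup rather than in the UPSNet code; if it completes, the hang is more likely specific to how UPSNet sets up distributed training.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def run(rank, world_size):
    # One process per GPU, rendezvous over localhost.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(32, 2).cuda(rank)
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    x = torch.randn(8, 32, device=f"cuda:{rank}")
    loss = ddp_model(x).sum()
    loss.backward()  # if this also stalls with all GPUs at 100%, suspect the NCCL/multi-GPU setup
    print(f"rank {rank}: backward finished, grad norm = {model.weight.grad.norm().item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(run, args=(n_gpus,), nprocs=n_gpus)
```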

YuwenXiong commented 5 years ago

Please make sure you are not using horovod if you haven't set it up correctly.

ChengJianjia commented 5 years ago

Have you solved this problem? I have the same problem.

AMzhanghan commented 4 years ago

> Please make sure you are not using horovod if you haven't set it up correctly.

I set use_horovod: false, but I still get the same problem!
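
If it helps to double-check (a sketch added here; the config path and key nesting are hypothetical and may not match UPSNet's actual config layout): confirm that the YAML the job loads really carries use_horovod: false, and whether horovod is importable in the environment at all.

```python
# Hedged sanity check; the config path and the key nesting are hypothetical.
import yaml

cfg_path = "upsnet/experiments/your_experiment.yaml"  # hypothetical path, use your own config
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# look for a 'use_horovod' flag either at the top level or under a 'train' section
flag = cfg.get("use_horovod", cfg.get("train", {}).get("use_horovod"))
print("use_horovod in config:", flag)

try:
    import horovod.torch  # noqa: F401  (only matters if the job is launched via horovodrun/mpirun)
    print("horovod importable: True")
except ImportError:
    print("horovod importable: False")
```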

AMzhanghan commented 4 years ago

> Have you solved this problem? I have the same problem.

Have you solved it?