open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0
5.07k stars 1.51k forks

[Bug] TPVFormer training waits a long time with no log output #3003

Open zkailinzhang opened 3 weeks ago

zkailinzhang commented 3 weeks ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

q

Reproduces the problem - code sample

    type='LoadPointsFromFile',
    use_dim=3),
dict(
    seg_3d_dtype='np.uint8',
    type='LoadAnnotations3D',
    with_attr_label=False,
    with_bbox_3d=False,
    with_label_3d=False,
    with_seg_3d=True),
dict(type='SegLabelMapping'),
dict(
    keys=[
        'img',
        'points',
        'pts_semantic_mask',
    ],
    meta_keys=[
        'lidar2img',
    ],
    type='Pack3DDetInputs'),

]
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='Det3DLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = './work_dirs/tpvformer_8xb1-2x_nus-seg'

/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments feedforward_channels in BaseTransformerLayer has been deprecated, now you should set feedforward_channels and other FFN related arguments to a dict named ffn_cfgs.
  warnings.warn(
/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments ffn_dropout in BaseTransformerLayer has been deprecated, now you should set ffn_drop and other FFN related arguments to a dict named ffn_cfgs.
  warnings.warn(
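The two warnings above ask for FFN-related arguments to be grouped under a dict named ffn_cfgs. A minimal sketch of that regrouping on a plain config dict is shown here. The helper name `move_ffn_args`, the layer type `ExampleLayer`, and the numeric values are hypothetical, and the mapping covers only the two arguments named in the warnings (feedforward_channels keeps its name, ffn_dropout becomes ffn_drop). These are deprecation warnings, so they are probably unrelated to the hang itself.

```python
# Hypothetical helper: moves deprecated top-level FFN arguments of a
# transformer-layer config into a nested ``ffn_cfgs`` dict, as the
# UserWarning text suggests. Only the two names from the warnings are handled.
DEPRECATED_TO_NEW = {
    'feedforward_channels': 'feedforward_channels',
    'ffn_dropout': 'ffn_drop',
}


def move_ffn_args(layer_cfg):
    """Return a copy of ``layer_cfg`` with FFN arguments under ``ffn_cfgs``."""
    cfg = dict(layer_cfg)                      # shallow copy, input untouched
    ffn_cfgs = dict(cfg.get('ffn_cfgs', {}))   # keep any existing entries
    for old_name, new_name in DEPRECATED_TO_NEW.items():
        if old_name in cfg:
            ffn_cfgs[new_name] = cfg.pop(old_name)
    if ffn_cfgs:
        cfg['ffn_cfgs'] = ffn_cfgs
    return cfg


# Illustrative values only; they are not taken from the TPVFormer config.
old_cfg = dict(type='ExampleLayer', feedforward_channels=1024, ffn_dropout=0.1)
new_cfg = move_ffn_args(old_cfg)
print(new_cfg)
```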

Reproduces the problem - command or script

bash tools/dist_train.sh projects/TPVFormer/configs/tpvformer_8xb1-2x_nus-seg.py 2

Reproduces the problem - error message

    type='LoadPointsFromFile',
    use_dim=3),
dict(
    seg_3d_dtype='np.uint8',
    type='LoadAnnotations3D',
    with_attr_label=False,
    with_bbox_3d=False,
    with_label_3d=False,
    with_seg_3d=True),
dict(type='SegLabelMapping'),
dict(
    keys=[
        'img',
        'points',
        'pts_semantic_mask',
    ],
    meta_keys=[
        'lidar2img',
    ],
    type='Pack3DDetInputs'),

]
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    name='visualizer',
    type='Det3DLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = './work_dirs/tpvformer_8xb1-2x_nus-seg'

/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments feedforward_channels in BaseTransformerLayer has been deprecated, now you should set feedforward_channels and other FFN related arguments to a dict named ffn_cfgs.
  warnings.warn(
/home/zkl/code/det3d_demo/mmdetection3d/projects/TPVFormer/tpvformer/tpvformer_layer.py:69: UserWarning: The arguments ffn_dropout in BaseTransformerLayer has been deprecated, now you should set ffn_drop and other FFN related arguments to a dict named ffn_cfgs.
  warnings.warn(

Additional information

q

zkailinzhang commented 3 weeks ago

The output above is from multi-GPU training, which hung. Switching to single-GPU training also hangs.

07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:lr=2e-05
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:weight_decay=0.01
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv2.conv_offset.bias:lr_mult=0.1
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn2.weight is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn2.bias is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:lr=2e-05
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:weight_decay=0.01
07/04 16:52:40 - mmengine - INFO - paramwise_options -- backbone.layer4.2.conv3.weight:lr_mult=0.1
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn3.weight is skipped since its requires_grad=False
07/04 16:52:40 - mmengine - WARNING - backbone.layer4.2.bn3.bias is skipped since its requires_grad=False
/home/zkl/code/det3d_demo/mmdetection3d/mmdet3d/evaluation/functional/kitti_utils/eval.py:10: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def get_thresholds(scores: np.ndarray, num_gt, num_sample_pts=41):
07/04 16:52:57 - mmengine - WARNING - The prefix is not set in metric class SegMetric.
07/04 16:52:59 - mmengine - INFO - load backbone. in model from: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
Loads checkpoint by local backend from path: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
07/04 16:52:59 - mmengine - INFO - load neck. in model from: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
Loads checkpoint by local backend from path: checkpoints/tpvformer_pretrained_fcos3d_r101_dcn.pth
07/04 16:52:59 - mmengine - WARNING - The model and loaded state dict do not match exactly

size mismatch for lateral_convs.0.conv.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 512, 1, 1]).
size mismatch for lateral_convs.0.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for lateral_convs.1.conv.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 1024, 1, 1]).
size mismatch for lateral_convs.1.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for lateral_convs.2.conv.weight: copying a param with shape torch.Size([256, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 2048, 1, 1]).
size mismatch for lateral_convs.2.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.0.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.0.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.1.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.1.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.2.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.2.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for fpn_convs.3.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
size mismatch for fpn_convs.3.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
unexpected key in source state_dict: fpn_convs.4.conv.weight, fpn_convs.4.conv.bias
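The size-mismatch lines above all follow one pattern: the checkpoint's neck parameters have 256 output channels while the current model expects 128, and the checkpoint carries an extra fpn_convs.4 layer. A minimal sketch of how such a report can be produced by comparing parameter shapes is given here. The function name `diff_shapes` is hypothetical, shapes are written as plain tuples so the sketch runs without PyTorch, and the shape given for the unexpected key is illustrative because the log does not show it.

```python
# Hypothetical helper that mimics the kind of report printed when a
# checkpoint and a model disagree. Shapes are plain tuples; with real
# tensors one would use ``tuple(tensor.shape)`` instead.
def diff_shapes(checkpoint_shapes, model_shapes):
    """Return (mismatched, unexpected, missing) between two shape dicts."""
    mismatched = {
        name: (shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and shape != model_shapes[name]
    }
    # Keys only in the checkpoint ("unexpected key in source state_dict").
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    # Keys only in the model (none are reported in the log above).
    missing = sorted(set(model_shapes) - set(checkpoint_shapes))
    return mismatched, unexpected, missing


# Two entries copied from the log, plus the unexpected key.
checkpoint_shapes = {
    'lateral_convs.0.conv.weight': (256, 512, 1, 1),
    'lateral_convs.0.conv.bias': (256,),
    'fpn_convs.4.conv.weight': (256, 256, 3, 3),  # illustrative shape
}
model_shapes = {
    'lateral_convs.0.conv.weight': (128, 512, 1, 1),
    'lateral_convs.0.conv.bias': (128,),
}

mismatched, unexpected, missing = diff_shapes(checkpoint_shapes, model_shapes)
for name, (ckpt_shape, model_shape) in mismatched.items():
    print(f'size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}')
print('unexpected keys:', unexpected)
```

Parameters reported this way are typically not loaded from the checkpoint, so the affected neck layers would keep their initial weights. That is worth knowing when judging training quality, but it is not by itself an explanation for the long wait.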

07/04 16:52:59 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
07/04 16:52:59 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
07/04 16:52:59 - mmengine - INFO - Checkpoints will be saved to /home/zkl/code/det3d_demo/mmdetection3d/work_dirs/tpvformer_8xb1-2x_nus-seg.

zkailinzhang commented 3 weeks ago

However, GPU memory usage keeps changing. [image] [image]

zkailinzhang commented 3 weeks ago

The single-GPU training log has appeared now. [image] I'll let it run overnight and try multi-GPU training tomorrow.
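Since the single-GPU log did eventually show up, one way to tell a slow start from a real hang is to log every iteration while debugging. A minimal sketch follows, assuming the config uses the usual MMEngine `default_hooks` layout with a `LoggerHook`. The interval this config actually uses is not shown in the thread, so the override is an assumption and not a confirmed fix.

```python
# Config override sketch (assumption: the base config defines default_hooks
# in the usual MMEngine style). Logging every iteration makes a slow first
# iteration visible in the log instead of looking like a hang.
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
)
```

If the first log line then appears after a long but finite wait, the job is most likely slow and not stuck. If nothing is printed at all in the multi-GPU run, the problem is more likely in the distributed setup.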