open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

multiple node inference error #8353

Closed rdj94 closed 2 years ago

rdj94 commented 2 years ago

Hello. First, I trained a Mask R-CNN model with the following config, and then I tested MMDetection's multi-node inference.

dataset_type = 'CocoDataset'
data_root = '/home/data/warsaw/'
CLASSES = ('person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
           'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
           'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog',
           'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe',
           'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
           'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
           'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
           'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
           'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
           'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
           'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop',
           'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
           'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
           'scissors', 'teddy bear', 'hair drier', 'toothbrush')
model = dict(
    type='MaskRCNN',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=80,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        mask_head=dict(
            type='FCNMaskHead',
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=80,
            loss_mask=dict(
                type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            mask_size=28,
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100,
            mask_thr_binary=0.5)))
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='/home/data/warsaw/train/train.json',
        img_prefix='/home/data/warsaw/train/images',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(
                type='Collect',
                keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='/home/data/warsaw/val/val.json',
        img_prefix='/home/data/warsaw/val/images',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='/home/data/warsaw/test/test.json',
        img_prefix='/home/data/warsaw/test/images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(metric=['bbox', 'segm'])
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
resume_from = None
workflow = [('train', 1)]
work_dir = './work_dirs/mask_warsaw_runtime'
auto_resume = False

When I set up the distributed environment and run tools/dist_test.sh on the master and slave nodes as shown below, I get the errors reproduced after each command.
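For context, tools/dist_test.sh is essentially a thin wrapper around torch.distributed.launch: it forwards NNODES, NODE_RANK, PORT and MASTER_ADDR, and turns its third positional argument into --nproc_per_node, so I pass 1 on the master (one GPU) and 2 on the slave (two GPUs) for a total world size of 3. Roughly (paraphrased as a sketch, not copied verbatim from the repo, so check the actual script in your checkout):

#!/usr/bin/env bash
# Rough sketch of tools/dist_test.sh (MMDetection 2.x).
CONFIG=$1
CHECKPOINT=$2
GPUS=$3                                  # becomes --nproc_per_node on this node
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

# Launch tools/test.py once per local GPU; extra args (e.g. --out) are passed through.
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}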

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_test.sh ./configs/_base_/mask_warsaw_runtime.py ./checkpoints/mask.pth 1 --out ./results.pkl
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : tools/test.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 192.168.0.3:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_v2alh_eb/none_bqwcu85w
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/usr/local/lib64/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=192.168.0.3
  master_port=29500
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_v2alh_eb/none_bqwcu85w/attempt_0/0/error.json
loading annotations into memory...
Done (t=0.73s)
creating index...
index created!
mca-System-Product-Name:404:404 [0] NCCL INFO Bootstrap : Using [0]enp0s31f6:192.168.0.3<0>
mca-System-Product-Name:404:404 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mca-System-Product-Name:404:404 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mca-System-Product-Name:404:404 [0] NCCL INFO NET/Socket : Using [0]enp0s31f6:192.168.0.3<0>
mca-System-Product-Name:404:404 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
mca-System-Product-Name:404:456 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC

mca-System-Product-Name:404:456 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order

mca-System-Product-Name:404:456 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 3, falling back to simple order

mca-System-Product-Name:404:456 [0] NCCL INFO Channel 00/02 :    0   1   2
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 01/02 :    0   1   2
mca-System-Product-Name:404:456 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
mca-System-Product-Name:404:456 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->2|2->0->-1/-1/-1
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 00 : 2[3b000] -> 0[1000] [receive] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 00 : 0[1000] -> 1[86000] [send] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 00 : 1[86000] -> 0[1000] [receive] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 01 : 2[3b000] -> 0[1000] [receive] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 01 : 0[1000] -> 1[86000] [send] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO Channel 01 : 0[1000] -> 2[3b000] [send] via NET/Socket/0
mca-System-Product-Name:404:456 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
mca-System-Product-Name:404:456 [0] NCCL INFO comm 0x7fe538002e10 rank 0 nranks 3 cudaDev 0 busId 1000 - Init COMPLETE
mca-System-Product-Name:404:404 [0] NCCL INFO Launch mode Parallel
load checkpoint from local path: ./checkpoints/mask.pth
[                                                  ] 0/10000, elapsed: 0s, ETA:/usr/local/lib64/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[>>>>>>>>>>>>>>>>>>>>>>>>] 10002/10000, 31.2 task/s, elapsed: 320s, ETA:     0s
mca-System-Product-Name:404:457 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
mca-System-Product-Name:404:457 [0] NCCL INFO transport/net_socket.cc:405 -> 2
mca-System-Product-Name:404:457 [0] NCCL INFO include/net.h:28 -> 2
mca-System-Product-Name:404:457 [0] NCCL INFO transport/net.cc:357 -> 2
mca-System-Product-Name:404:457 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 404) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.3
  master_port=29500
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_v2alh_eb/none_bqwcu85w/attempt_1/0/error.json

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_test.sh ./configs/_base_/mask_warsaw_runtime.py ./checkpoints/latest.pth 2 --out ./results.pkl
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : tools/test.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 192.168.0.3:29500
  rdzv_configs     : {'rank': 1, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_imupj98y/none_wmuo4loq
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/usr/local/lib64/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=192.168.0.3
  master_port=29500
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[1, 2]
  global_ranks=[1, 2]
  role_world_sizes=[3, 3]
  global_world_sizes=[3, 3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_imupj98y/none_wmuo4loq/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_imupj98y/none_wmuo4loq/attempt_0/1/error.json
loading annotations into memory...
loading annotations into memory...
Done (t=0.91s)
creating index...
Done (t=0.91s)
creating index...
index created!
index created!
mca-WS-C621E-SAGE-Series:852:852 [1] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.0.5<0>
mca-WS-C621E-SAGE-Series:853:853 [0] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.0.5<0>
mca-WS-C621E-SAGE-Series:852:852 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mca-WS-C621E-SAGE-Series:853:853 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mca-WS-C621E-SAGE-Series:852:852 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mca-WS-C621E-SAGE-Series:853:853 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mca-WS-C621E-SAGE-Series:853:853 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.0.5<0>
mca-WS-C621E-SAGE-Series:852:852 [1] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.0.5<0>
mca-WS-C621E-SAGE-Series:853:853 [0] NCCL INFO Using network Socket
mca-WS-C621E-SAGE-Series:852:852 [1] NCCL INFO Using network Socket
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC

mca-WS-C621E-SAGE-Series:853:902 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order

mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC

mca-WS-C621E-SAGE-Series:853:902 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 2, falling back to simple order

mca-WS-C621E-SAGE-Series:852:903 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order

mca-WS-C621E-SAGE-Series:852:903 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 2, falling back to simple order

mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Trees [0] -1/-1/-1->2->1|1->2->-1/-1/-1 [1] 0/-1/-1->2->1|1->2->0/-1/-1
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/64
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->-1|-1->1->2/-1/-1
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,ffff0000
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Channel 00 : 0[1000] -> 1[86000] [receive] via NET/Socket/0
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Channel 00 : 1[86000] -> 2[3b000] via direct shared memory
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Channel 00 : 2[3b000] -> 0[1000] [send] via NET/Socket/0
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Channel 00 : 2[3b000] -> 1[86000] via direct shared memory
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Channel 00 : 1[86000] -> 0[1000] [send] via NET/Socket/0
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Channel 01 : 0[1000] -> 1[86000] [receive] via NET/Socket/0
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO Channel 01 : 1[86000] -> 2[3b000] via direct shared memory
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Channel 01 : 2[3b000] -> 0[1000] [send] via NET/Socket/0
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Channel 01 : 0[1000] -> 2[3b000] [receive] via NET/Socket/0
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO Channel 01 : 2[3b000] -> 1[86000] via direct shared memory
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
mca-WS-C621E-SAGE-Series:852:903 [1] NCCL INFO comm 0x7f5454002e10 rank 1 nranks 3 cudaDev 1 busId 86000 - Init COMPLETE
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
mca-WS-C621E-SAGE-Series:853:902 [0] NCCL INFO comm 0x7fe1fc002e10 rank 2 nranks 3 cudaDev 0 busId 3b000 - Init COMPLETE
load checkpoint from local path: ./checkpoints/latest.pth
load checkpoint from local path: ./checkpoints/latest.pth
Traceback (most recent call last):
  File "tools/test.py", line 286, in <module>
    main()
  File "tools/test.py", line 257, in main
    or cfg.evaluation.get('gpu_collect', False))
  File "/home/project/mmdetection/mmdet/apis/test.py", line 109, in multi_gpu_test
Traceback (most recent call last):
  File "tools/test.py", line 286, in <module>
    main()
  File "tools/test.py", line 257, in main
    or cfg.evaluation.get('gpu_collect', False))
  File "/home/project/mmdetection/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    result = model(return_loss=False, rescale=True, **data)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/home/project/mmdetection/mmdet/models/detectors/base.py", line 174, in forward
    return old_func(*args, **kwargs)
  File "/home/project/mmdetection/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/project/mmdetection/mmdet/models/detectors/base.py", line 147, in forward_test
  File "/home/project/mmdetection/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/project/mmdetection/mmdet/models/detectors/two_stage.py", line 177, in simple_test
  File "/home/project/mmdetection/mmdet/models/detectors/two_stage.py", line 177, in simple_test
    x = self.extract_feat(img)
    x = self.extract_feat(img)
  File "/home/project/mmdetection/mmdet/models/detectors/two_stage.py", line 67, in extract_feat
  File "/home/project/mmdetection/mmdet/models/detectors/two_stage.py", line 67, in extract_feat
    x = self.backbone(img)
    x = self.backbone(img)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/project/mmdetection/mmdet/models/backbones/resnet.py", line 636, in forward
    return forward_call(*input, **kwargs)
  File "/home/project/mmdetection/mmdet/models/backbones/resnet.py", line 636, in forward
    x = self.conv1(x)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    x = self.conv1(x)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
    self.padding, self.dilation, self.groups)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking arugment for argument weight in method wrapper_cudnn_convolution)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 852) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=192.168.0.3
  master_port=29500
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[1, 2]
  global_ranks=[1, 2]
  role_world_sizes=[3, 3]
  global_world_sizes=[3, 3]

Is there anything I'm missing?

master env

sys.platform: linux
Python: 3.6.8 (default, Sep 10 2021, 09:13:53) [GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu111
OpenCV: 4.5.4
MMCV: 1.4.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.25.0+ca11860

slave env

sys.platform: linux
Python: 3.6.8 (default, Sep 10 2021, 09:13:53) [GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu111
OpenCV: 4.6.0-dev
MMCV: 1.5.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.25.0+56e42e7
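
Both environment reports above come from MMDetection's environment-collection helper; assuming it still lives at mmdet/utils/collect_env.py (as in my 2.25 checkouts), the same report can be regenerated on each node with:

python mmdet/utils/collect_env.py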
wdrink commented 2 years ago

I met the same error. I wonder whether anybody has fixed it?

lanqz7766 commented 2 years ago

Hello, I also have the same problem when I try to do inference on multiple GPUs. Did you find any solution to this problem?

miquel-espinosa commented 9 months ago

@rdj94 how did you solve the problem?