open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2380846) #3106

Open ShubhamAbhayDeshpande opened 1 year ago

ShubhamAbhayDeshpande commented 1 year ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug I am using mmsegmentation with the Freiburg Forest dataset. I wrote my own basic config files for this dataset, following the documentation on adding a custom dataset. The command I used for training is given below.

I get the following error when I try to train on this dataset. I am concerned both about the error itself and about the warnings that appear before it.

Can someone please help me with this?

Reproduction

  1. What command or script did you run?

    CUDA_VISIBLE_DEVICES=1,2,3 sh tools/dist_train.sh configs/deeplabv3/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400.py 3 
  2. I wrote the dataset config file myself with reference to the documentation; it is almost the same as the config file for the Cityscapes dataset.

  3. I am using the Freiburg Forest dataset and have taken care to match it to the format of the Cityscapes dataset (dataset registration is sketched below this list).
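For reference, registering a custom dataset in MMSegmentation 1.x follows the pattern below. This is only a minimal sketch of what the custom-dataset tutorial describes; the class names, palette values and file suffixes are placeholders rather than the exact values I used, and the class additionally has to be imported in mmseg/datasets/__init__.py so the registry can resolve the FreiburgForestDataset type referenced in the config.

    # mmseg/datasets/freiburg_forest.py -- minimal sketch with placeholder names/palette
    from mmseg.registry import DATASETS
    from mmseg.datasets.basesegdataset import BaseSegDataset


    @DATASETS.register_module()
    class FreiburgForestDataset(BaseSegDataset):
        """Freiburg Forest dataset (6 classes, matching num_classes=6 in the config)."""

        METAINFO = dict(
            classes=('void', 'road', 'grass', 'vegetation', 'sky', 'obstacle'),
            palette=[[0, 0, 0], [170, 170, 170], [0, 255, 0],
                     [102, 102, 51], [0, 120, 255], [0, 60, 0]])

        def __init__(self, img_suffix='.jpg', seg_map_suffix='.png', **kwargs):
            # image and mask suffixes are placeholders; they must match the files
            # under imgs/train and masks/train/gray_mask
            super().__init__(
                img_suffix=img_suffix, seg_map_suffix=seg_map_suffix, **kwargs)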

Environment

The details of the environment I am using are given below.

sys.platform: linux
Python: 3.8.16 (default, Mar  2 2023, 03:21:46) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
PyTorch: 1.13.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.14.0
OpenCV: 4.7.0
MMEngine: 0.7.3
MMSegmentation: 1.0.0+e64548f

Error traceback

/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
config file path:  configs/deeplabv3/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400.py
config file path:  configs/deeplabv3/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400.py
config file path:  configs/deeplabv3/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400.py
config launcher  pytorch
default runner
config launcher  pytorch
default runner
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
config launcher  pytorch
default runner
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
06/14 22:36:00 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.8.16 (default, Mar  2 2023, 03:21:46) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 157826940
    GPU 0,1,2: NVIDIA GeForce RTX 2080 Ti
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.7, V11.7.99
    GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
    PyTorch: 1.13.0
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.14.0
    OpenCV: 4.7.0
    MMEngine: 0.7.3

Runtime environment:
    cudnn_benchmark: True
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 3
------------------------------------------------------------

06/14 22:36:01 - mmengine - INFO - Config:
norm_cfg = dict(type='SyncBN', requires_grad=True)
data_preprocessor = dict(
    type='SegDataPreProcessor',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    bgr_to_rgb=False,
    pad_val=0,
    seg_pad_val=255,
    size=(256, 256))
model = dict(
    type='EncoderDecoder',
    data_preprocessor=dict(
        type='SegDataPreProcessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=False,
        pad_val=0,
        seg_pad_val=255,
        size=(256, 256)),
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='ASPPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dilations=(1, 12, 24, 36),
        dropout_ratio=0.1,
        num_classes=6,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=6,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))
dataset_type = 'FreiburgForestDataset'
data_root = '/home/deshpand/noadsm/datasets/freiburg_forest_mmsegmentation/freiburg_forest/'
crop_size = (256, 256)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='RandomCrop', crop_size=(256, 256), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='PackSegInputs')
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs')
]
train_dataloader = dict(
    batch_size=2,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='InfiniteSampler', shuffle=True),
    dataset=dict(
        type='FreiburgForestDataset',
        data_root=
        '/home/deshpand/noadsm/datasets/freiburg_forest_mmsegmentation/freiburg_forest/',
        data_prefix=dict(
            img_path='imgs/train', seg_map_path='masks/train/gray_mask'),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='RandomCrop', crop_size=(256, 256), cat_max_ratio=0.75),
            dict(type='RandomFlip', prob=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(type='PackSegInputs')
        ]))
val_dataloader = dict(
    batch_size=2,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='FreiburgForestDataset',
        data_root=
        '/home/deshpand/noadsm/datasets/freiburg_forest_mmsegmentation/freiburg_forest/',
        data_prefix=dict(
            img_path='imgs/test', seg_map_path='masks/test/gray_mask'),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='PackSegInputs')
        ]))
test_dataloader = dict(
    batch_size=2,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='FreiburgForestDataset',
        data_root=
        '/home/deshpand/noadsm/datasets/freiburg_forest_mmsegmentation/freiburg_forest/',
        data_prefix=dict(
            img_path='imgs/test', seg_map_path='masks/test/gray_mask'),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='PackSegInputs')
        ]))
val_evaluator = dict(type='IoUMetric', iou_metrics=['mIoU'])
test_evaluator = dict(type='IoUMetric', iou_metrics=['mIoU'])
default_scope = 'mmseg'
env_cfg = dict(
    cudnn_benchmark=True,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='SegLocalVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_processor = dict(by_epoch=False)
log_level = 'INFO'
load_from = None
resume = False
tta_model = dict(type='SegTTAModel')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005),
    clip_grad=None)
param_scheduler = [
    dict(
        type='PolyLR',
        eta_min=0.0001,
        power=0.9,
        begin=0,
        end=40000,
        by_epoch=False)
]
train_cfg = dict(type='IterBasedTrainLoop', max_iters=40000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50, log_metric_by_epoch=False),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', by_epoch=False, interval=4000),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='SegVisualizationHook'))
launcher = 'pytorch'
work_dir = './work_dirs/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400'

/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/builder.py:36: UserWarning: ``build_loss`` would be deprecated soon, please use ``mmseg.registry.MODELS.build()`` 
  warnings.warn('``build_loss`` would be deprecated soon, please use '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
  warnings.warn(
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/builder.py:36: UserWarning: ``build_loss`` would be deprecated soon, please use ``mmseg.registry.MODELS.build()`` 
  warnings.warn('``build_loss`` would be deprecated soon, please use '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
  warnings.warn(
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/builder.py:36: UserWarning: ``build_loss`` would be deprecated soon, please use ``mmseg.registry.MODELS.build()`` 
  warnings.warn('``build_loss`` would be deprecated soon, please use '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:235: UserWarning: Default ``avg_non_ignore`` is False, if you would like to ignore the certain label and average loss over non-ignore labels, which is the same with PyTorch official cross_entropy, set ``avg_non_ignore=True``.
  warnings.warn(
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/engine/hooks/visualization_hook.py:61: UserWarning: The draw is False, it means that the hook for visualization will not take effect. The results will NOT be visualized or stored.
  warnings.warn('The draw is False, it means that the '
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/engine/hooks/visualization_hook.py:61: UserWarning: The draw is False, it means that the hook for visualization will not take effect. The results will NOT be visualized or stored.
  warnings.warn('The draw is False, it means that the '
06/14 22:36:03 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) RuntimeInfoHook                    
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
before_train:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) DistSamplerSeedHook                
 -------------------- 
before_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_train_iter:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(NORMAL      ) SegVisualizationHook               
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) IterTimerHook                      
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_val_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_val_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_val_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) SegVisualizationHook               
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_val_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
(LOW         ) ParamSchedulerHook                 
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
after_train:
(VERY_LOW    ) CheckpointHook                     
 -------------------- 
before_test_epoch:
(NORMAL      ) IterTimerHook                      
 -------------------- 
before_test_iter:
(NORMAL      ) IterTimerHook                      
 -------------------- 
after_test_iter:
(NORMAL      ) IterTimerHook                      
(NORMAL      ) SegVisualizationHook               
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_test_epoch:
(VERY_HIGH   ) RuntimeInfoHook                    
(NORMAL      ) IterTimerHook                      
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
after_run:
(BELOW_NORMAL) LoggerHook                         
 -------------------- 
/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/engine/hooks/visualization_hook.py:61: UserWarning: The draw is False, it means that the hook for visualization will not take effect. The results will NOT be visualized or stored.
  warnings.warn('The draw is False, it means that the '
loop_cfg:  {'type': 'IterBasedTrainLoop', 'max_iters': 40000, 'val_interval': 4000}
loop_cfg:  {'type': 'IterBasedTrainLoop', 'max_iters': 40000, 'val_interval': 4000}
loop:  <mmengine.runner.loops.IterBasedTrainLoop object at 0x7f8a52e42310>
type is in the loop_cfg, so IterationBasedLoop executed. Check the dataloader.
loop:  <mmengine.runner.loops.IterBasedTrainLoop object at 0x7febd2102310>
type is in the loop_cfg, so IterationBasedLoop executed. Check the dataloader.
loop_cfg:  {'type': 'IterBasedTrainLoop', 'max_iters': 40000, 'val_interval': 4000}
loop:  <mmengine.runner.loops.IterBasedTrainLoop object at 0x7f0d92b58310>
type is in the loop_cfg, so IterationBasedLoop executed. Check the dataloader.
06/14 22:36:04 - mmengine - WARNING - The prefix is not set in metric class IoUMetric.
06/14 22:36:05 - mmengine - INFO - load model from: open-mmlab://resnet50_v1c
06/14 22:36:05 - mmengine - INFO - Loads checkpoint by openmmlab backend from path: open-mmlab://resnet50_v1c
06/14 22:36:05 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

06/14 22:36:05 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
06/14 22:36:05 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
06/14 22:36:05 - mmengine - INFO - Checkpoints will be saved to /home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/work_dirs/deeplab_r50-d8_4xb2-40k_freiburgforest-800x400.
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
Traceback (most recent call last):
  File "tools/train.py", line 115, in <module>
    main()
  File "tools/train.py", line 111, in main
    runner.train()
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1733, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 278, in run
    self.run_iter(data_batch)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 301, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/base.py", line 94, in forward
Traceback (most recent call last):
  File "tools/train.py", line 115, in <module>
    return self.loss(inputs, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 177, in loss
    main()
  File "tools/train.py", line 111, in main
    runner.train()
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1733, in train
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 138, in _decode_head_forward_train
    loss_decode = self.decode_head.loss(inputs, data_samples,
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 262, in loss
    model = self.train_loop.run()  # type: ignore
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 278, in run
    self.run_iter(data_batch)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 301, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self.loss_by_feat(seg_logits, batch_data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 336, in loss_by_feat
    losses = self._run_forward(data, mode='loss')
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    loss['acc_seg'] = accuracy(
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    results = self(**data, mode=mode)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/base.py", line 94, in forward
    return self.loss(inputs, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 177, in loss
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 138, in _decode_head_forward_train
    loss_decode = self.decode_head.loss(inputs, data_samples,
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 262, in loss
    losses = self.loss_by_feat(seg_logits, batch_data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 336, in loss_by_feat
    loss['acc_seg'] = accuracy(
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "tools/train.py", line 115, in <module>
    main()
  File "tools/train.py", line 111, in main
    runner.train()
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1733, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 278, in run
    self.run_iter(data_batch)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 301, in run_iter
    outputs = self.runner.model.train_step(
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/base.py", line 94, in forward
    return self.loss(inputs, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 177, in loss
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 138, in _decode_head_forward_train
    loss_decode = self.decode_head.loss(inputs, data_samples,
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 262, in loss
    losses = self.loss_by_feat(seg_logits, batch_data_samples)
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 336, in loss_by_feat
    loss['acc_seg'] = accuracy(
  File "/home/deshpand/Thesis/semantic_segmentation_network/mmseg_new/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2385922) of binary: /home/deshpand/anaconda3/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/deshpand/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
tools/train.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-14_22:36:10
  host      : neptun.informatik.uni-kl.de
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2385923)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2385923
[2]:
  time      : 2023-06-14_22:36:10
  host      : neptun.informatik.uni-kl.de
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 2385924)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2385924
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-14_22:36:10
  host      : neptun.informatik.uni-kl.de
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 2385922)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2385922
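
Since all three ranks crash inside accuracy() at `correct[:, target != ignore_index]`, one check I can run on my side is whether any ground-truth mask contains label values outside [0, num_classes) other than the ignore index 255. A small sanity-check script for this (a hypothetical helper, not part of the repository; the mask directory is taken from my config):

    # check_masks.py -- hypothetical helper, not part of mmsegmentation.
    # Reports masks containing label values >= num_classes that are not the ignore index.
    import glob
    import os

    import numpy as np
    from PIL import Image

    NUM_CLASSES = 6     # num_classes in the decode/auxiliary heads
    IGNORE_INDEX = 255  # default ignore index / seg_pad_val in the config
    MASK_DIR = ('/home/deshpand/noadsm/datasets/freiburg_forest_mmsegmentation/'
                'freiburg_forest/masks/train/gray_mask')

    for path in sorted(glob.glob(os.path.join(MASK_DIR, '*.png'))):
        values = np.unique(np.array(Image.open(path)))
        bad = [int(v) for v in values if v >= NUM_CLASSES and v != IGNORE_INDEX]
        if bad:
            print(f'{os.path.basename(path)}: unexpected label values {bad}')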

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

bjzhb666 commented 8 months ago

How to fix it? Thanks!