open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0
8.27k stars 2.62k forks source link

CUDA error: an illegal memory access was encountered #42

Closed rassabin closed 4 years ago

rassabin commented 4 years ago
sys.platform: linux
Python: 3.7.7 (default, May  7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0,1: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 1.0.4
MMSegmentation: 0.5.0+b57fb2b
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.0

Error was encountered during training process with condfigs:

Config:
norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='PSPHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        pool_scales=(1, 2, 3, 6),
        dropout_ratio=0.1,
        num_classes=9,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=9,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'Aircraft'
data_root = '/mmdetection_aircraft/data/segm2/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(640, 480), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(640, 480),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=1,
    train=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm2/',
        img_dir='JPEGImages',
        ann_dir='PaletteClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(640, 480), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ],
        split='train.txt'),
    val=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm2/',
        img_dir='JPEGImages',
        ann_dir='PaletteClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 480),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'),
    test=dict(
        type='Aircraft',
        data_root='/mmdetection_aircraft/data/segm2/',
        img_dir='JPEGImages',
        ann_dir='PaletteClass',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(640, 480),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        split='val.txt'))
log_config = dict(
    interval=1, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 400
checkpoint_config = dict(by_epoch=False, interval=200)
evaluation = dict(interval=1, metric='mIoU')
work_dir = './work_dirs/pspnet'
seed = 0
gpu_ids = [1]

The script take an approximately 4-5GB of GPU from 11GB available and return this error:

ERROR

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-fec2661e1f4c> in <module>
     16 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
     17 train_segmentor(model, datasets, cfg, distributed=False, validate=True, 
---> 18                 meta=dict())

~/mmsegmentation/mmseg/apis/train.py in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    104     elif cfg.load_from:
    105         runner.load_checkpoint(cfg.load_from)
--> 106     runner.run(data_loaders, cfg.workflow, cfg.total_iters)

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in run(self, data_loaders, workflow, max_iters, **kwargs)
    117                     if mode == 'train' and self.iter >= max_iters:
    118                         break
--> 119                     iter_runner(iter_loaders[i], **kwargs)
    120 
    121         time.sleep(1)  # wait for some hooks like loggers to finish

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py in train(self, data_loader, **kwargs)
     53         self.call_hook('before_train_iter')
     54         data_batch = next(data_loader)
---> 55         outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
     56         if not isinstance(outputs, dict):
     57             raise TypeError('model.train_step() must return a dict')

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py in train_step(self, *inputs, **kwargs)
     29 
     30         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
---> 31         return self.module.train_step(*inputs[0], **kwargs[0])
     32 
     33     def val_step(self, *inputs, **kwargs):

~/mmsegmentation/mmseg/models/segmentors/base.py in train_step(self, data_batch, optimizer, **kwargs)
    150         #data_batch['gt_semantic_seg'] = data_batch['gt_semantic_seg'][:,:,:,:,0]
    151         #print(data_batch['gt_semantic_seg'].shape)
--> 152         losses = self.forward_train(**data_batch, **kwargs)
    153         loss, log_vars = self._parse_losses(losses)
    154 

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in forward_train(self, img, img_metas, gt_semantic_seg)
    155 
    156         loss_decode = self._decode_head_forward_train(x, img_metas,
--> 157                                                       gt_semantic_seg)
    158         losses.update(loss_decode)
    159 

~/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in _decode_head_forward_train(self, x, img_metas, gt_semantic_seg)
     99         loss_decode = self.decode_head.forward_train(x, img_metas,
    100                                                      gt_semantic_seg,
--> 101                                                      self.train_cfg)
    102 
    103         losses.update(add_prefix(loss_decode, 'decode'))

~/mmsegmentation/mmseg/models/decode_heads/decode_head.py in forward_train(self, inputs, img_metas, gt_semantic_seg, train_cfg)
    184         """
    185         seg_logits = self.forward(inputs)
--> 186         losses = self.losses(seg_logits, gt_semantic_seg)
    187         return losses
    188 

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py in new_func(*args, **kwargs)
    162                                 'method of nn.Module')
    163             if not (hasattr(args[0], 'fp16_enabled') and args[0].fp16_enabled):
--> 164                 return old_func(*args, **kwargs)
    165             # get the arg spec of the decorated method
    166             args_info = getfullargspec(old_func)

~/mmsegmentation/mmseg/models/decode_heads/decode_head.py in losses(self, seg_logit, seg_label)
    229             seg_label,
    230             weight=seg_weight,
--> 231             ignore_index=self.ignore_index)
    232         loss['acc_seg'] = accuracy(seg_logit, seg_label)
    233         return loss

~/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py in forward(self, cls_score, label, weight, avg_factor, reduction_override, **kwargs)
    175             class_weight=class_weight,
    176             reduction=reduction,
--> 177             avg_factor=avg_factor)
    178         return loss_cls

~/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py in cross_entropy(pred, label, weight, class_weight, reduction, avg_factor, ignore_index)
     28         weight = weight.float()
     29     loss = weight_reduce_loss(
---> 30         loss, weight=weight, reduction=reduction, avg_factor=avg_factor)
     31 
     32     return loss

~/mmsegmentation/mmseg/models/losses/utils.py in weight_reduce_loss(loss, weight, reduction, avg_factor)
     45     # if avg_factor is not specified, just reduce the loss
     46     if avg_factor is None:
---> 47         loss = reduce_loss(loss, reduction)
     48     else:
     49         # if reduction is mean, then average the loss by avg_factor

~/mmsegmentation/mmseg/models/losses/utils.py in reduce_loss(loss, reduction)
     19         return loss
     20     elif reduction_enum == 1:
---> 21         return loss.mean()
     22     elif reduction_enum == 2:
     23         return loss.sum()

RuntimeError: CUDA error: an illegal memory access was encountered

But if i reduce the size the image size twice with the same images per GPU (2) ,script takes approxiamtely 2GB from GPU and everything works fine. Also,i want to add that using another PyTorch script with my own Dataloader i'm able to fill in GPU on full (11GB) by training process with the same Torch version and the same hardware.

xvjiarui commented 4 years ago

Hi @rassabin For GPU memory usage, you may set cudnn_benchmark=False for more precise profile. As for the illegal memory access, you may try to use PyTorch 1.5 to see if it works.

UESTC-Liuxin commented 4 years ago

Hi @rassabin @xvjiarui Thanks to MMlab for providing such a good semantic segmentation framework. When I used a custom data set, I also encountered the same problem. Due to the extremely strong encapsulation feature of mmsegmentation, I spent at least 3 days and failed to locate the problem.

my env:

2020-08-02 17:45:33,618 - mmseg - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.7 (default, May  7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: GeForce RTX 2060 SUPER
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.3.0
MMCV: 1.0.4
MMSegmentation: 0.5.0+2b801de
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1

my configs

2020-08-02 17:37:38,478 - mmseg - INFO - Distributed training: True
2020-08-02 17:37:38,682 - mmseg - INFO - Config:
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='FCNHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        num_convs=2,
        concat_input=True,
        dropout_ratio=0.1,
        num_classes=17,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=17,
        norm_cfg=dict(type='SyncBN', requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'SkmtDataset'
data_root = 'data/VOCdevkit/Seg/skmt5'
img_norm_cfg = dict(
    mean=[34.73, 34.81, 34.45], std=[13.96, 13.93, 14.05], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[34.73, 34.81, 34.45],
        std=[13.96, 13.93, 14.05],
        to_rgb=True),
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[34.73, 34.81, 34.45],
                std=[13.96, 13.93, 14.05],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='SkmtDataset',
        data_root='data/VOCdevkit/Seg/skmt5',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        split='ImageSets/Segmentation/train.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
            dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='PhotoMetricDistortion'),
            dict(
                type='Normalize',
                mean=[34.73, 34.81, 34.45],
                std=[13.96, 13.93, 14.05],
                to_rgb=True),
            dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_semantic_seg'])
        ]),
    val=dict(
        type='SkmtDataset',
        data_root='data/VOCdevkit/Seg/skmt5',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        split='ImageSets/Segmentation/val.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[34.73, 34.81, 34.45],
                        std=[13.96, 13.93, 14.05],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='SkmtDataset',
        data_root='data/VOCdevkit/Seg/skmt5',
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        split='ImageSets/Segmentation/val.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 512),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[34.73, 34.81, 34.45],
                        std=[13.96, 13.93, 14.05],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 20000
checkpoint_config = dict(by_epoch=False, interval=2000)
evaluation = dict(interval=2000, metric='mIoU')
work_dir = './work_dirs/fcn_r50-d8_512x512_20k_voc12aug'
gpu_ids = range(0, 1)

my erros:

2020-08-02 17:45:34,250 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Start running, host: liuxin@liuxin, work_dir: /media/Program/CV/Project/SKMT/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug
2020-08-02 17:45:35,864 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
Traceback (most recent call last):
  File "tools/train.py", line 159, in <module>
    main()
  File "tools/train.py", line 155, in main
    meta=meta)
  File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/apis/train.py", line 105, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 119, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 55, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/media/Program/CV/Project/mmcv/mmcv/parallel/distributed.py", line 36, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 153, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 204, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f7201a47b5e in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7f7201802e30 in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7201a356ed in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51ee0a (0x7f722ea62e0a in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1a311e (0x557099a8f11e in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #5: <unknown function> + 0xfdfc8 (0x5570999e9fc8 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #6: <unknown function> + 0x10f147 (0x5570999fb147 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #7: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #8: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #9: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #10: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #11: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #12: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #13: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #14: PyDict_SetItem + 0x502 (0x557099a50172 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #15: PyDict_SetItemString + 0x4f (0x557099a50c4f in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #16: PyImport_Cleanup + 0xa0 (0x557099a95760 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #17: Py_FinalizeEx + 0x67 (0x557099b10817 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #18: <unknown function> + 0x2373d3 (0x557099b233d3 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #19: _Py_UnixMain + 0x3c (0x557099b236fc in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #20: __libc_start_main + 0xe7 (0x7f7250c9bb97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x1dc3c0 (0x557099ac83c0 in /home/liuxin/anaconda3/envs/mmlab/bin/python)

Traceback (most recent call last):
  File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/liuxin/anaconda3/envs/mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=0', 'configs/fcn/fcn_r50-d8_512x512_20k_voc12aug.py', '--launcher', 'pytorch']' died with <Signals.SIGABRT: 6>.

I use my custom dataset(the class_num is 17(including background);original img is RGB img ; gt_seg_map is P mode of PIL.)

UESTC-Liuxin commented 4 years ago

I solved this problem by modifying the label picture to one channel map which pixels consisted by class_id.

xvjiarui commented 4 years ago

Hi @UESTC-Liuxin Thanks for the information. I will add some doc about the label picture.

fangruizhu commented 4 years ago

Hi, did you add it to the doc. ? I encountered the same problem.

At the beginning, I ran with 1 sample per gpu and it worked fine.

data = dict( samples_per_gpu=1, workers_per_gpu=4, train=dict( type=dataset_type, data_root=data_root, img_dir='images/training', ann_dir='annotations/training', pipeline=train_pipeline),

After I changed the setting to 2 samples per gpu, I raises the error.

data = dict( samples_per_gpu=2, workers_per_gpu=4, train=dict( type=dataset_type, data_root=data_root, img_dir='images/training', ann_dir='annotations/training', pipeline=train_pipeline),

The error:

File "/home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 36, in train_step output = self.module.train_step(*inputs[0], *kwargs[0]) File "/home/ubuntu/work/models/model.py", line 153, in train_step loss, log_vars = self._parse_losses(losses) File "/home/ubuntu/work/models/model.py", line 193, in _parse_losses log_vars[loss_name] = loss_value.item() RuntimeError: CUDA error: an illegal memory access was encountered terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914855613/work/c10/cuda/CUDACachingAllocator.cpp:771) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fd854c0fb5e in /home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void) + 0x6d0 (0x7fd854e54e30 in /home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)

Any ideas of that? Thank you very much~

xvjiarui commented 4 years ago

Hi @fangruizhu I created a PR to add docs. Are you using distributed training or not? Which dataset are you using?

fangruizhu commented 4 years ago

Thanks for the quick reply! @xvjiarui Yes, I use distributed training on ADE20K data. I used my own model, while the code is based on 'https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/models/segmentors/encoder_decoder.py'. I'm quite confused where the error might be, since running with the config 'https://github.com/open-mmlab/mmsegmentation/blob/master/configs/fcn/fcn_r50-d8_512x512_80k_ade20k.py' is okay for me.

xvjiarui commented 4 years ago

Hi @fangruizhu You may check the following.

  1. The num_classes of your own model.
  2. Whether training with 1 image/GPU could continue for more steps without error.
fangruizhu commented 4 years ago

Thanks! @xvjiarui It seems quite weird. The num_classes is 150, the same as in ADE20k general setting. And it works fine with 1 image/GPU by distributed training.

When I trained with a single gpu with multiple samples on it, it got the same error. The error occurs at https://github.com/open-mmlab/mmsegmentation/blob/3e49d0ad7174c062cb4a3d72eb8e71b94d5ba0fd/mmseg/models/segmentors/base.py#L204. The submodule of my model is directly derived from nn.Module and are not registered with register_module. Would that cause problems?

xvjiarui commented 4 years ago

Hi @fangruizhu You may try if the original configs provided in the repo works in 2 images/GPU, e.g. pspnet_r50-d8_512x512_160k_ade20k.py. If so, may I have a look at your config file? On the other hand, sometimes when CrossEntropy gets the incorrect match of input and label, there will be CUDA error.

fangruizhu commented 4 years ago

Thanks! @xvjiarui I fixed the error by re-installing the mmcv lite version, disabling CUDA complier in mmcv. I think maybe the error was caused by CUDA environment.

York1996OutLook commented 4 years ago

because we use labelme,and some Data Annotator write the wrong new label,account for 11 classes ,rather than 10 classes.......

553566286 commented 3 years ago

check your labels and the dataset configs if you encountered this error.

tetsu-kikuchi commented 3 years ago

In my case, I used my custom dataset and used config

python tools/train.py  configs/deeplabv3plus/deeplabv3plus_r101-d8_512x512_160k_ade20k.py

What was a trap is that we have to change num_classes in

model = dict(
decode_head=dict(num_classes=150), auxiliary_head=dict(num_classes=150))

in configs/deeplabv3plus/deeplabv3plus_r50-d8_512x512_160k_ade20k.py, as well as num_classes in configs/base/models/deeplabv3plus_r50-d8.py.

The config style is slightly different depending on a dataset type you use. Moreover, the same information (num_classes, in my case) appears here and there in a set of relevant config files. So we really have to be careful whether we have completely modified configs according to our custom datasets...

adriengoleb commented 1 year ago

Hello,

I would like to reopen this issue because I have exactly the same problem described above.

I want to run the segmentation training code for cityscapes as follow with 1 GPU. However, I obtain this error :


(/data/users/agolebiewski/conda-envs/segformer) python tools/train.py local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py --gpus 1
2023-05-22 11:20:22,493 - mmseg - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55) [GCC 11.3.0]
CUDA available: True
GPU 0,1,2: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.2.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMSegmentation: 0.11.0+1a8ad51
------------------------------------------------------------

2023-05-22 11:20:22,493 - mmseg - INFO - Distributed training: False
2023-05-22 11:20:22,669 - mmseg - INFO - Config:
norm_cfg = dict(type='BN', requires_grad=True)
find_unused_parameters = True
model = dict(
    type='EncoderDecoder',
    pretrained='pretrained/mit_b1.pth',
    backbone=dict(type='mit_b1', style='pytorch'),
    decode_head=dict(
        type='SegFormerHead',
        in_channels=[64, 128, 320, 512],
        in_index=[0, 1, 2, 3],
        feature_strides=[4, 8, 16, 32],
        channels=128,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=dict(type='BN', requires_grad=True),
        align_corners=False,
        decoder_params=dict(embed_dim=256),
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    train_cfg=dict(),
    test_cfg=dict(mode='slide', crop_size=(1024, 1024), stride=(768, 768)))
dataset_type = 'CityscapesDataset'
data_root = 'data/cityscapes/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (1024, 1024)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 1024), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=(1024, 1024), cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size=(1024, 1024), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 1024),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='RepeatDataset',
        times=500,
        dataset=dict(
            type='CityscapesDataset',
            data_root='data/cityscapes/',
            img_dir='leftImg8bit/train',
            ann_dir='gtFine/train',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations'),
                dict(
                    type='Resize',
                    img_scale=(2048, 1024),
                    ratio_range=(0.5, 2.0)),
                dict(
                    type='RandomCrop',
                    crop_size=(1024, 1024),
                    cat_max_ratio=0.75),
                dict(type='RandomFlip', prob=0.5),
                dict(type='PhotoMetricDistortion'),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(
                    type='Pad', size=(1024, 1024), pad_val=0, seg_pad_val=255),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_semantic_seg'])
            ])),
    val=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 1024),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        img_dir='leftImg8bit/val',
        ann_dir='gtFine/val',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(2048, 1024),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        dict(type='TensorboardLoggerHook')
    ])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(
    type='AdamW',
    lr=6e-05,
    betas=(0.9, 0.999),
    weight_decay=0.01,
    paramwise_cfg=dict(
        custom_keys=dict(
            pos_block=dict(decay_mult=0.0),
            norm=dict(decay_mult=0.0),
            head=dict(lr_mult=10.0))))
optimizer_config = dict()
lr_config = dict(
    policy='poly',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-06,
    power=1.0,
    min_lr=0.0,
    by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=160000)
checkpoint_config = dict(by_epoch=False, interval=4000)
evaluation = dict(interval=4000, metric='mIoU')
work_dir = './work_dirs/segformer.b1.1024x1024.city.160k'
gpu_ids = range(0, 1)

2023-05-22 11:20:23,032 - mmseg - INFO - Use load_from_local loader
2023-05-22 11:20:23,076 - mmseg - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: head.weight, head.bias

2023-05-22 11:20:23,078 - mmseg - INFO - EncoderDecoder(
  (backbone): mit_b1(
    (patch_embed1): OverlapPatchEmbed(
      (proj): Conv2d(3, 64, kernel_size=(7, 7), stride=(4, 4), padding=(3, 3))
      (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
    (patch_embed2): OverlapPatchEmbed(
      (proj): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
    (patch_embed3): OverlapPatchEmbed(
      (proj): Conv2d(128, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
    )
    (patch_embed4): OverlapPatchEmbed(
      (proj): Conv2d(320, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (block1): ModuleList(
      (0): Block(
        (norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=64, out_features=64, bias=True)
          (kv): Linear(in_features=64, out_features=128, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=64, out_features=64, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(64, 64, kernel_size=(8, 8), stride=(8, 8))
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): Identity()
        (norm2): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=64, out_features=256, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
          )
          (act): GELU()
          (fc2): Linear(in_features=256, out_features=64, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=64, out_features=64, bias=True)
          (kv): Linear(in_features=64, out_features=128, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=64, out_features=64, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(64, 64, kernel_size=(8, 8), stride=(8, 8))
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=64, out_features=256, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
          )
          (act): GELU()
          (fc2): Linear(in_features=256, out_features=64, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
    (block2): ModuleList(
      (0): Block(
        (norm1): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=128, out_features=128, bias=True)
          (kv): Linear(in_features=128, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=128, out_features=128, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(128, 128, kernel_size=(4, 4), stride=(4, 4))
          (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=128, out_features=512, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=512)
          )
          (act): GELU()
          (fc2): Linear(in_features=512, out_features=128, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=128, out_features=128, bias=True)
          (kv): Linear(in_features=128, out_features=256, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=128, out_features=128, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(128, 128, kernel_size=(4, 4), stride=(4, 4))
          (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=128, out_features=512, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=512)
          )
          (act): GELU()
          (fc2): Linear(in_features=512, out_features=128, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
    (block3): ModuleList(
      (0): Block(
        (norm1): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=320, out_features=320, bias=True)
          (kv): Linear(in_features=320, out_features=640, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=320, out_features=320, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(320, 320, kernel_size=(2, 2), stride=(2, 2))
          (norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=320, out_features=1280, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1280)
          )
          (act): GELU()
          (fc2): Linear(in_features=1280, out_features=320, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=320, out_features=320, bias=True)
          (kv): Linear(in_features=320, out_features=640, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=320, out_features=320, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
          (sr): Conv2d(320, 320, kernel_size=(2, 2), stride=(2, 2))
          (norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=320, out_features=1280, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1280)
          )
          (act): GELU()
          (fc2): Linear(in_features=1280, out_features=320, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm3): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
    (block4): ModuleList(
      (0): Block(
        (norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=512, out_features=512, bias=True)
          (kv): Linear(in_features=512, out_features=1024, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=512, out_features=512, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(2048, 2048, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2048)
          )
          (act): GELU()
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (q): Linear(in_features=512, out_features=512, bias=True)
          (kv): Linear(in_features=512, out_features=1024, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=512, out_features=512, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (dwconv): DWConv(
            (dwconv): Conv2d(2048, 2048, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2048)
          )
          (act): GELU()
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (drop): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm4): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decode_head): SegFormerHead(
    input_transform=multiple_select, ignore_index=255, align_corners=False
    (loss_decode): CrossEntropyLoss()
    (conv_seg): Conv2d(128, 19, kernel_size=(1, 1), stride=(1, 1))
    (dropout): Dropout2d(p=0.1, inplace=False)
    (linear_c4): MLP(
      (proj): Linear(in_features=512, out_features=256, bias=True)
    )
    (linear_c3): MLP(
      (proj): Linear(in_features=320, out_features=256, bias=True)
    )
    (linear_c2): MLP(
      (proj): Linear(in_features=128, out_features=256, bias=True)
    )
    (linear_c1): MLP(
      (proj): Linear(in_features=64, out_features=256, bias=True)
    )
    (linear_fuse): ConvModule(
      (conv): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (linear_pred): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
  )
)
2023-05-22 11:20:23,130 - mmseg - INFO - Loaded 2975 images
2023-05-22 11:20:25,113 - mmseg - INFO - Loaded 500 images
2023-05-22 11:20:25,113 - mmseg - INFO - Start running, host: d624032@rosetta-c4140gpu02, work_dir: /gpfs_new/scratch/users/agolebiewski/SegFormer/work_dirs/segformer.b1.1024x1024.city.160k
2023-05-22 11:20:25,113 - mmseg - INFO - workflow: [('train', 1)], max: 160000 iters
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 155, in main
    train_segmentor(
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/apis/train.py", line 115, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 153, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 204, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629395347/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7feaf8dd377d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7feaf9023d9d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7feaf8dbfb1d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7feb3693156b in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xf5 (0x7feb63b7a555 in /lib64/libc.so.6)

Aborted

The error occurs at the same location (like @xvjiarui)

I attempted to follow the others issues comments. But I always obtain this CUDA error: an illegal memory access was encountered ...

Thanks! @xvjiarui It seems quite weird. The num_classes is 150, the same as in ADE20k general setting. And it works fine with 1 image/GPU by distributed training.

When I trained with a single gpu with multiple samples on it, it got the same error. The error occurs at

https://github.com/open-mmlab/mmsegmentation/blob/3e49d0ad7174c062cb4a3d72eb8e71b94d5ba0fd/mmseg/models/segmentors/base.py#L204

. The submodule of my model is directly derived from nn.Module and are not registered with register_module. Would that cause problems?

zenhanghg-heng commented 5 months ago

form the ' log_vars[loss_name] = loss_value.item()' ,i think it happens for the wrong index in labels in the dataset enhancing process.

checking you labels:

  1. it should be using the id for each class rather than the rgb labels
  2. checking the "ignore index" , for cross entropy function, its default "ignore index =-100", you should ignore the right index in your dataset config files,

as for custom dataset, and not ignoring the background, for padding process, adding another index for the padding elements in case conflicting with the ids would be calculated in you loss function.

for example: for custom dataset, an very common reason for this error is setting the wrong value for padding elements, and solving by : the dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) -> dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100),