open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

RuntimeError: CUDA error: an illegal memory access was encountered #1338

Closed uniyushu closed 2 years ago

uniyushu commented 2 years ago

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

**Describe the bug**

```
2022-03-01 07:52:37,584 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
2022-03-01 07:52:37,585 - mmseg - INFO - Checkpoints will be saved to /data/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug by HardDiskBackend.
/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "tools/train.py", line 234, in <module>
    main()
  File "tools/train.py", line 223, in main
    train_segmentor(
  File "/data/mmsegmentation/mmseg/apis/train.py", line 174, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
    loss['acc_seg'] = accuracy(
  File "/data/mmsegmentation/mmseg/models/losses/accuracy.py", line 47, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57296eea22 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10983 (0x7f572994f983 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f5729951027 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f57296d85a4 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa27e1a (0x7f56d4a16e1a in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa27eb1 (0x7f56d4a16eb1 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #23: __libc_start_main + 0xe7 (0x7f573a7ffb97 in /lib/x86_64-linux-gnu/libc.so.6)
```

**Reproduction**

1. What command or script did you run?

My config based on [fcn_r50-d8_512x512_20k_voc12aug.py](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/fcn/fcn_r50-d8_512x512_20k_voc12aug.py):

```
_base_ = [
    '../_base_/models/fcn_r50-d8.py', '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_20k.py'
]
model = dict(
    decode_head=dict(num_classes=2), auxiliary_head=dict(num_classes=2))

# dataset settings
dataset_type = 'PascalVOCDataset'
data_root = '/data/ps_image_data/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        # img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='train/img',
        ann_dir='train/mask/',
        split='train/train.txt',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='train/img',
        ann_dir='train/mask/',
        split='train/val.txt',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='train/img',
        ann_dir='train/mask/',
        split='train/test.txt',
        pipeline=test_pipeline))
```

2. Did you make any modifications on the code or config? Did you understand what you have modified?

I modified [voc.py](https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/datasets/voc.py):

```
CLASSES = ('background', 'foreground')

PALETTE = [[0, 0, 0], [255, 255, 255]]
```

3. What dataset did you use?

I use my own two classes dataset, img is .jpg, mask is .png format.

![image](https://user-images.githubusercontent.com/20262193/156129522-0d0be561-a108-439b-87db-02087fd09a2c.png)

**Environment**

```
sys.platform: linux
Python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.9.0+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.4.6
MMCV Compiler: GCC 7.4
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.21.1+7cbd396
```
laibruce commented 2 years ago

I have met the same problem too. It seems to have something to do with this PyTorch issue: https://github.com/pytorch/pytorch/issues/21819

laibruce commented 2 years ago

And this error is not raised every time; it only appears some of the time when running the code.

uniyushu commented 2 years ago

> I have met the same problem too. It seems to have something to do with this PyTorch issue: pytorch/pytorch#21819

Thanks for the reply. It is true that the error is only raised some of the time when running the code. The config did run the first time on a single GPU; I will try different batch sizes, or maybe update my CUDA driver to 11.

laibruce commented 2 years ago

> > I have met the same problem too. It seems to have something to do with this PyTorch issue: pytorch/pytorch#21819
>
> Thanks for the reply. It is true that the error is only raised some of the time when running the code. The config did run the first time on a single GPU; I will try different batch sizes, or maybe update my CUDA driver to 11.

Yes, if you manage to fix the problem, please tell me. Thanks!

ZetaLx commented 2 years ago

I also encountered the same problem. Did you solve it?

YuchenKid commented 1 year ago

I found a very strange solution to this error. When I changed the binary masks I generated (the so-called annotations or labels required for segmentation training) from PNG to JPG, the error showed up. When I switched back to PNG, the error disappeared.

ZetaLx commented 1 year ago

This is not surprising, and it is the correct solution. Because JPG images are lossily compressed, image information is lost and the pixel values of the labels end up wrong. PNG is lossless, so segmentation labels should be saved in PNG format. I had this problem before.
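A minimal sketch of why this happens (the file names here are made up for illustration): save the same binary label map as PNG and as JPEG, then compare the pixel values that come back.

```
# Illustration with made-up file names: JPEG is lossy, so a 0/1 label map can
# come back with extra pixel values around edges, while PNG keeps the class ids.
import numpy as np
from PIL import Image

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1  # class ids: 0 = background, 1 = foreground

Image.fromarray(mask).save('mask.png')  # lossless
Image.fromarray(mask).save('mask.jpg')  # lossy

print(np.unique(np.array(Image.open('mask.png'))))  # [0 1]
print(np.unique(np.array(Image.open('mask.jpg'))))  # typically extra values from compression artifacts
```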


Tracy-git commented 1 year ago

Hi bro, I have solved this error. Maybe you can check your batch size: when I use 2 I don't trigger this error, but when I use 4 the error shows up.

sipie800 commented 1 year ago

It's still there. I am definitely using PNG, and batch sizes from 2 to 8 don't help. IMO it is caused by the torch multiprocessing machinery; we should probably not be sharing a single copy of `correct` between workers here.

zenhanghg-heng commented 4 months ago

From the line `correct = correct[:, target != ignore_index]`, I think this happens because of wrong index labels introduced by the dataset augmentation process, or a wrong setting of `ignore_index` in the decode head and loss function.

Not using `ignore_index` correctly in your decode head and loss settings will cause this error; check the docs for more details on how `ignore_index` is used in mmseg.
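For example, a hedged sketch (the exact option placement may vary across mmseg versions) of keeping the heads' `ignore_index` consistent with the padding value used in the pipeline:

```
# Hypothetical config fragment: keep the heads' ignore_index consistent with the
# seg_pad_val used by the Pad transform, so padded pixels stay out of the loss.
model = dict(
    decode_head=dict(num_classes=2, ignore_index=255),
    auxiliary_head=dict(num_classes=2, ignore_index=255))
```

If I remember correctly, the decode head already defaults to 255, so this mainly matters when your labels use a different ignore value.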

Check your labels:

  1. They should store the class id for each pixel rather than RGB colors (a quick check is sketched at the end of this comment).
  2. Check the ignore index: for the cross-entropy function the default is `ignore_index=-100`, so you should make sure the right index is ignored in your dataset config files.

For a custom dataset that does not ignore the background, the padding step should use a dedicated index for the padded pixels, so it does not conflict with the ids that are counted in your loss function.

For example: for a custom dataset, a very common cause of this error is setting the wrong value for the padded pixels, which can be solved by changing `dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255)` to `dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100)`.
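To verify point 1 and the padding value in practice, here is a small hypothetical sanity check (the path and the constants are assumptions taken from the config earlier in this issue):

```
# Hypothetical sanity check: flag label values that are neither a valid class id
# nor the ignore value; such values are a common cause of this illegal memory access.
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 2      # background, foreground
IGNORE_VALUE = 255   # matches seg_pad_val in the Pad transform of the config above

for path in glob.glob('/data/ps_image_data/train/mask/*.png'):  # assumed mask location
    values = np.unique(np.array(Image.open(path)))
    bad = [int(v) for v in values if v >= NUM_CLASSES and v != IGNORE_VALUE]
    if bad:
        print(path, 'has unexpected label values:', bad)
```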