Closed uniyushu closed 2 years ago
I have met the same problem too. It seems to have something to do with PyTorch: https://github.com/pytorch/pytorch/issues/21819
And this error is not raised every time; it only appears sometimes while running the code.
> I have met the same problem too. It seems to have something to do with PyTorch: pytorch/pytorch#21819

Thanks for the reply. It is true that the error is only raised sometimes while running the code. The config could run the first time on a single GPU; I will try different batch sizes, or maybe update my CUDA driver to 11.
Yes, if you have fixed the problem, please tell me. Thanks!
I also encountered the same problem; did you solve it?
I found a very strange solution to this error. When I change the binary masks I generated (the so-called annotations or labels required for segmentation training) from PNG to JPG, the error appears. When I switch back to PNG, the error disappears.
This is not surprising, and switching back to PNG is the correct solution. Because JPG is a lossy, compressed format, image information is lost and the pixel values of the labels become wrong. PNG is lossless, so segmentation labels should be saved in PNG format. I had this problem before.
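The effect is easy to demonstrate. The sketch below (using Pillow and NumPy; the 64x64 mask with class ids 0 and 1 is a made-up example) round-trips the same mask through PNG and JPEG in memory:

```python
import io

import numpy as np
from PIL import Image

# Hypothetical toy label mask with class ids 0 and 1 (a square on background).
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:44, 20:44] = 1

def roundtrip(fmt: str) -> np.ndarray:
    """Save the mask in the given format and load it back."""
    buf = io.BytesIO()
    Image.fromarray(mask).save(buf, format=fmt)
    buf.seek(0)
    return np.array(Image.open(buf).convert("L"))

# PNG is lossless: the label ids survive the round trip exactly.
assert np.array_equal(roundtrip("PNG"), mask)

# JPEG is lossy: compression can shift pixel values, so the "labels"
# read back may contain ids that no longer match any class.
print(np.unique(roundtrip("JPEG")))
```

If the JPEG round trip introduces any value outside the valid class-id range, downstream GPU indexing on those labels can fail.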
Hi, I have solved this error. Maybe you can check your batch size: when I use 2 I don't trigger this error, but when I use 4 the error shows up.
It's still there. I do use PNG, and batch sizes from 2 to 8 don't help. IMO it is due to torch multiprocessing: we should not share a single copy of `correct` between workers.
From the line `correct = correct[:, target != ignore_index]`, I think it happens because of wrong index labels produced during the dataset augmentation process, or a wrong `ignore_index` setting in the decode head and loss function. Not using `ignore_index` correctly in your decode head and loss settings will cause this error; check the details of how `ignore_index` is used in mmseg.
Check your labels:
For a custom dataset that does not ignore the background, add a separate index for the padding pixels in the padding step, so it cannot conflict with the ids your loss function is computed over.
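A quick sanity check along these lines can be run on your masks before training. This is a sketch; `num_classes`, `ignore_index`, and the tiny example mask are assumptions you should replace with your own config values and real label files:

```python
import numpy as np

# Assumed values; take these from your dataset/model config.
num_classes = 21
ignore_index = 255

# Hypothetical label mask: for a 21-class model the valid ids are 0..20,
# so 21 is out of range, while 255 is excluded via ignore_index.
mask = np.array([[0, 5, 255],
                 [20, 21, 3]], dtype=np.uint8)

values = np.unique(mask)
invalid = [int(v) for v in values if v >= num_classes and v != ignore_index]
print(invalid)  # [21] -- such ids can trigger illegal memory accesses on GPU
```

Any id that is neither a valid class nor the ignore value will index past the end of GPU buffers in the loss/accuracy kernels, which is exactly the kind of bug that surfaces as a delayed "illegal memory access".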
For example, a very common cause of this error on custom datasets is a wrong value for the padding pixels, solved by changing dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) to dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100).
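As a minimal sketch of that suggestion (the crop size is a placeholder; use your own pipeline's value):

```python
# Assumed crop size; substitute your own.
crop_size = (512, 512)

# Before: padded label pixels get 255, which can collide with a real
# class id if your dataset does not treat 255 as ignore_index.
pad_before = dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255)

# After: -100 matches PyTorch's default CrossEntropyLoss ignore_index,
# so padded pixels can never be mistaken for a class id.
pad_after = dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100)
```

The key point is that `seg_pad_val` must be a value your loss actually ignores, not one that overlaps the class-id range.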
Describe the bug

2022-03-01 07:52:37,584 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
2022-03-01 07:52:37,585 - mmseg - INFO - Checkpoints will be saved to /data/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug by HardDiskBackend.
/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "tools/train.py", line 234, in <module>
    main()
  File "tools/train.py", line 223, in main
    train_segmentor(
  File "/data/mmsegmentation/mmseg/apis/train.py", line 174, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
    loss['acc_seg'] = accuracy(
  File "/data/mmsegmentation/mmseg/models/losses/accuracy.py", line 47, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57296eea22 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10983 (0x7f572994f983 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f5729951027 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f57296d85a4 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x7f56d4a16e1a in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa27eb1 (0x7f56d4a16eb1 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)