open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Errors occur when I set an ignore_index in training. #300

Closed Shiming94 closed 3 years ago

Shiming94 commented 3 years ago

Hi,

has anyone run into this problem: the error RuntimeError: CUDA error: an illegal memory access was encountered occurs when an ignore_index is set for the model?

Without the ignore_index the code runs very well.

The error reports are as follows:

Traceback (most recent call last):
  File "train_hospital.py", line 193, in <module>
    train_segmentor(model, datasets, cfg, distributed=False, validate=True, timestamp= timestamp, meta=dict())
  File "/home/shiming/3.mmsegmentation_ws/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 130, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/shiming/3.mmsegmentation_ws/mmsegmentation/mmseg/models/segmentors/base.py", line 153, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/shiming/3.mmsegmentation_ws/mmsegmentation/mmseg/models/segmentors/base.py", line 204, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f514b37677d in /home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f514b5c6d9d in /home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f514b362b1d in /home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7f5184bf90ea in /home/shiming/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: __libc_start_main + 0xe7 (0x7f51b5e32bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Best regards Shiming

xvjiarui commented 3 years ago

Hi @Shiming94, what is your ignore value? You may also provide your config file for more information.

Shiming94 commented 3 years ago

Hi @xvjiarui, I usually set the last class, num - 1, as the ignore class. The default ignore value in the code is 255, right? Every time I change the ignore value to num - 1, the error occurs.
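
For reference, this is roughly how I set it (a simplified sketch rather than my actual config; num_classes=8 is just an example):

```python
# Simplified sketch, not my actual config: overriding the ignore value so the
# last class (num_classes - 1) is ignored instead of the default 255.
num_classes = 8  # example value
model = dict(
    decode_head=dict(
        num_classes=num_classes,
        ignore_index=num_classes - 1,  # ignore class 7 instead of 255
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)))
```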

xvjiarui commented 3 years ago

The default is 255. You may check if all your data segmentation labels are within range.
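
For example, a quick check like this (just a sketch; the annotation directory, file suffix, and num_classes=8 are placeholders for your own dataset) will list any label values outside the valid range:

```python
import glob

import numpy as np
from PIL import Image

# Sketch of a label sanity check: report every annotation file that contains a
# value which is neither a valid class id (0..num_classes-1) nor the 255 ignore
# value. 'data/my_dataset/ann_dir' and num_classes=8 are placeholders.
num_classes = 8
bad_files = {}
for path in glob.glob('data/my_dataset/ann_dir/**/*.png', recursive=True):
    values = np.unique(np.array(Image.open(path)))
    invalid = values[(values >= num_classes) & (values != 255)]
    if invalid.size:
        bad_files[path] = invalid.tolist()
print(bad_files if bad_files else 'all labels are within range')
```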

Shiming94 commented 3 years ago

> The default is 255. You may check if all your data segmentation labels are within range.

Hi @xvjiarui, for example, if I define 8 classes, the max pixel value should be 7. It seems that after some processing step a lot of pixels end up with the value 255. That is why ignore_index 255 is accepted but other values are not (when 255 is not ignored, it is definitely beyond the valid range). So one solution is simply to mask all 255 pixels to the ignore value we defined, say 7; a sketch of this is below.
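
```python
import numpy as np

# Rough sketch of the workaround: remap the 255 pixels produced during
# preprocessing to the ignore id I defined (7 in my 8-class example).
def remap_ignore(label_map: np.ndarray, ignore_id: int = 7) -> np.ndarray:
    label_map = label_map.copy()
    label_map[label_map == 255] = ignore_id
    return label_map
```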

But I don't know exactly which step generates so many pixels with value 255. Can you explain this to me?

Thanks a lot.

fangruizhu commented 3 years ago

@Shiming94 the augmentation procedure in the config file will add pixels with value 255, such as dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255).
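
As a toy illustration (plain NumPy, not the actual transform code), padding a small label map the way the Pad transform does shows where the 255 pixels come from:

```python
import numpy as np

# Toy illustration: a 2x2 label map of an 8-class dataset (valid ids 0..7)
# padded up to a 4x4 crop size with seg_pad_val=255, the way Pad fills the
# border. The 255 values only come from the padded region.
label = np.array([[0, 1], [2, 7]], dtype=np.uint8)
padded = np.full((4, 4), 255, dtype=np.uint8)
padded[:label.shape[0], :label.shape[1]] = label
print(np.unique(padded))  # [  0   1   2   7 255]
```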

JackeyGHD1 commented 3 years ago

> @Shiming94 the augmentation procedure in the config file will add pixels with value 255, such as dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255).

I have some questions about the usage of this package and would like to discuss them with you. I have emailed you and am really hoping for your reply. Thanks a lot!

zenhanghg-heng commented 3 months ago

From the line 'log_vars[loss_name] = loss_value.item()' in the traceback, I think this happens because of wrong label indices introduced during the data augmentation process.

Check your labels:

  1. They should use the class id for each class rather than RGB labels.
  2. Check the "ignore index": the cross-entropy loss function defaults to ignore_index=-100, so make sure you ignore the right index in your dataset config files.

For a custom dataset that does not ignore the background, the padding step should assign the padded pixels an extra index of their own, so it does not conflict with the ids that are actually used in your loss computation; see the config sketch at the end of this comment.

For example, with a custom dataset a very common cause of this error is setting the wrong value for the padded pixels. It can be solved by changing dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) to dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100).
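
A minimal sketch of what keeping the two values consistent looks like (standard MMSegmentation config keys; num_classes=8 and crop_size are example values, adjust to your own setup):

```python
# Minimal sketch (standard MMSegmentation config keys, example values): the
# value used to pad the label map must be the value the loss ignores.
crop_size = (512, 512)
train_pipeline = [
    # ... other transforms ...
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
]
model = dict(
    decode_head=dict(
        num_classes=8,
        ignore_index=255,  # must match seg_pad_val above
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)))
```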