open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

IndexError: Target 15 is out of bounds. #1553

Open firqaaa opened 2 years ago

firqaaa commented 2 years ago

I am trying to run the tutorial notebook on my local computer. Every step works until I run the cell for Train and Evaluation, which fails with IndexError: Target 15 is out of bounds. I don't understand where the number 15 comes from, because we know that the number of labels is 8. For comparison, I ran the notebook in Colab and it works smoothly.

Here's my error:

Output exceeds the size limit. Open the full output data in a text editor
2022-05-05 13:44:15,427 - mmseg - INFO - Loaded 572 images
/home/firqa/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/backbones/resnet.py:431: UserWarning: DeprecationWarning: pretrained is a deprecated, please use "init_cfg" instead
  warnings.warn('DeprecationWarning: pretrained is a deprecated, '
2022-05-05 13:44:16,279 - mmseg - INFO - Loaded 143 images
2022-05-05 13:44:16,290 - mmseg - INFO - load checkpoint from local path: /mnt/d/Notebook/Projects/Python/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
2022-05-05 13:44:17,783 - mmseg - WARNING - The model and loaded state dict do not match exactly

size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([19, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([15, 512, 1, 1]).
size mismatch for decode_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([15]).
size mismatch for auxiliary_head.conv_seg.weight: copying a param with shape torch.Size([19, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([15, 256, 1, 1]).
size mismatch for auxiliary_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([15]).
2022-05-05 13:44:17,795 - mmseg - INFO - Start running, host: firqa@firqaaa, work_dir: /mnt/d/Notebook/Projects/Python
2022-05-05 13:44:17,800 - mmseg - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) PolyLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
before_train_epoch:
(VERY_HIGH ) PolyLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TextLoggerHook
2022-05-05 13:44:17,803 - mmseg - INFO - workflow: [('train', 1)], max: 200 iters
2022-05-05 13:44:17,806 - mmseg - INFO - Checkpoints will be saved to /mnt/d/Notebook/Projects/Python by HardDiskBackend.

IndexError                                Traceback (most recent call last)
Input In [18], in <cell line: 17>()
     15 # Create work_dir
     16 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 17 train_segmentor(model, datasets, cfg, distributed=False, validate=True,
     18                 meta=dict())

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/apis/train.py:191, in train_segmentor(model, dataset, cfg, distributed, validate, timestamp, meta)
    189 elif cfg.load_from:
    190     runner.load_checkpoint(cfg.load_from)
--> 191 runner.run(data_loaders, cfg.workflow)

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py:134, in IterBasedRunner.run(self, data_loaders, workflow, max_iters, **kwargs)
    132 if mode == 'train' and self.iter >= self._max_iters:
    133     break
--> 134 iter_runner(iter_loaders[i], **kwargs)
    136 time.sleep(1)  # wait for some hooks like loggers to finish
    137 self.call_hook('after_epoch')

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py:61, in IterBasedRunner.train(self, data_loader, **kwargs)
     59 data_batch = next(data_loader)
     60 self.call_hook('before_train_iter')
---> 61 outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
     62 if not isinstance(outputs, dict):
     63     raise TypeError('model.train_step() must return a dict')

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmcv/parallel/data_parallel.py:60, in MMDataParallel.train_step(self, *inputs, **kwargs)
     56 if not self.device_ids:
     57     # We add the following line thus the module could gather and
     58     # convert data containers as those in GPU inference
     59     inputs, kwargs = self.scatter(inputs, kwargs, [-1])
---> 60     return self.module.train_step(*inputs[0], **kwargs[0])
     62 assert len(self.device_ids) == 1, \
     63     ('MMDataParallel only supports single GPU training, if you need to'
     64      ' train with multiple GPUs, please use MMDistributedDataParallel'
     65      ' instead.')
     67 for t in chain(self.module.parameters(), self.module.buffers()):

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/segmentors/base.py:138, in BaseSegmentor.train_step(self, data_batch, optimizer, **kwargs)
    112 def train_step(self, data_batch, optimizer, **kwargs):
    113     """The iteration step during training.
    114
    115     This method defines an iteration step during training, except for the
   (...)
    136         averaging the logs.
    137     """
--> 138     losses = self(**data_batch)
    139     loss, log_vars = self._parse_losses(losses)
    141     outputs = dict(
    142         loss=loss,
    143         log_vars=log_vars,
    144         num_samples=len(data_batch['img_metas']))

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
   1106 # If we don't have any hooks, we want to skip the rest of the logic in
   1107 # this function, and just call forward.
   1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110     return forward_call(*input, **kwargs)
   1111 # Do not call functions when jit is used
   1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py:110, in auto_fp16.<locals>.auto_fp16_wrapper.<locals>.new_func(*args, **kwargs)
    107     raise TypeError('@auto_fp16 can only be used to decorate the '
    108                     f'method of those classes {supported_types}')
    109 if not (hasattr(args[0], 'fp16_enabled') and args[0].fp16_enabled):
--> 110     return old_func(*args, **kwargs)
    112 # get the arg spec of the decorated method
    113 args_info = getfullargspec(old_func)

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/segmentors/base.py:108, in BaseSegmentor.forward(self, img, img_metas, return_loss, **kwargs)
     98 """Calls either :func:`forward_train` or :func:`forward_test` depending
     99 on whether return_loss is True.
    100
   (...)
    105 the outer list indicating test time augmentations.
    106 """
    107 if return_loss:
--> 108     return self.forward_train(img, img_metas, **kwargs)
    109 else:
    110     return self.forward_test(img, img_metas, **kwargs)

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/segmentors/encoder_decoder.py:143, in EncoderDecoder.forward_train(self, img, img_metas, gt_semantic_seg)
    139 x = self.extract_feat(img)
    141 losses = dict()
--> 143 loss_decode = self._decode_head_forward_train(x, img_metas,
    144                                               gt_semantic_seg)
    145 losses.update(loss_decode)
    147 if self.with_auxiliary_head:

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/segmentors/encoder_decoder.py:86, in EncoderDecoder._decode_head_forward_train(self, x, img_metas, gt_semantic_seg)
     83 """Run forward function and calculate loss for decode head in
     84 training."""
     85 losses = dict()
---> 86 loss_decode = self.decode_head.forward_train(x, img_metas,
     87                                              gt_semantic_seg,
     88                                              self.train_cfg)
     90 losses.update(add_prefix(loss_decode, 'decode'))
     91 return losses

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/decode_heads/decode_head.py:204, in BaseDecodeHead.forward_train(self, inputs, img_metas, gt_semantic_seg, train_cfg)
    188 """Forward function for training.
    189 Args:
    190     inputs (list[Tensor]): List of multi-level img features.
   (...)
    201     dict[str, Tensor]: a dictionary of loss components
    202 """
    203 seg_logits = self.forward(inputs)
--> 204 losses = self.losses(seg_logits, gt_semantic_seg)
    205 return losses

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py:198, in force_fp32.<locals>.force_fp32_wrapper.<locals>.new_func(*args, **kwargs)
    195     raise TypeError('@force_fp32 can only be used to decorate the '
    196                     'method of nn.Module')
    197 if not (hasattr(args[0], 'fp16_enabled') and args[0].fp16_enabled):
--> 198     return old_func(*args, **kwargs)
    199 # get the arg spec of the decorated method
    200 args_info = getfullargspec(old_func)

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/decode_heads/decode_head.py:252, in BaseDecodeHead.losses(self, seg_logit, seg_label)
    250 for loss_decode in losses_decode:
    251     if loss_decode.loss_name not in loss:
--> 252         loss[loss_decode.loss_name] = loss_decode(
    253             seg_logit,
    254             seg_label,
    255             weight=seg_weight,
    256             ignore_index=self.ignore_index)
    257     else:
    258         loss[loss_decode.loss_name] += loss_decode(
    259             seg_logit,
    260             seg_label,
    261             weight=seg_weight,
    262             ignore_index=self.ignore_index)

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
   1106 # If we don't have any hooks, we want to skip the rest of the logic in
   1107 # this function, and just call forward.
   1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110     return forward_call(*input, **kwargs)
   1111 # Do not call functions when jit is used
   1112 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/losses/cross_entropy_loss.py:271, in CrossEntropyLoss.forward(self, cls_score, label, weight, avg_factor, reduction_override, ignore_index, **kwargs)
    269     class_weight = None
    270 # Note: for BCE loss, label < 0 is invalid.
--> 271 loss_cls = self.loss_weight * self.cls_criterion(
    272     cls_score,
    273     label,
    274     weight,
    275     class_weight=class_weight,
    276     reduction=reduction,
    277     avg_factor=avg_factor,
    278     avg_non_ignore=self.avg_non_ignore,
    279     ignore_index=ignore_index,
    280     **kwargs)
    281 return loss_cls

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/mmseg/models/losses/cross_entropy_loss.py:45, in cross_entropy(pred, label, weight, class_weight, reduction, avg_factor, ignore_index, avg_non_ignore)
     20 """cross_entropy. The wrapper function for :func:`F.cross_entropy`
     21
     22 Args:
   (...)
     40         New in version 0.23.0.
     41 """
     43 # class_weight is a manual rescaling weight given to each class.
     44 # If given, has to be a Tensor of size C element-wise losses
---> 45 loss = F.cross_entropy(
     46     pred,
     47     label,
     48     weight=class_weight,
     49     reduction='none',
     50     ignore_index=ignore_index)
     52 # apply weights and do the reduction
     53 # average loss over non-ignored elements
     54 # pytorch's official cross_entropy average loss over non-ignored elements
     55 # refer to https://github.com/pytorch/pytorch/blob/56b43f4fec1f76953f15a627694d4bba34588969/torch/nn/functional.py#L2660  # noqa
     56 if (avg_factor is None) and avg_non_ignore and reduction == 'mean':

File ~/anaconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/functional.py:2996, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   2994 if size_average is not None or reduce is not None:
   2995     reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2996 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

IndexError: Target 15 is out of bounds.
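
For reference, the last frame of the traceback can be exercised in isolation. The snippet below is a minimal sketch and not part of the original report: the helper name `try_cross_entropy`, the logit shapes and the label values are made up, and it assumes PyTorch running on CPU. With 15 output classes the valid targets are 0 to 14, so a target of 15 should be rejected.

```python
import torch
import torch.nn.functional as F


def try_cross_entropy(num_classes, labels):
    """Return the IndexError message raised by F.cross_entropy, or None."""
    # Hypothetical logits: one row per label, one column per class.
    logits = torch.zeros(len(labels), num_classes)
    target = torch.tensor(labels, dtype=torch.long)
    try:
        F.cross_entropy(logits, target, reduction='none')
    except IndexError as err:
        return str(err)
    return None


# Targets inside [0, 14] are accepted for a 15-class head.
print(try_cross_entropy(15, [0, 7, 14]))
# A target equal to the number of classes is out of range.
print(try_cross_entropy(15, [0, 7, 15]))
```

On CPU the second call is expected to report an out-of-bounds target like the one in the traceback, which suggests comparing the label values in the data with the number of classes the head was built with.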

thanos-sakelliou commented 2 years ago

I have a very similar error (255 out of bounds). For me it occurs when I use avg_non_ignore, like so:

# modify to ignore background
cfg.model.decode_head.ignore_index = 0
cfg.model.decode_head.loss_decode=dict(
            type='CrossEntropyLoss', reduction='mean', use_sigmoid=False, loss_weight=1.0, avg_non_ignore=True)
cfg.model.auxiliary_head.ignore_index = 0
cfg.model.auxiliary_head.loss_decode=dict(
            type='CrossEntropyLoss', reduction='mean', use_sigmoid=False, loss_weight=0.4, avg_non_ignore=True)

firqaaa commented 2 years ago

I tried adding that configuration and changing it to avg_non_ignore=False, but nothing changed; it still returns an error.

linfangjian01 commented 2 years ago

"size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([19, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([15, 512, 1, 1])." This error appears to be a parameter error. I suggest checking the number of output channels for "decode_head. conv_seg" in the current model.

firqaaa commented 2 years ago

I have checked "cfg.model.decode_head", but there is no key named "conv_seg". Also, when I run the code for the first time, it is true that the key "num_classes" has the value 19. But in the Train and Evaluation section of the tutorial notebook we change "cfg.model.decode_head.num_classes" to 8, and that takes effect immediately, right? So I still don't understand where the number 15 comes from.

Here is the dict of my config:

{
  'type': 'PSPHead',
  'in_channels': 2048,
  'in_index': 3,
  'channels': 512,
  'pool_scales': (1, 2, 3, 6),
  'dropout_ratio': 0.1,
  'num_classes': 19,
  'norm_cfg': {'type': 'BN', 'requires_grad': True},
  'align_corners': False,
  'loss_decode': {'type': 'CrossEntropyLoss', 'reduction': 'mean', 'use_sigmoid': False, 'loss_weight': 1.0}
}
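
The size-mismatch messages in the log mention both decode_head and auxiliary_head, so both heads carry their own num_classes. The following is a minimal sketch of setting and checking the value on both, using plain dicts as a stand-in for the real config object; the helper name `set_num_classes` and the trimmed-down example dict are assumptions for illustration, not the tutorial's code.

```python
def set_num_classes(model_cfg, num_classes):
    """Set num_classes on every head present in a dict-like model config."""
    for head in ('decode_head', 'auxiliary_head'):
        if head in model_cfg:
            model_cfg[head]['num_classes'] = num_classes
    return model_cfg


# Stand-in for cfg.model, reduced to the keys that matter here.
model_cfg = {
    'decode_head': {'type': 'PSPHead', 'num_classes': 19},
    'auxiliary_head': {'type': 'FCNHead', 'num_classes': 19},
}

set_num_classes(model_cfg, 8)
for head in ('decode_head', 'auxiliary_head'):
    print(head, model_cfg[head]['num_classes'])
```

Printing the values right before the model is built shows which number the heads will actually be created with.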

linfangjian01 commented 2 years ago

"conv_seg" is defined in mmseg/models/decode_heads/decode_head.py; you can check the parameters from input to output again. I don't think we've had this problem in our implementation, so it could be a bug caused by user configuration. I suggest checking the whole process, from the image input to the backbone and then to the output of the head. You can also use an IDE to search globally for "15" and see where it appears, which may help locate the bug.

firqaaa commented 2 years ago

I have checked and found nothing about "15" in my mmseg/models/decode_heads/decode_head.py:

self.conv_seg = nn.Conv2d(channels, num_classes, kernel_size=1)

Everything is at the default setting; I have never touched this file before.
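
Since the source file contains no literal 15, one possibility worth ruling out is that the value comes from the annotation data. The snippet below is a minimal sketch, not code from the thread, for listing label values in a segmentation map that are neither valid class indices nor the ignore value; the function name, the example array and the default ignore_index of 255 are assumptions.

```python
import numpy as np


def out_of_range_labels(seg_map, num_classes, ignore_index=255):
    """Return sorted label values outside [0, num_classes) that are not ignore_index."""
    values = np.unique(np.asarray(seg_map))
    return [int(v) for v in values
            if v != ignore_index and not (0 <= v < num_classes)]


# Made-up 2x3 label map standing in for one annotation image.
seg_map = np.array([[0, 1, 7], [8, 15, 255]], dtype=np.uint8)
print(out_of_range_labels(seg_map, num_classes=8))
```

For real annotations the array could be obtained with something like np.array(Image.open(path)) from Pillow, and the check repeated for every file in the annotation directory.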

conradfoo commented 2 years ago

Hi,

I had the same issue with the tutorial. I think the issue is that the segmentation maps are saved with all 19 classes, so the ground truth segmentation maps that get loaded during training have 19 labels instead of 8. If you change the code that saves the segmentation maps so that everything greater than 7 is set to 7, it should work:

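import os.path as osp

import mmcv
import numpy as np
from PIL import Image

# data_root, ann_dir and palette are defined earlier in the tutorial notebook.
for file in mmcv.scandir(osp.join(data_root, ann_dir), suffix='.regions.txt'):
  seg_map = np.loadtxt(osp.join(data_root, ann_dir, file)).astype(np.uint8)
  seg_map[seg_map > 7] = 7  # Just add this line: labels above 7 are folded into class 7.

  seg_img = Image.fromarray(seg_map).convert('P')
  seg_img.putpalette(np.array(palette, dtype=np.uint8))

  seg_img.save(osp.join(data_root, ann_dir, file.replace('.regions.txt', '.png')))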

Initially, I tried setting those pixels to 255 (ignore_index), but that didn't seem to fix it.
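
To make the difference between the two remappings concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the tutorial's code: the function names, the example labels and the ignore value of 255 are made up. The first function folds every label at or above num_classes into the last class, as in the fix above; the second marks them with the ignore value, which is the variant reported above as not having helped.

```python
import numpy as np


def clamp_labels(seg_map, num_classes):
    """Fold every label >= num_classes into the last valid class."""
    out = np.asarray(seg_map).copy()
    out[out >= num_classes] = num_classes - 1
    return out


def ignore_labels(seg_map, num_classes, ignore_index=255):
    """Replace every label >= num_classes with ignore_index."""
    out = np.asarray(seg_map).copy()
    out[out >= num_classes] = ignore_index
    return out


labels = np.array([0, 7, 8, 18], dtype=np.uint8)  # made-up label values
print(clamp_labels(labels, 8))
print(ignore_labels(labels, 8))
```

Note that clamping merges several original regions into class 7, so it changes the ground truth rather than only hiding the error.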

HugeHamster-YWang commented 1 year ago

I changed 'num_classes' to 255 and it runs without problems, but I'm not sure whether the result is the same. My dataset is a 2-class segmentation task and I didn't preprocess the grayscale images; maybe that is the problem.
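
If the grayscale masks store the foreground as a bright value such as 255, one preprocessing option is to convert them to class indices before training. The snippet below is a minimal sketch under that assumption; the function name, the threshold of 127 and the example array are hypothetical and would need to match the actual mask encoding.

```python
import numpy as np


def binarize_mask(gray, threshold=127):
    """Map a grayscale mask to class indices: 0 = background, 1 = foreground."""
    gray = np.asarray(gray)
    return (gray > threshold).astype(np.uint8)


# Made-up 2x2 grayscale mask.
mask = np.array([[0, 255], [128, 10]], dtype=np.uint8)
print(binarize_mask(mask))
```

With labels restricted to 0 and 1, num_classes can stay at 2 instead of being raised to cover raw pixel values.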

WEI9957 commented 7 months ago

I encountered the same error and, after various attempts, finally found that it occurs with the CPU in Colab, while it runs normally with the GPU. The problem is solved, but does anyone know why?