open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Error class_weight when training #3630

Closed thangmanhbris closed 2 months ago

thangmanhbris commented 2 months ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

Training fails with an error as soon as I set the class weights in the config. Without the class weight change, training runs without errors.

/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
04/10 11:51:48 - mmengine - WARNING - The prefix is not set in metric class IoUMetric.
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
04/10 11:51:48 - mmengine - INFO - load model from: open-mmlab://resnet50_v1c
04/10 11:51:48 - mmengine - INFO - Loads checkpoint by openmmlab backend from path: open-mmlab://resnet50_v1c
04/10 11:51:48 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

Loads checkpoint by local backend from path: pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
The model and loaded state dict do not match exactly

size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([19, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 512, 1, 1]).
size mismatch for decode_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for auxiliary_head.conv_seg.weight: copying a param with shape torch.Size([19, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 256, 1, 1]).
size mismatch for auxiliary_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([8]).
04/10 11:51:48 - mmengine - INFO - Load checkpoint from pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
04/10 11:51:48 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
04/10 11:51:48 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
04/10 11:51:48 - mmengine - INFO - Checkpoints will be saved to /content/mmsegmentation/work_dirs/tutorial.

RuntimeError                              Traceback (most recent call last)
in <cell line: 2>()
      1 # start training
----> 2 runner.train()

11 frames
/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py in train(self)
   1775         self._maybe_compile('train_step')
   1776
-> 1777         model = self.train_loop.run()  # type: ignore
   1778         self.call_hook('after_run')
   1779         return model

/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py in run(self)
    284
    285             data_batch = next(self.dataloader_iterator)
--> 286             self.run_iter(data_batch)
    287
    288         self._decide_current_val_interval()

/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py in run_iter(self, data_batch)
    307         # synchronization during gradient accumulation process.
    308         # outputs should be a dict of loss.
--> 309         outputs = self.runner.model.train_step(
    310             data_batch, optim_wrapper=self.runner.optim_wrapper)
    311

/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/base_model.py in train_step(self, data, optim_wrapper)
    112         with optim_wrapper.optim_context(self):
    113             data = self.data_preprocessor(data, True)
--> 114             losses = self._run_forward(data, mode='loss')  # type: ignore
    115         parsed_losses, log_vars = self.parse_losses(losses)  # type: ignore
    116         optim_wrapper.update_params(parsed_losses)

/usr/local/lib/python3.10/dist-packages/mmengine/model/base_model/base_model.py in _run_forward(self, data, mode)
    359         """
    360         if isinstance(data, dict):
--> 361             results = self(**data, mode=mode)
    362         elif isinstance(data, (list, tuple)):
    363             results = self(*data, mode=mode)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/content/mmsegmentation/mmseg/models/segmentors/base.py in forward(self, inputs, data_samples, mode)
     92         """
     93         if mode == 'loss':
---> 94             return self.loss(inputs, data_samples)
     95         elif mode == 'predict':
     96             return self.predict(inputs, data_samples)

/content/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in loss(self, inputs, data_samples)
    176         losses = dict()
    177
--> 178         loss_decode = self._decode_head_forward_train(x, data_samples)
    179         losses.update(loss_decode)
    180

/content/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py in _decode_head_forward_train(self, inputs, data_samples)
    137         training."""
    138         losses = dict()
--> 139         loss_decode = self.decode_head.loss(inputs, data_samples,
    140                                             self.train_cfg)
    141

/content/mmsegmentation/mmseg/models/decode_heads/decode_head.py in loss(self, inputs, batch_data_samples, train_cfg)
    260         """
    261         seg_logits = self.forward(inputs)
--> 262         losses = self.loss_by_feat(seg_logits, batch_data_samples)
    263         return losses
    264

/content/mmsegmentation/mmseg/models/decode_heads/decode_head.py in loss_by_feat(self, seg_logits, batch_data_samples)
    334                     ignore_index=self.ignore_index)
    335
--> 336         loss['acc_seg'] = accuracy(
    337             seg_logits, seg_label, ignore_index=self.ignore_index)
    338         return loss

/content/mmsegmentation/mmseg/models/losses/accuracy.py in accuracy(pred, target, topk, thresh, ignore_index)
     47         correct = correct & (pred_value > thresh).t()
     48     if ignore_index is not None:
---> 49         correct = correct[:, target != ignore_index]
     50     res = []
     51     eps = torch.finfo(torch.float32).eps

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
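The error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1, since the frame shown above (accuracy.py) may not be where the assert actually fired. In a notebook, one way to do this is to set the variable from Python. The sketch below is only that, and it assumes a freshly restarted runtime in which the cell runs before torch is imported or any CUDA work has happened:

```python
import os

# Assumption: fresh runtime, nothing has touched CUDA yet.
# The variable is read when CUDA is initialized, so it is set
# before `import torch` and before any model or tensor is created.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Confirm the setting is visible to this process.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

With synchronous kernel launches, the traceback from a rerun should point closer to the call that triggered the assert. Running the same step on CPU is another common way to get a more readable error.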

Reproduction

I ran the MMSegmentation tutorial notebook on Colab:

Check nvcc version

!nvcc -V

Check GCC version

!gcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Install PyTorch

!pip install torch==1.12.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu113

Install MMCV

!pip install openmim
!mim install mmengine

!mim install mmcv-full==1.6.0

!mim install 'mmcv>=2.0.0rc4'

!rm -rf mmsegmentation
!git clone -b main https://github.com/open-mmlab/mmsegmentation.git
%cd mmsegmentation
!pip install -e .

!pip install ftfy

Check Pytorch installation

import torch, torchvision
print(torch.__version__, torch.cuda.is_available())

Check MMSegmentation installation

import mmseg
print(mmseg.__version__)

1.12.0+cu113 True
1.2.2

''' Change the class weights '''
cfg.model.decode_head.loss_decode.class_weight = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
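Device-side asserts are often caused by an index falling outside the valid range, so two things may be worth ruling out before digging further: that the weight list has exactly one entry per class (8 here, going by the `torch.Size([8, ...])` shapes in the log), and that the label maps contain no unexpected values. The sketch below uses two hypothetical helpers, `check_class_weight` and `out_of_range_labels`, which are not part of MMSegmentation. The ignore value of 255 is an assumption and should be replaced with whatever the config actually uses. This is a sanity check, not a diagnosis of this issue.

```python
def check_class_weight(class_weight, num_classes):
    """Hypothetical helper: True if there is exactly one weight per class."""
    return len(class_weight) == num_classes


def out_of_range_labels(label_values, num_classes, ignore_index=255):
    """Hypothetical helper: label values that are neither a valid class id
    (0 .. num_classes - 1) nor the ignore index (assumed to be 255 here)."""
    return sorted({
        int(v) for v in label_values
        if v != ignore_index and not 0 <= v < num_classes
    })


# The weights from this report: eight entries for eight classes.
class_weight = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(check_class_weight(class_weight, num_classes=8))  # True

# Illustrative label values only (not taken from the dataset);
# in practice these could come from numpy.unique() on each mask.
print(out_of_range_labels([0, 3, 7, 255, 8], num_classes=8))  # [8]
```

If both checks pass, the remaining suspect is how the loss handles ignored pixels once class weights are enabled, which a rerun with synchronous CUDA launches or on CPU should help narrow down.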

  1. What command or script did you run?

    A placeholder for the command.
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

     ''' Change the class weights '''
     cfg.model.decode_head.loss_decode.class_weight = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

  3. What dataset did you use?

     I used the Stanford Background Dataset as an example. Attachment: MMSegmentation_Tutorial_error.zip

Environment

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!