open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Cuda error when adding class_weight to config #3574

Open jhaggle opened 7 months ago

jhaggle commented 7 months ago

I have searched related issues but cannot get the expected help.

I follow this tutorial:

https://github.com/open-mmlab/mmsegmentation/blob/main/demo/MMSegmentation_Tutorial.ipynb

If I follow the tutorial completely unchanged, it works fine.

I then try to add Class Balanced Loss as in this tutorial:

https://mmsegmentation.readthedocs.io/en/latest/advanced_guides/training_tricks.html#class-balanced-loss

I therefore add this line to the cell in the tutorial where the config is modified:

cfg.model.decode_head.loss_decode.update(dict(class_weight=[0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7, 0.1]))
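For reference, a minimal sketch of the constraint involved (assuming 8 classes, matching the length of the weight list above; the `loss_decode` dict below just mimics the config entry and is not the real mmsegmentation object):

```python
# Sketch of the class_weight constraint (hypothetical stand-in dict,
# not the actual mmsegmentation Config object):
loss_decode = dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)

class_weight = [0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7, 0.1]
num_classes = 8  # assumed to match cfg.model.decode_head.num_classes

# One weight per class, in class-index order; a mismatch triggers
# the "weight tensor should be defined either for all or no classes" error.
assert len(class_weight) == num_classes

loss_decode.update(dict(class_weight=class_weight))
print(loss_decode['class_weight'])
```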

However, adding this line results in the following CUDA error:

File /jupyterlab/mmsegmentation/mmseg/models/losses/accuracy.py:49, in accuracy(pred, target, topk, thresh, ignore_index)
     47     correct = correct & (pred_value > thresh).t()
     48 if ignore_index is not None:
---> 49     correct = correct[:, target != ignore_index]
     50 res = []
     51 eps = torch.finfo(torch.float32).eps

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

If I add os.environ['CUDA_LAUNCH_BLOCKING'] = "1" to the code, as the CUDA error message suggests, I instead get this:

File /jupyterlab/mmsegmentation/mmseg/models/losses/cross_entropy_loss.py:66, in <listcomp>(.0)
     62         avg_factor = label.numel()
     64 else:
     65     # the average factor should take the class weights into account
---> 66     label_weights = torch.stack([class_weight[cls] for cls in label
     67                                  ]).to(device=class_weight.device)
     69     if avg_non_ignore:
     70         label_weights[label == ignore_index] = 0

RuntimeError: CUDA error: device-side assert triggered

The number of class weights I have set should match my number of classes. If I change the list to a different length, I instead get this error:

File /venv/lib/python3.10/site-packages/torch/nn/functional.py:3014, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3012 if size_average is not None or reduce is not None:
   3013     reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3014 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: weight tensor should be defined either for all or no classes
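This second error can be reproduced on the CPU with plain PyTorch, independent of mmsegmentation (a minimal sketch; the shapes here are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Sketch: 8 predicted classes, but a weight tensor of a different length.
logits = torch.randn(4, 8)           # 4 samples, 8 classes (made-up shapes)
target = torch.tensor([0, 1, 2, 3])
bad_weight = torch.ones(6)           # wrong length: neither 8 entries nor None

try:
    F.cross_entropy(logits, target, weight=bad_weight)
except RuntimeError as e:
    # Message says the weight tensor must cover all classes or be omitted
    print(e)
```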

How can I avoid this error? And what is causing it?

talebolano commented 6 months ago

@jhaggle I have the same issue. I found that when I set class_weight, I get the same error because the default ignore_index (255) exceeds the valid index range of class_weight (the failing expression is class_weight[cls]). Replacing that section with the code below avoids the error:

    if (avg_factor is None) and avg_non_ignore and reduction == 'mean':
        avg_factor = label.numel() - (label == ignore_index).sum().item()
    if weight is not None:
        weight = weight.float()
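The root cause described above can be reproduced on the CPU, where the out-of-range index surfaces as a plain IndexError instead of a device-side assert (a minimal sketch of the failing indexing in cross_entropy_loss.py, not the full loss):

```python
import torch

# Sketch: 8 classes, so class_weight has 8 entries, but the label map
# contains the default ignore_index value 255 for ignored pixels.
class_weight = torch.tensor([0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.7, 0.1])
label = torch.tensor([0, 2, 255])  # 255 = ignore_index pixel

try:
    # Same pattern as the failing line: class_weight[255] is out of bounds.
    torch.stack([class_weight[cls] for cls in label])
except IndexError as e:
    print('IndexError:', e)  # on the GPU this shows up as a device-side assert
```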
innavoig23 commented 6 months ago

I don't understand. What should I do to avoid this CUDA error? @talebolano