open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Realtime segmenters on ADE20k trigger "CUDA error: an illegal memory access was encountered" #2040

Closed: fingertap closed this issue 2 years ago

fingertap commented 2 years ago

I was trying to train real-time segmenters on ADE20K. However, models such as ICNet, MobileNetV3, and Fast-SCNN gave me a "CUDA error: an illegal memory access was encountered" error. The full trace is attached below. I suspect it is a bug in mmseg, but I had trouble digging into it, as mmseg does not give me the option of training on the CPU, where I could get a better error trace. In any case, I need help making real-time segmenters work on ADE20K. BTW, BiSeNetV2 and MobileNetV2 worked fine.

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/spnm/tools/train.py", line 183, in <module>
    main()
  File "/home/spnm/tools/train.py", line 179, in main
    meta=meta)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/apis/train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/root/mmcv/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/root/mmcv/mmcv/runner/iter_based_runner.py", line 64, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/root/mmcv/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/mmcv/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
    gt_semantic_seg)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
    self.train_cfg)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/root/mmcv/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/decode_heads/decode_head.py", line 265, in losses
    seg_logit, seg_label, ignore_index=self.ignore_index)
  File "/usr/local/lib/python3.6/dist-packages/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /root/orion-supported-framework-collection/src/pytorch/pytorch-1.8.0/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fe6f2dccc1d in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x7fe6f2dca0c8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x92a (0x7fe6fb1cf12a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fe6f2db2f04 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6d28fa (0x7fe650bf58fa in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python() [0x54fb56]
frame #6: /usr/bin/python() [0x573500]
frame #7: /usr/bin/python() [0x54f85b]
frame #8: /usr/bin/python() [0x54f85b]
frame #9: /usr/bin/python() [0x589198]
frame #10: /usr/bin/python() [0x5ad918]
frame #11: /usr/bin/python() [0x5ad92e]
frame #12: /usr/bin/python() [0x5ad92e]
frame #13: /usr/bin/python() [0x5ad92e]
frame #14: /usr/bin/python() [0x5ad92e]
frame #15: /usr/bin/python() [0x5ad92e]
frame #16: /usr/bin/python() [0x5ad92e]
frame #17: /usr/bin/python() [0x5ad92e]
frame #18: /usr/bin/python() [0x5ad92e]
frame #19: /usr/bin/python() [0x5ad92e]
frame #20: /usr/bin/python() [0x5ad92e]
frame #21: /usr/bin/python() [0x5ad92e]
frame #22: /usr/bin/python() [0x5ad92e]
frame #23: /usr/bin/python() [0x5ad92e]
frame #24: /usr/bin/python() [0x56bf26]
frame #25: PyDict_SetItemString + 0x153 (0x571673 in /usr/bin/python)
frame #26: PyImport_Cleanup + 0x76 (0x4f2fc6 in /usr/bin/python)
frame #27: Py_FinalizeEx + 0x5e (0x637f6e in /usr/bin/python)
frame #28: Py_Main + 0x395 (0x638fd5 in /usr/bin/python)
frame #29: main + 0xe0 (0x4b0d30 in /usr/bin/python)
frame #30: __libc_start_main + 0xe7 (0x7fe72ede8bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #31: _start + 0x2a (0x5b2a5a in /usr/bin/python)
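A note on reading the trace above: CUDA reports illegal memory accesses asynchronously, so the Python frame it lands on (the accuracy indexing in accuracy.py) is usually not the kernel that actually failed. Running the same command with CUDA_LAUNCH_BLOCKING=1 set in the environment generally makes the error surface at the real failing op. A minimal sketch, assuming the root cause is a label value outside the valid class range (not taken from the issue itself), of how the same situation can be reproduced on the CPU to get a readable error:

# Minimal sketch (assumption, not from the issue): with num_classes=150 the
# valid target values are 0..149, but raw ADE20K annotations contain values up to 150.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 150, 4, 4)                     # num_classes = 150
labels = torch.full((1, 4, 4), 150, dtype=torch.long)  # out-of-range label value

# On CPU this raises a clear "IndexError: Target 150 is out of bounds.";
# on CUDA the same out-of-range index surfaces later as an asynchronous
# "illegal memory access", often in an unrelated frame such as accuracy().
F.cross_entropy(logits, labels, ignore_index=255)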
fingertap commented 2 years ago

This config can reproduce this error:

# icnet_r50-d8_832x832_160k_ade20k.py
_base_ = [
    '../_base_/models/icnet_r50-d8.py',
    '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_160k.py'
]

norm_cfg = dict(type='SyncBN', requires_grad=True)

model = dict(
    type='EncoderDecoder',
    decode_head=dict(
        type='FCNHead',
        in_channels=128,
        channels=128,
        num_convs=1,
        in_index=2,
        dropout_ratio=0,
        num_classes=150,
        norm_cfg=norm_cfg,
        concat_input=False,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=[
        dict(
            type='FCNHead',
            in_channels=128,
            channels=128,
            num_convs=1,
            num_classes=150,
            in_index=0,
            norm_cfg=norm_cfg,
            concat_input=False,
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
        dict(
            type='FCNHead',
            in_channels=128,
            channels=128,
            num_convs=1,
            num_classes=150,
            in_index=1,
            norm_cfg=norm_cfg,
            concat_input=False,
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    ]
)
fingertap commented 2 years ago

All I did was change the num_classes of the decode heads and replace Cityscapes with ADE20K.

fingertap commented 2 years ago

This error occurs when reduce_zero_label is not set.
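For reference, a minimal sketch (assuming the mmseg 0.x config syntax used above) of the dataset-side setting this points at: loading ADE20K annotations with reduce_zero_label=True shifts the raw labels 1..150 down to 0..149 and maps the original 0 (background) to 255, which matches the heads' default ignore_index=255 and num_classes=150.

# Relevant fragment of the ADE20K train pipeline (sketch, not the full
# _base_/datasets/ade20k.py): reduce_zero_label=True remaps the raw labels
# so that every remaining value is a valid class index in [0, 149].
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    # ... remaining transforms (Resize, RandomCrop, Normalize, ...) unchanged
]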