open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) #1618

Open · deepakkupanda opened this issue 2 years ago

deepakkupanda commented 2 years ago

I am trying to run the BEiT algorithm on the DUTS dataset:

```
tools/dist_train.sh configs/beit/upernet_beit-base_640x640_80k_duts_ms.py 1 --work-dir work_dirs/upernet_beit-base_640x640_80k_duts/ --deterministic
```

```
2022-05-26 12:51:23,588 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/work_dirs/upernet_beit-base_640x640_80k_duts by HardDiskBackend.
2022-05-26 12:54:05,915 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 240, in <module>
    main()
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 229, in main
    train_segmentor(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/apis/train.py", line 191, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
    return old_func(*args, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
    loss['acc_seg'] = accuracy(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646755897462/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f79ad35d1bd in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f037 (0x7f79df9aa037 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f79df9ae3ea in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2ecd68 (0x7f7a303a3d68 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f79ad343fb5 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1db609 (0x7f7a30292609 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4c671c (0x7f7a3057d71c in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f7a3057da22 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x13e79b (0x564387c4079b in /anaconda/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0x13de78 (0x564387c3fe78 in /anaconda/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0x13dd53 (0x564387c3fd53 in /anaconda/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0x13e0fc (0x564387c400fc in /anaconda/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0x13ec11 (0x564387c40c11 in /anaconda/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #18: <unknown function> + 0x15673e (0x564387c5873e in /anaconda/envs/open-mmlab/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x564387ca0e04 in /anaconda/envs/open-mmlab/bin/python)
frame #20: <unknown function> + 0x28d46d (0x564387d8f46d in /anaconda/envs/open-mmlab/bin/python)
frame #21: Py_FinalizeEx + 0x175 (0x564387d8f9c5 in /anaconda/envs/open-mmlab/bin/python)
frame #22: Py_RunMain + 0x1af (0x564387d9440f in /anaconda/envs/open-mmlab/bin/python)
frame #23: Py_BytesMain + 0x39 (0x564387d947d9 in /anaconda/envs/open-mmlab/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f7a68a07bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x2125d4 (0x564387d145d4 in /anaconda/envs/open-mmlab/bin/python)
```

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 12206) of binary: /anaconda/envs/open-mmlab/bin/python
tools/dist_train.sh: line 19: 12194 Segmentation fault (core dumped) python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --nproc_per_node=$GPUS --master_port=$PORT $(dirname "$0")/train.py $CONFIG --seed 0 --launcher pytorch ${@:3}
```
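The error message itself suggests passing CUDA_LAUNCH_BLOCKING=1 so the illegal memory access is reported synchronously at the kernel that actually failed, rather than at a later API call. A minimal sketch of one way to do that; placing it at the top of tools/train.py is just an illustration and not something prescribed by the report:

```python
# Force synchronous CUDA kernel launches so the illegal-memory-access error
# surfaces at the real failure point instead of a later, unrelated call.
# Placement (top of tools/train.py, before torch/CUDA is initialized) is an
# assumption for illustration; exporting the variable in the shell works too.
import os

os.environ.setdefault('CUDA_LAUNCH_BLOCKING', '1')
```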

deepakkupanda commented 2 years ago

For comparison, I am able to run PSPNet on the ADE20K dataset:

```
tools/dist_train.sh configs/pspnet/pspnet_r101-d8_512x512_80k_ade20k.py 1
```

deepakkupanda commented 2 years ago

@xiaoachen98 Please help me solve this error.

deepakkupanda commented 2 years ago

@donglixp Please help me solve this error.

MeowZheng commented 2 years ago

Based on your error log:

"/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

I think something might be wrong with ignore_index. What is the ignore_index in your config?
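For reference, ignore_index is a parameter of the decode head (BaseDecodeHead defaults it to 255), and it has to be consistent with num_classes and with the label values actually stored in the ground-truth masks. A minimal sketch of where it typically appears in an UPerNet-style model config; the numbers below are illustrative placeholders, not values taken from the upernet_beit config:

```python
# Sketch only: typical location of ignore_index in an mmsegmentation model
# config. num_classes=2 is a placeholder for a binary DUTS-style setup.
model = dict(
    decode_head=dict(
        num_classes=2,       # must be greater than every label id in the GT masks
        ignore_index=255,    # BaseDecodeHead default; pixels with this value are skipped
        loss_decode=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
    ),
    auxiliary_head=dict(
        num_classes=2,
        ignore_index=255,
    ),
)
```

If any ground-truth pixel holds a value that is greater than or equal to num_classes and is not the ignore_index, the loss/accuracy kernels can fail with exactly this kind of asynchronous illegal memory access. A quick sanity check on one annotation file; the path below is a placeholder for wherever the converted DUTS masks live:

```python
import numpy as np
from PIL import Image

# Placeholder path: point this at one of your converted DUTS annotation maps.
mask = np.array(Image.open('data/DUTS/annotations/training/example.png'))
print('unique label values:', np.unique(mask))
# Expected: values in [0, num_classes) plus, optionally, the ignore_index (255).
```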

deepakkupanda commented 2 years ago

@MeowZheng Can you tell me where exactly to look for ignore_index? I am not able to locate it.

deepakkupanda commented 2 years ago

@MeowZheng Gentle reminder: I am still not able to locate ignore_index.