open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Intermittent segfault errors #1807

Closed: FranzEricSchneider closed this issue 2 years ago

FranzEricSchneider commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help. (Yes, found no issues related to segfaults)
  2. The bug has not been fixed in the latest version. (Yes)

Describe the bug

I am trying to run mmsegmentation repeatedly in the mmseg Docker container, with variations. I am running on the same set of images, which are a dataset of my own labeled images. Every so often the training fails partway through with a segmentation fault error.

Note that this is the same system as https://github.com/open-mmlab/mmsegmentation/issues/1806 but with a different error, so a lot of the information will be the same.

Reproduction

  1. What command or script did you run?

I am running python tools/train.py /mmsegmentation/configs/{model_name}/{MCFG} --work-dir {DATA}{workdir}/ with a variety of model configs and unique work dirs (a rough sketch of the driver loop is included at the end of this section). I do not know how to reproduce the error; it only appears intermittently, and has shown up about 3 times in 51 runs (most with 6k iterations, a few with 30k). I don't really have a clue how to begin debugging this, so any debug suggestions would be appreciated.

  2. Did you make any modifications to the code or config? Did you understand what you have modified?

No modifications have been made to the code, but in the course of testing variations I have made a number of modifications to both the dataset config (trying a variety of augmentations) and the model configs (BiSeNet-v2 and Segformer). I believe I understand the modifications; in support of that, training runs to completion roughly 95% of the time and fails as described below the remaining ~5%.

  3. What dataset did you use?

I have a custom-labeled dataset of 128 images at 2048x2448 resolution in img_dir/train/. It has 6 classes, and the decode heads of the model configs have been modified to reflect that (a quick way to sanity-check this is sketched right below).
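
For reference, the class-count change can be sanity-checked with mmcv's Config API. This is only a minimal sketch; the config path below is a placeholder, not one of my actual files:

from mmcv import Config

# Placeholder path -- substitute the modified model config being trained.
cfg = Config.fromfile("/mmsegmentation/configs/bisenetv2/my_6class_config.py")

# Collect the decode head plus any auxiliary heads (BiSeNet-v2 defines several;
# they show up as aux_0..aux_3 in the training logs below).
heads = [cfg.model.decode_head]
aux = cfg.model.get("auxiliary_head", [])
heads += aux if isinstance(aux, list) else [aux]

# Every head should agree with the 6-class dataset
# (background, vine, trunk, post, leaf, sign).
for i, head in enumerate(heads):
    print(i, head["type"], head["num_classes"])
    assert head["num_classes"] == 6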

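For completeness, the repeated runs mentioned in item 1 are driven by a simple loop over configs and work dirs. The sketch below is only an illustration of that idea; the config names, the DATA root, and the work-dir naming are placeholders, not my actual script:

import subprocess
import time
from pathlib import Path

DATA = Path("/path/to/output")  # placeholder for the actual {DATA} root

# Placeholder (model_name, config_file) pairs -- not the real config names.
RUNS = [
    ("bisenetv2", "bisenetv2_6class.py"),
    ("segformer", "segformer_6class.py"),
]

for model_name, mcfg in RUNS:
    workdir = DATA / f"WORKDIR_{int(time.time() * 1e6)}"  # unique work dir per run
    subprocess.run(
        [
            "python", "tools/train.py",
            f"/mmsegmentation/configs/{model_name}/{mcfg}",
            "--work-dir", f"{workdir}/",
        ],
        check=True,  # stop the loop if a run dies (e.g., on the segfault)
    )
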
Environment

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
sys.platform: linux
Python: 3.7.7 (default, May  7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.6.0
MMCV: 1.3.13
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMSegmentation: 0.26.0+891448f

Error traceback

2022-07-13 05:37:10,224 - mmseg - INFO - Iter [3400/6000]   lr: 4.766e-03, eta: 0:32:08, time: 0.703, data_time: 0.268, memory: 7978, decode.loss_ce: 0.2584, decode.acc_seg: 89.9144, aux_0.loss_ce: 0.4057, aux_0.acc_seg: 85.5554, aux_1.loss_ce: 0.3346, aux_1.acc_seg: 87.5833, aux_2.loss_ce: 0.3693, aux_2.acc_seg: 84.5047, aux_3.loss_ce: 0.4570, aux_3.acc_seg: 78.7329, loss: 1.8250
2022-07-13 05:37:45,615 - mmseg - INFO - Iter [3450/6000]   lr: 4.685e-03, eta: 0:31:29, time: 0.708, data_time: 0.273, memory: 7978, decode.loss_ce: 0.2573, decode.acc_seg: 89.9713, aux_0.loss_ce: 0.3945, aux_0.acc_seg: 86.0685, aux_1.loss_ce: 0.3284, aux_1.acc_seg: 87.7375, aux_2.loss_ce: 0.3652, aux_2.acc_seg: 84.6569, aux_3.loss_ce: 0.4518, aux_3.acc_seg: 78.8945, loss: 1.7973
2022-07-13 05:38:22,484 - mmseg - INFO - Saving checkpoint at 3500 iterations
2022-07-13 05:38:22,763 - mmseg - INFO - Iter [3500/6000]   lr: 4.604e-03, eta: 0:30:52, time: 0.744, data_time: 0.303, memory: 7978, decode.loss_ce: 0.2540, decode.acc_seg: 90.0477, aux_0.loss_ce: 0.3951, aux_0.acc_seg: 85.9992, aux_1.loss_ce: 0.3257, aux_1.acc_seg: 87.7952, aux_2.loss_ce: 0.3627, aux_2.acc_seg: 84.7408, aux_3.loss_ce: 0.4547, aux_3.acc_seg: 78.8021, loss: 1.7924
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 17/17, 1.4 task/s, elapsed: 13s, ETA:     0s[                                                  ] 0/17, elapsed: 0s, ETA:
[>>                                ] 1/17, 0.4 task/s, elapsed: 3s, ETA:    45s
[>>>>                              ] 2/17, 0.6 task/s, elapsed: 3s, ETA:    26s
[>>>>>>                            ] 3/17, 0.8 task/s, elapsed: 4s, ETA:    19s
[>>>>>>>>                          ] 4/17, 0.9 task/s, elapsed: 5s, ETA:    15s
[>>>>>>>>>>                        ] 5/17, 1.0 task/s, elapsed: 5s, ETA:    12s
[>>>>>>>>>>>>                      ] 6/17, 1.0 task/s, elapsed: 6s, ETA:    11s
[>>>>>>>>>>>>>>                    ] 7/17, 1.1 task/s, elapsed: 6s, ETA:     9s
[>>>>>>>>>>>>>>>>                  ] 8/17, 1.2 task/s, elapsed: 7s, ETA:     8s
[>>>>>>>>>>>>>>>>>>                ] 9/17, 1.2 task/s, elapsed: 7s, ETA:     7s
[>>>>>>>>>>>>>>>>>>>              ] 10/17, 1.2 task/s, elapsed: 8s, ETA:     6s
[>>>>>>>>>>>>>>>>>>>>>            ] 11/17, 1.3 task/s, elapsed: 9s, ETA:     5s
[>>>>>>>>>>>>>>>>>>>>>>>          ] 12/17, 1.3 task/s, elapsed: 9s, ETA:     4s
[>>>>>>>>>>>>>>>>>>>>>>>>        ] 13/17, 1.3 task/s, elapsed: 10s, ETA:     3s
[>>>>>>>>>>>>>>>>>>>>>>>>>>      ] 14/17, 1.3 task/s, elapsed: 11s, ETA:     2s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>    ] 15/17, 1.3 task/s, elapsed: 11s, ETA:     1s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>  ] 16/17, 1.4 task/s, elapsed: 12s, ETA:     1s
2022-07-13 05:38:35,215 - mmseg - INFO - per class results:
2022-07-13 05:38:35,216 - mmseg - INFO - 
+------------+-------+-------+
|   Class    |  IoU  |  Acc  |
+------------+-------+-------+
| background | 92.58 | 96.33 |
|    vine    | 61.44 |  77.8 |
|   trunk    | 50.89 | 55.85 |
|    post    | 63.34 | 88.58 |
|    leaf    | 28.53 | 29.81 |
|    sign    |  80.0 | 89.11 |
+------------+-------+-------+
2022-07-13 05:38:35,216 - mmseg - INFO - Summary:
2022-07-13 05:38:35,216 - mmseg - INFO - 
+-------+-------+-------+
|  aAcc |  mIoU |  mAcc |
+-------+-------+-------+
| 92.21 | 62.79 | 72.91 |
+-------+-------+-------+
2022-07-13 05:38:35,216 - mmseg - INFO - Iter(val) [17] aAcc: 0.9221, mIoU: 0.6279, mAcc: 0.7291, IoU.background: 0.9258, IoU.vine: 0.6144, IoU.trunk: 0.5089, IoU.post: 0.6334, IoU.leaf: 0.2853, IoU.sign: 0.8000, Acc.background: 0.9633, Acc.vine: 0.7780, Acc.trunk: 0.5585, Acc.post: 0.8858, Acc.leaf: 0.2981, Acc.sign: 0.8911
2022-07-13 05:39:10,652 - mmseg - INFO - Iter [3550/6000]   lr: 4.523e-03, eta: 0:30:23, time: 0.957, data_time: 0.522, memory: 7978, decode.loss_ce: 0.2495, decode.acc_seg: 90.1486, aux_0.loss_ce: 0.3914, aux_0.acc_seg: 86.2369, aux_1.loss_ce: 0.3207, aux_1.acc_seg: 87.9582, aux_2.loss_ce: 0.3573, aux_2.acc_seg: 84.9161, aux_3.loss_ce: 0.4454, aux_3.acc_seg: 79.0749, loss: 1.7643
2022-07-13 05:39:46,126 - mmseg - INFO - Iter [3600/6000]   lr: 4.442e-03, eta: 0:29:44, time: 0.709, data_time: 0.275, memory: 7978, decode.loss_ce: 0.2518, decode.acc_seg: 90.1142, aux_0.loss_ce: 0.3845, aux_0.acc_seg: 86.2842, aux_1.loss_ce: 0.3220, aux_1.acc_seg: 87.9527, aux_2.loss_ce: 0.3617, aux_2.acc_seg: 84.7242, aux_3.loss_ce: 0.4522, aux_3.acc_seg: 78.8049, loss: 1.7723
2022-07-13 05:40:23,086 - mmseg - INFO - Iter [3650/6000]   lr: 4.360e-03, eta: 0:29:07, time: 0.739, data_time: 0.304, memory: 7978, decode.loss_ce: 0.2526, decode.acc_seg: 90.1285, aux_0.loss_ce: 0.3876, aux_0.acc_seg: 86.2488, aux_1.loss_ce: 0.3207, aux_1.acc_seg: 88.0205, aux_2.loss_ce: 0.3574, aux_2.acc_seg: 84.9411, aux_3.loss_ce: 0.4482, aux_3.acc_seg: 79.0520, loss: 1.7664
2022-07-13 05:40:58,590 - mmseg - INFO - Iter [3700/6000]   lr: 4.279e-03, eta: 0:28:29, time: 0.710, data_time: 0.276, memory: 7978, decode.loss_ce: 0.2537, decode.acc_seg: 90.0376, aux_0.loss_ce: 0.3896, aux_0.acc_seg: 86.1735, aux_1.loss_ce: 0.3228, aux_1.acc_seg: 87.8744, aux_2.loss_ce: 0.3622, aux_2.acc_seg: 84.6782, aux_3.loss_ce: 0.4531, aux_3.acc_seg: 78.7201, loss: 1.7814
2022-07-13 05:41:33,966 - mmseg - INFO - Iter [3750/6000]   lr: 4.197e-03, eta: 0:27:51, time: 0.708, data_time: 0.273, memory: 7978, decode.loss_ce: 0.2489, decode.acc_seg: 90.3267, aux_0.loss_ce: 0.3862, aux_0.acc_seg: 86.3536, aux_1.loss_ce: 0.3177, aux_1.acc_seg: 88.2530, aux_2.loss_ce: 0.3564, aux_2.acc_seg: 85.1786, aux_3.loss_ce: 0.4450, aux_3.acc_seg: 79.3686, loss: 1.7543
2022-07-13 05:42:11,189 - mmseg - INFO - Iter [3800/6000]   lr: 4.115e-03, eta: 0:27:13, time: 0.744, data_time: 0.310, memory: 7978, decode.loss_ce: 0.2491, decode.acc_seg: 90.2439, aux_0.loss_ce: 0.4027, aux_0.acc_seg: 85.8286, aux_1.loss_ce: 0.3309, aux_1.acc_seg: 87.7349, aux_2.loss_ce: 0.3590, aux_2.acc_seg: 84.9446, aux_3.loss_ce: 0.4413, aux_3.acc_seg: 79.3486, loss: 1.7830
2022-07-13 05:42:46,540 - mmseg - INFO - Iter [3850/6000]   lr: 4.033e-03, eta: 0:26:35, time: 0.707, data_time: 0.272, memory: 7978, decode.loss_ce: 0.2555, decode.acc_seg: 90.0167, aux_0.loss_ce: 0.3989, aux_0.acc_seg: 85.8427, aux_1.loss_ce: 0.3296, aux_1.acc_seg: 87.6695, aux_2.loss_ce: 0.3643, aux_2.acc_seg: 84.7177, aux_3.loss_ce: 0.4525, aux_3.acc_seg: 78.7560, loss: 1.8007
2022-07-13 05:43:21,780 - mmseg - INFO - Iter [3900/6000]   lr: 3.950e-03, eta: 0:25:57, time: 0.705, data_time: 0.270, memory: 7978, decode.loss_ce: 0.2559, decode.acc_seg: 89.9446, aux_0.loss_ce: 0.3918, aux_0.acc_seg: 86.1086, aux_1.loss_ce: 0.3286, aux_1.acc_seg: 87.7121, aux_2.loss_ce: 0.3675, aux_2.acc_seg: 84.4662, aux_3.loss_ce: 0.4551, aux_3.acc_seg: 78.5672, loss: 1.7989
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 964, in _next_data
    raise StopIteration
StopIteration

Bug fix

I do not have a bug fix. However, I noticed something interesting: the validation pass (17 images every 500 iterations in the run posted above) appears to have a lagging image. You can see that it processes up to 16/17, then does a whole batch of further training, then processes 17/17, then crashes.

That said, I went back to a passing training run and it exhibits the same behavior. This training run ran to completion:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>    ] 15/17, 1.3 task/s, elapsed: 11s, ETA:     2s
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>  ] 16/17, 1.3 task/s, elapsed: 12s, ETA:     1s
2022-07-13 20:31:12,197 - mmseg - INFO - per class results:
2022-07-13 20:31:12,198 - mmseg - INFO - 
+------------+-------+-------+
|   Class    |  IoU  |  Acc  |
+------------+-------+-------+
| background | 92.17 | 98.09 |
|    vine    | 56.17 | 63.13 |
|   trunk    |  63.2 | 75.52 |
|    post    | 67.57 | 77.68 |
|    leaf    |  34.1 | 39.26 |
|    sign    | 81.48 | 89.06 |
+------------+-------+-------+
2022-07-13 20:31:12,198 - mmseg - INFO - Summary:
2022-07-13 20:31:12,199 - mmseg - INFO - 
+-------+-------+-------+
|  aAcc |  mIoU |  mAcc |
+-------+-------+-------+
| 92.19 | 65.78 | 73.79 |
+-------+-------+-------+
2022-07-13 20:31:12,199 - mmseg - INFO - Exp name: model_config.py
2022-07-13 20:31:12,199 - mmseg - INFO - Iter(val) [17] aAcc: 0.9219, mIoU: 0.6578, mAcc: 0.7379, IoU.background: 0.9217, IoU.vine: 0.5617, IoU.trunk: 0.6320, IoU.post: 0.6757, IoU.leaf: 0.3410, IoU.sign: 0.8148, Acc.background: 0.9809, Acc.vine: 0.6313, Acc.trunk: 0.7552, Acc.post: 0.7768, Acc.leaf: 0.3926, Acc.sign: 0.8906
2022-07-13 20:31:51,521 - mmseg - INFO - Iter [1050/6000]   lr: 8.428e-03, eta: 1:08:04, time: 1.036, data_time: 0.600, memory: 7978, decode.loss_ce: 0.2987, decode.acc_seg: 87.1081, aux_0.loss_ce: 0.3554, aux_0.acc_seg: 86.2538, aux_1.loss_ce: 0.3429, aux_1.acc_seg: 85.6464, aux_2.loss_ce: 0.4007, aux_2.acc_seg: 81.6827, aux_3.loss_ce: 0.4809, aux_3.acc_seg: 77.5501, loss: 1.8786
2022-07-13 20:32:33,179 - mmseg - INFO - Iter [1100/6000]   lr: 8.352e-03, eta: 1:07:25, time: 0.833, data_time: 0.396, memory: 7978, decode.loss_ce: 0.3005, decode.acc_seg: 87.0382, aux_0.loss_ce: 0.3486, aux_0.acc_seg: 86.5057, aux_1.loss_ce: 0.3397, aux_1.acc_seg: 85.6771, aux_2.loss_ce: 0.4002, aux_2.acc_seg: 81.6403, aux_3.loss_ce: 0.4844, aux_3.acc_seg: 77.3029, loss: 1.8734
2022-07-13 20:33:12,453 - mmseg - INFO - Iter [1150/6000]   lr: 8.276e-03, eta: 1:06:35, time: 0.785, data_time: 0.349, memory: 7978, decode.loss_ce: 0.3001, decode.acc_seg: 86.9343, aux_0.loss_ce: 0.3549, aux_0.acc_seg: 86.2750, aux_1.loss_ce: 0.3505, aux_1.acc_seg: 85.2047, aux_2.loss_ce: 0.4087, aux_2.acc_seg: 81.2553, aux_3.loss_ce: 0.4862, aux_3.acc_seg: 77.3371, loss: 1.9005
2022-07-13 20:33:51,638 - mmseg - INFO - Iter [1200/6000]   lr: 8.200e-03, eta: 1:05:46, time: 0.784, data_time: 0.347, memory: 7978, decode.loss_ce: 0.2933, decode.acc_seg: 87.2231, aux_0.loss_ce: 0.3426, aux_0.acc_seg: 86.7179, aux_1.loss_ce: 0.3369, aux_1.acc_seg: 85.6778, aux_2.loss_ce: 0.3962, aux_2.acc_seg: 81.7005, aux_3.loss_ce: 0.4765, aux_3.acc_seg: 77.6125, loss: 1.8455
2022-07-13 20:34:33,427 - mmseg - INFO - Iter [1250/6000]   lr: 8.124e-03, eta: 1:05:07, time: 0.836, data_time: 0.399, memory: 7978, decode.loss_ce: 0.2989, decode.acc_seg: 86.9466, aux_0.loss_ce: 0.3415, aux_0.acc_seg: 86.6343, aux_1.loss_ce: 0.3402, aux_1.acc_seg: 85.5249, aux_2.loss_ce: 0.3998, aux_2.acc_seg: 81.4783, aux_3.loss_ce: 0.4803, aux_3.acc_seg: 77.3864, loss: 1.8607
2022-07-13 20:35:13,031 - mmseg - INFO - Iter [1300/6000]   lr: 8.048e-03, eta: 1:04:21, time: 0.792, data_time: 0.356, memory: 7978, decode.loss_ce: 0.2893, decode.acc_seg: 87.3710, aux_0.loss_ce: 0.3321, aux_0.acc_seg: 87.0348, aux_1.loss_ce: 0.3296, aux_1.acc_seg: 85.9881, aux_2.loss_ce: 0.3913, aux_2.acc_seg: 81.9107, aux_3.loss_ce: 0.4709, aux_3.acc_seg: 77.9124, loss: 1.8131
2022-07-13 20:35:52,299 - mmseg - INFO - Iter [1350/6000]   lr: 7.972e-03, eta: 1:03:34, time: 0.785, data_time: 0.349, memory: 7978, decode.loss_ce: 0.2898, decode.acc_seg: 87.3075, aux_0.loss_ce: 0.3336, aux_0.acc_seg: 87.0116, aux_1.loss_ce: 0.3314, aux_1.acc_seg: 85.9027, aux_2.loss_ce: 0.3940, aux_2.acc_seg: 81.7306, aux_3.loss_ce: 0.4780, aux_3.acc_seg: 77.5172, loss: 1.8267
2022-07-13 20:36:33,848 - mmseg - INFO - Iter [1400/6000]   lr: 7.896e-03, eta: 1:02:54, time: 0.831, data_time: 0.395, memory: 7978, decode.loss_ce: 0.2929, decode.acc_seg: 87.1985, aux_0.loss_ce: 0.3459, aux_0.acc_seg: 86.4321, aux_1.loss_ce: 0.3365, aux_1.acc_seg: 85.6730, aux_2.loss_ce: 0.3985, aux_2.acc_seg: 81.4676, aux_3.loss_ce: 0.4780, aux_3.acc_seg: 77.3928, loss: 1.8518
2022-07-13 20:37:13,182 - mmseg - INFO - Iter [1450/6000]   lr: 7.820e-03, eta: 1:02:08, time: 0.787, data_time: 0.351, memory: 7978, decode.loss_ce: 0.2852, decode.acc_seg: 87.4626, aux_0.loss_ce: 0.3338, aux_0.acc_seg: 86.8644, aux_1.loss_ce: 0.3277, aux_1.acc_seg: 86.0022, aux_2.loss_ce: 0.3880, aux_2.acc_seg: 81.9574, aux_3.loss_ce: 0.4684, aux_3.acc_seg: 77.9577, loss: 1.8031
2022-07-13 20:37:52,526 - mmseg - INFO - Saving checkpoint at 1500 iterations
2022-07-13 20:37:52,811 - mmseg - INFO - Iter [1500/6000]   lr: 7.743e-03, eta: 1:01:23, time: 0.794, data_time: 0.352, memory: 7978, decode.loss_ce: 0.2887, decode.acc_seg: 87.4694, aux_0.loss_ce: 0.3361, aux_0.acc_seg: 86.9807, aux_1.loss_ce: 0.3320, aux_1.acc_seg: 85.9744, aux_2.loss_ce: 0.3936, aux_2.acc_seg: 81.8781, aux_3.loss_ce: 0.4738, aux_3.acc_seg: 77.7531, loss: 1.8243
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 17/17, 1.4 task/s, elapsed: 12s, ETA:     0s[                                                  ] 0/17, elapsed: 0s, ETA:
[>>                                ] 1/17, 0.3 task/s, elapsed: 3s, ETA:    47s
MengzhangLI commented 2 years ago

Thanks for your feedback. Perhaps it is caused by limited computational resources on your machine when working with the large 2048x2448 images. Does this error also happen with a smaller-sized dataset?

FranzEricSchneider commented 2 years ago

That shouldn't be an issue; the network is not actually training on the full-size images. In the dataset augmentation I have:

crop_size = (480, 512)
...
train_pipeline = [
    ...
    dict(type="RandomCrop", crop_size=crop_size, cat_max_ratio=0.75),
    ...
]

I've also checked that the training images actually use this smaller size. Sorry for not making that clear in the initial post.
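
As a sketch of what that size check can look like (the config path is a placeholder, and this assumes the default DefaultFormatBundle/Collect steps at the end of the train pipeline):

from mmcv import Config
from mmseg.datasets import build_dataset

# Placeholder path -- substitute the actual training config.
cfg = Config.fromfile("/mmsegmentation/configs/bisenetv2/my_6class_config.py")

dataset = build_dataset(cfg.data.train)
sample = dataset[0]

# With DefaultFormatBundle + Collect, 'img' and 'gt_semantic_seg' are
# DataContainers wrapping tensors; the spatial dims should match crop_size,
# e.g. (3, 480, 512) and (1, 480, 512), not the full 2048x2448 frame.
print(sample["img"].data.shape)
print(sample["gt_semantic_seg"].data.shape)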

FranzEricSchneider commented 2 years ago

Here's another example from this morning where the ERROR: Unexpected segmentation fault encountered in worker. message is the same, but the surrounding context is different.

2022-07-22 06:26:03,194 - mmseg - INFO - Iter [1050/6000]   lr: 8.428e-03, eta: 1:01:23, time: 2.551, data_time: 2.105, memory: 7978, decode.loss_ce: 0.3697, decode.acc_seg: 85.0045, aux_0.loss_ce: 0.4535, aux_0.acc_seg: 83.6738, aux_1.loss_ce: 0.4258, aux_1.acc_seg: 83.2825, aux_2.loss_ce: 0.4692, aux_2.acc_seg: 79.8277, aux_3.loss_ce: 0.5278, aux_3.acc_seg: 76.5396, loss: 2.2459
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "tools/train.py", line 242, in <module>
    main()
  File "tools/train.py", line 238, in main
    meta=meta)
  File "/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
    gt_semantic_seg)
  File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
    self.train_cfg)
  File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 265, in losses
    seg_logit, seg_label, ignore_index=self.ignore_index)
  File "/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3234) is killed by signal: Segmentation fault. 
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 153/153, 1.5 task/s, elapsed: 99s, ETA:     0sFailed to detect content-type automatically for artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.
Added application/json as content-type of artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.json.
FranzEricSchneider commented 2 years ago

The project where I was running into these errors is no longer active, so I don't have any new information. If anyone has any ideas, feel free to post, but I'll close this for now.