Closed: FranzEricSchneider closed this issue 2 years ago.
Thanks for your feedback. Perhaps it is caused by limited local computational resources when handling large 2048x2448 images. Does this error happen on a smaller-size dataset?
That shouldn't be an issue; the network is not actually training on the full images. In the dataset augmentation I have:
crop_size = (480, 512)
...
train_pipeline = [
    ...
    dict(type="RandomCrop", crop_size=crop_size, cat_max_ratio=0.75),
    ...
]
I've also checked that the training images actually use this smaller size. Sorry for not making that clear in the initial post.
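For reference, here's a sketch of the kind of check that confirms the pipeline output size, using the mmseg 0.x / mmcv 1.x APIs. CONFIG_PATH is a placeholder for the actual config file, and the expected shape assumes the usual DefaultFormatBundle/Collect steps at the end of the training pipeline:

```python
# Build the training dataset from the config and confirm that samples coming
# out of the augmentation pipeline are crop-sized, not the full 2048x2448.
from mmcv import Config
from mmseg.datasets import build_dataset

cfg = Config.fromfile("CONFIG_PATH")  # placeholder path
dataset = build_dataset(cfg.data.train)
sample = dataset[0]
# With DefaultFormatBundle/Collect in the pipeline, 'img' is a DataContainer
# wrapping a CxHxW tensor; expect something like (3, 480, 512) here.
print(sample["img"].data.shape)
```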
Here's another example from this morning. The ERROR: Unexpected segmentation fault encountered in worker. message appears the same, but the context is different.
2022-07-22 06:26:03,194 - mmseg - INFO - Iter [1050/6000] lr: 8.428e-03, eta: 1:01:23, time: 2.551, data_time: 2.105, memory: 7978, decode.loss_ce: 0.3697, decode.acc_seg: 85.0045, aux_0.loss_ce: 0.4535, aux_0.acc_seg: 83.6738, aux_1.loss_ce: 0.4258, aux_1.acc_seg: 83.2825, aux_2.loss_ce: 0.4692, aux_2.acc_seg: 79.8277, aux_3.loss_ce: 0.5278, aux_3.acc_seg: 76.5396, loss: 2.2459
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "tools/train.py", line 242, in <module>
main()
File "tools/train.py", line 238, in main
meta=meta)
File "/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
iter_runner(iter_loaders[i], **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 144, in forward_train
gt_semantic_seg)
File "/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 88, in _decode_head_forward_train
self.train_cfg)
File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 265, in losses
seg_logit, seg_label, ignore_index=self.ignore_index)
File "/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3234) is killed by signal: Segmentation fault.
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 153/153, 1.5 task/s, elapsed: 99s, ETA: 0s
Failed to detect content-type automatically for artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.
Added application/json as content-type of artifact /home/eric/Desktop/SEMSEGTEST/WORKDIR_1658470371250420/20220722_061257.log.json.
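One caveat about that traceback: the RuntimeError is raised by PyTorch's SIGCHLD handler in the main process, so the frames above it (accuracy.py and friends) only show where the main process happened to be when the worker died, not where the worker actually crashed. Below is a generic PyTorch sketch (not wired into mmseg, which builds its DataLoader internally) of how workers can be made to dump their own traceback on a segfault via faulthandler:

```python
# Enable faulthandler inside every DataLoader worker so a crashing worker
# prints its own Python traceback to stderr before being killed.
import faulthandler

import torch
from torch.utils.data import Dataset, DataLoader


class Dummy(Dataset):
    """Stand-in dataset; the real one would be the mmseg training dataset."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.zeros(1)


def enable_faulthandler(worker_id):
    # Installed per worker: on SIGSEGV/SIGABRT/etc. the worker dumps the
    # traceback of all of its threads instead of dying silently.
    faulthandler.enable(all_threads=True)


if __name__ == "__main__":
    loader = DataLoader(Dummy(), batch_size=2, num_workers=2,
                        worker_init_fn=enable_faulthandler)
    for batch in loader:
        pass
```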
The project where I was running into these crashes is no longer active, so I don't have any new information. If anyone has any ideas, feel free to post, but I'll close this now.
Describe the bug
I am trying to run mmsegmentation training repeatedly in the mmseg Docker container, with variations, on the same set of images: a dataset of my own labeled images. Every so often the training fails partway through with a segmentation fault error. Note that this is the same system as https://github.com/open-mmlab/mmsegmentation/issues/1806 but with a different error, so a lot of the information will be the same.
Reproduction
I am running
python tools/train.py /mmsegmentation/configs/{model_name}/{MCFG} --work-dir {DATA}{workdir}/
with a variety of model configs and unique work dirs. I do not know how to reproduce the error; it only appears intermittently, and it has shown up 3 times in 51 runs (most with 6k iterations, a few with 30k). I don't really have a clue as to how to begin debugging this, so any suggestions would be appreciated (one thing I might try is sketched right after this paragraph). No modifications have been made to the mmsegmentation code itself, but as I've been testing variations I have modified both the dataset config (trying a variety of augmentations) and the model config (BiSeNet-v2 and Segformer). I believe I understand the modifications made; in support of that, training runs to completion ~95% of the time and fails as described below the other ~5%.
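One way to start localizing the crash (a sketch, not something I have verified on this project) is to disable worker processes while debugging, so a segfault in data loading surfaces directly in the main process instead of as a killed worker. In the mmseg config that is just the data dict (the samples_per_gpu value here is illustrative, not my actual setting):

```python
# Debugging sketch: workers_per_gpu=0 makes the DataLoader run in the main
# process, so a data-loading crash gives a direct traceback rather than
# "DataLoader worker ... killed by signal". The rest of the data dict
# (train/val/test entries) stays as-is.
data = dict(
    samples_per_gpu=2,   # illustrative; keep whatever batch size is in use
    workers_per_gpu=0,   # no worker processes while debugging
)
```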
I have a custom-labeled dataset of 2048x2448 images, 128 of them in img_dir/train/. It has 6 classes, and the decode heads of the model configs have been modified to reflect that.
Environment
Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here. You may add other details that could help locate the problem, such as relevant environment variables ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.).
Error traceback
Bug fix
I do not have a bug fix. However, I think there's something really interesting, which is that the validation checking (17 images every 500 iters in the run posted above) appears to have a lagging image. You can see that it processes up to 16/17, then does a whole bunch of other training, then processes 17, then crashes.
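For what it's worth, that lag may simply be DataLoader prefetching: with num_workers > 0, worker processes fetch samples a few batches ahead of where the main loop consumes them, so the point at which an image is loaded can be far from the point at which it shows up in the progress output. A toy sketch, independent of mmseg, that makes the effect visible:

```python
# Toy demo of DataLoader prefetching: the workers' "fetched" messages run
# ahead of the main loop's "consumed" messages, so the last sample can be
# fetched well before it is actually consumed.
import torch
from torch.utils.data import Dataset, DataLoader


class Chatty(Dataset):
    def __len__(self):
        return 17  # same count as the validation set above

    def __getitem__(self, idx):
        print(f"worker fetched sample {idx}", flush=True)
        return torch.zeros(1)


if __name__ == "__main__":
    loader = DataLoader(Chatty(), batch_size=1, num_workers=2)
    for i, _ in enumerate(loader):
        print(f"main loop consumed sample {i}", flush=True)
```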
That said, I went back to a passing training run and it also exhibits this behavior. That run went to completion: