open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

[fp16 training error] CUDA error: device-side assert triggered #911

Closed. zacurr closed this issue 4 years ago.

zacurr commented 5 years ago


**Error traceback**

  1. What command or script did you run?

     I ran the following command to train mask_rcnn_r50_fpn_fp16:

        NUM_GPUS=4
        CONFIG=mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py
        WORK_DIR=work_dirs/mask_rcnn_r50_fpn_fp16_1x

        tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR

  2. If applicable, paste the error traceback here using code blocks.

     Because it is too long, I will paste it at the end.


**Reproduction details**

1. Did you make any modifications on the code? Did you understand what you have modified?
   No.

2. What dataset did you use?
   COCO.

**Environment**

- OS: Ubuntu 16.04.4
- GCC: 5.4.0
- PyTorch version: 1.1.0
- How you installed PyTorch: conda (inside Docker)
- GPU model: V100 32GB (NVLink)
- CUDA and cuDNN version: CUDA 9.0, cuDNN 7

When I try to train the fp16 model, I get

    CUDA error: device-side assert triggered
    (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)

and many repetitions of the following message:

    /tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [80,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

When I comment out the fp16 configuration, the error does not occur:
https://github.com/open-mmlab/mmdetection/blob/master/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py#L2
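
(For reference, the fp16 configuration being commented out is the single dict near the top of that config; quoted here from memory, so the exact loss_scale value may differ by version:)

    # fp16 settings in configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py
    fp16 = dict(loss_scale=512.)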

**Error message**
/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
Directory exists
loading annotations into memory...
2019-07-02 00:56:29,056 - INFO - Distributed training: True
2019-07-02 00:56:29,549 - INFO - load model from: modelzoo://resnet50
loading annotations into memory...
2019-07-02 00:56:29,828 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer2.2.bn3.num_batches_tracked

loading annotations into memory...
loading annotations into memory...
Done (t=12.76s)
creating index...
Done (t=12.47s)
creating index...
index created!
Done (t=12.82s)
creating index...
index created!
index created!
Done (t=13.82s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=1.77s)
creating index...
index created!
Done (t=2.36s)
creating index...
Done (t=2.39s)
creating index...
index created!
index created!
Done (t=2.53s)
creating index...
index created!
2019-07-02 00:56:53,981 - INFO - Start running, host: root@b6940c72ef4f, work_dir: /home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
2019-07-02 00:56:53,981 - INFO - workflow: [('train', 1)], max: 12 epochs
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [96,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [97,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [5,0,0], thread: [98,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
... omitted...
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
main()
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
logger=logger)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
losses = model(**data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
scale_factor, cfg, rescale)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
self.target_stds, img_shape)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
main()
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
logger=logger)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
losses = model(**data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
scale_factor, cfg, rescale)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
self.target_stds, img_shape)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
main()
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
logger=logger)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
losses = model(**data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
scale_factor, cfg, rescale)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
self.target_stds, img_shape)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
main()
File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
logger=logger)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
losses = model(**data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
scale_factor, cfg, rescale)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
self.target_stds, img_shape)
File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fe9d572d66a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7fe9cf61b0e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fe9d571b661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fe9d4d160ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7fe9ed5f13fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7fe9ed810ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7fe9ed810b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x5575e53ecbbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x5575e53422a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x5575e534243a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x5575e5341c77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x5575e5341b07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5575e5387d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5575e539084f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x5575e53f6b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x5575e5461961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x5575e546beb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x5575e5333b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fea04b54830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x5575e54151a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f6c09f7766a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7f6c03e650e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7f6c09f65661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7f6c095600ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7f6c21e3b3fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7f6c2205aae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7f6c2205ab41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x56145130ebbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x5614512642a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x56145126443a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x561451263c77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x561451263b07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5614512a9d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5614512b284f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x561451318b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x561451383961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x56145138deb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x561451255b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7f6c3939e830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x5614513371a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7fa03010666a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7fa029ff40e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fa0300f4661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fa02f6ef0ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7fa047fca3fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7fa0481e9ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7fa0481e9b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x564f6c1dabbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x564f6c1302a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x564f6c13043a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x564f6c12fc77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x564f6c12fb07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x564f6c175d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x564f6c17e84f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x564f6c1e4b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x564f6c24f961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x564f6c259eb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x564f6c121b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fa05f52d830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x564f6c2031a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f23624b566a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x140e0 (0x7f235c3a30e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7f23624a3661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7f2361a9e0ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x1333fb (0x7f237a3793fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x352ae4 (0x7f237a598ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x352b41 (0x7f237a598b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x19dbbc (0x55f393649bbc in /opt/conda/bin/python)
frame #8: <unknown function> + 0xf32a8 (0x55f39359f2a8 in /opt/conda/bin/python)
frame #9: <unknown function> + 0xf343a (0x55f39359f43a in /opt/conda/bin/python)
frame #10: <unknown function> + 0xf2c77 (0x55f39359ec77 in /opt/conda/bin/python)
frame #11: <unknown function> + 0xf2b07 (0x55f39359eb07 in /opt/conda/bin/python)
frame #12: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #13: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #14: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #15: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #16: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #17: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #18: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #19: <unknown function> + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x55f3935e4d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x55f3935ed84f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x55f393653b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x55f3936be961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x55f3936c8eb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x55f393590b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7f23918dc830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1c61a8 (0x55f3936721a8 in /opt/conda/bin/python)

Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
__main__, mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py', '--local_rank=0', '/home/user/Desktop/workspace_zacurr/mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py', '--launcher', 'pytorch', '--validate', '--work_dir', '/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x']' died with <Signals.SIGABRT: 6>.
yhcao6 commented 5 years ago

I just ran the command

./tools/dist_train.sh configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py 4

But I did not get this error. Does this error appear every time?

zacurr commented 5 years ago

Yes. There is only a trivial difference between the command you used and mine, and under the fp32 (default) setting there is no error. I will try this on another server (CUDA 10) when the GPUs are not busy, maybe a week from now.

guaiwuguba commented 5 years ago

https://blog.csdn.net/sinat_29957455/article/details/95493564

gittigxuy commented 5 years ago

@zacurr, when I add a random_scale function in extra_aug.py, I encounter the same problem. I guess the reason is that my bboxes go out of range; am I right? When I remove the random_scale function, the model trains normally.

gittigxuy commented 5 years ago

@yhcao6, when will you release the data pipeline to master? I am waiting for that part. Writing my own functions, I have finished the rotate function, but the scale part hits the above error. What should I do?

yhcao6 commented 5 years ago

The data pipeline will not add extra operations such as rotation. There may be some problem in your code; could you give a minimal example to reproduce this error?

gittigxuy commented 5 years ago

@yhcao6, the code is based on https://github.com/Paperspace/DataAugmentationForObjectDetection, and I have checked it: the data looks fine after augmentation. When I use random_shift and random_scale it does not work.

gittigxuy commented 5 years ago

@yhcao6, I found the same problem reported in a PyTorch issue, https://github.com/pytorch/pytorch/issues/21136, so what should I do to fix this bug?

yhcao6 commented 5 years ago

Could you give me a minimal example to reproduce the bug, so that I can check whether there is something wrong in your code or whether there is a bug in this repo?

gittigxuy commented 5 years ago

I have sent the code to your Gmail and am waiting for your reply. Thanks.

BlakeXiaochu commented 5 years ago

@gittigxuy Have you fixed your problem? I met the same one when training on my custom dataset.

gittigxuy commented 5 years ago

No. Did you change any other code, or did you get this error just by training on your own data? I added some data augmentation functions and got this error; if I do not change the code, I can train normally.

BlakeXiaochu commented 5 years ago

Yes, I changed the code to apply it to text detection. I converted the labels into COCO format, used the original CocoDataset, and no error occurred. But when I modified the code to add random scale and random crop, the error appeared.

gittigxuy commented 5 years ago

Same problem here. I am waiting for the author to look into it; I have sent him the code.

BlakeXiaochu commented 5 years ago

Thanks, if I fix it, I will tell you.

gittigxuy commented 5 years ago

Which data augmentation did you add? Could I add you on QQ or WeChat? I just added random_rotate and it works fine.

BlakeXiaochu commented 5 years ago

@gittigxuy sorry for the late response! I have just solved the problem. I found it is caused by a mismatch among the numbers of gt_bboxes, gt_labels, and gt_masks: I filtered out some bboxes that fell outside the cropping range when applying the crop operation, but forgot to filter the corresponding gt_labels and gt_masks. So I guess your problem has the same cause?
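
A quick way to catch that kind of mismatch early, before it turns into a device-side assert, is to check the lengths right after the augmentation (a minimal sketch; the names are illustrative, not mmdetection's API):

    def check_gt_consistency(gt_bboxes, gt_labels, gt_masks=None):
        # Fail fast on the host side with a readable message instead of
        # a device-side assert later in the forward pass.
        assert len(gt_bboxes) == len(gt_labels), (
            'bbox/label mismatch: %d vs %d' % (len(gt_bboxes), len(gt_labels)))
        if gt_masks is not None:
            assert len(gt_masks) == len(gt_bboxes), (
                'mask/bbox mismatch: %d vs %d' % (len(gt_masks), len(gt_bboxes)))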

gittigxuy commented 5 years ago

Thanks, maybe I have the same problem. Could you share your augmentation code with me? My email is 1262485779@qq.com

BlakeXiaochu commented 5 years ago

No problem @gittigxuy

yhcao6 commented 5 years ago

> I have sent the code to your gmail, waiting for your reply. Thanks

import numpy as np  # the function below operates on (N, 4+) bbox arrays


def clip_box(bbox, clip_box, alpha):
    # bbox_area() is defined elsewhere in the same augmentation code and
    # returns the area of each (x1, y1, x2, y2) box. Note that the second
    # argument shadows the function name.
    ar_ = (bbox_area(bbox))
    x_min = np.maximum(bbox[:, 0], clip_box[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_box[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_box[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_box[3]).reshape(-1, 1)

    # Clip the boxes to the crop window, keeping any extra columns.
    bbox = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))

    # Fraction of each box's area lost to the clipping.
    delta_area = ((ar_ - bbox_area(bbox)) / ar_)

    # Keep only boxes that retain at least `alpha` of their area.
    mask = (delta_area < (1 - alpha)).astype(int)

    bbox = bbox[mask == 1, :]

    return bbox

This is the clip_box function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels.
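
One way to fix it while keeping that function's logic is to also return the keep mask, so the caller can filter gt_labels and gt_masks with the exact same mask (a sketch only; it still relies on the same bbox_area helper):

    import numpy as np

    def clip_box_with_mask(bbox, clip_window, alpha):
        # Same clipping logic as above, but the boolean keep mask is also
        # returned so labels and masks can be filtered consistently.
        ar_ = bbox_area(bbox)
        x_min = np.maximum(bbox[:, 0], clip_window[0]).reshape(-1, 1)
        y_min = np.maximum(bbox[:, 1], clip_window[1]).reshape(-1, 1)
        x_max = np.minimum(bbox[:, 2], clip_window[2]).reshape(-1, 1)
        y_max = np.minimum(bbox[:, 3], clip_window[3]).reshape(-1, 1)
        clipped = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))
        delta_area = (ar_ - bbox_area(clipped)) / ar_
        keep = delta_area < (1 - alpha)
        return clipped[keep], keep

    # Caller side (illustrative field names):
    # gt_bboxes, keep = clip_box_with_mask(gt_bboxes, crop_window, alpha)
    # gt_labels = gt_labels[keep]
    # gt_masks = gt_masks[keep]  # or [m for m, k in zip(gt_masks, keep) if k] for a list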

gittigxuy commented 5 years ago

If I change my code as you suggest but still get the same problem, what should I do?

SunNYNO1 commented 4 years ago

I meet the same problem. After I add some data into the dataset, I get this error:

    RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
    terminate called after throwing an instance of 'c10::Error'
    what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565287025495/work/c10/cuda/CUDACachingAllocator.cpp:569)
    frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f5083808e37 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #1: + 0x12e14 (0x7f5083a40e14 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    frame #2: + 0x165bf (0x7f5083a445bf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
    frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f50837f3fa4 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #4: + 0x140fc34 (0x7f50868b8c34 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
    frame #5: + 0x31a4bf0 (0x7f508864dbf0 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
    frame #6: + 0x3756d12 (0x7f5088bffd12 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
    frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f5088bffdbf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
    frame #8: + 0x37739b1 (0x7f5088c1c9b1 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
    frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f50837f3f50 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #10: + 0x1bb014 (0x7f50aece0014 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #11: + 0x40142b (0x7f50aef2642b in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
    frame #12: + 0x401461 (0x7f50aef26461 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

    frame #28: __libc_start_main + 0xf0 (0x7f50bdd19830 in /lib/x86_64-linux-gnu/libc.so.6)

    Aborted (core dumped)

This is my annotation (a PASCAL VOC style XML; the tags were stripped when pasting):

    VOC20202020_000001.jpgThe VOC2020 DatabasePASCAL VOC2020flickr05003753person0012203680

I have already wasted 3 days but cannot solve the problem. Can anybody help me? Thank you very much.

guaiwuguba commented 4 years ago

Maybe you can print the labels to make sure the maximum value is in line with num_classes.
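
For example, something along these lines run over the whole dataset (a minimal sketch, not mmdetection code):

    import numpy as np

    def check_label_range(all_gt_labels, num_classes):
        # Print the label range and fail fast if any index falls outside
        # [0, num_classes), which otherwise surfaces as a device-side assert.
        labels = np.concatenate([np.asarray(l).ravel() for l in all_gt_labels])
        print('label range: %d to %d' % (labels.min(), labels.max()))
        assert labels.min() >= 0 and labels.max() < num_classes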
