Closed — zacurr closed this issue 4 years ago
I just ran the command
./tools/dist_train.sh configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py 4
But I did not get this error. Does this error appear every time?
Yes. There is only a trivial difference between the commands you and I used, and under the fp32 (default) setting there is no error. I will try this on the other server (CUDA 10) when the GPUs are not busy, maybe in a week.
@zacurr, when I add a random_scale function in extra_aug.py, I encounter the same problem. I guess the reason is that my bboxes go out of range. Am I right? When I remove the random_scale function, the model trains normally.
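If out-of-range boxes are the suspicion, a quick way to confirm is to validate each image's boxes right after augmentation. A minimal sketch; the helper name and the (x1, y1, x2, y2) box layout are assumptions, not the repo's API:

```python
import numpy as np

def check_bboxes(bboxes, img_w, img_h):
    """Return indices of boxes that lie fully inside the image.

    Hypothetical helper: call it on gt_bboxes after random_scale /
    random_shift and log anything it drops.
    """
    bboxes = np.asarray(bboxes, dtype=np.float32)
    valid = (
        (bboxes[:, 0] >= 0)
        & (bboxes[:, 1] >= 0)
        & (bboxes[:, 2] <= img_w)
        & (bboxes[:, 3] <= img_h)
        & (bboxes[:, 2] > bboxes[:, 0])  # non-degenerate width
        & (bboxes[:, 3] > bboxes[:, 1])  # non-degenerate height
    )
    return np.flatnonzero(valid)
```

If this prints fewer indices than you have labels, the augmentation has pushed boxes outside the image without filtering the parallel arrays.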
@yhcao6, when will you release the data pipeline to master? I am waiting for this part. I have finished writing my own rotate function, but the scale part hits the above error. What should I do?
The data pipeline will not add extra operations such as rotation. There may be some problems in your code; could you give a minimal example to reproduce this error?
@yhcao6, the code is based on https://github.com/Paperspace/DataAugmentationForObjectDetection. I have checked it and the data looks fine after augmentation, but when I use random_shift and random_scale it does not work.
@yhcao6, I found the same problem reported in a PyTorch issue, https://github.com/pytorch/pytorch/issues/21136. What should I do to fix this bug?
Could you give me a minimal example to reproduce the bug? That way I can check whether something is wrong in your code, or whether there is a bug in this repo.
I have sent the code to your Gmail; waiting for your reply. Thanks.
@gittigxuy Have you fixed your problem? I met the same one when training on my custom dataset.
No. Did you change any other code, or did you get this error just by training your own data? I added some data augmentation functions and got this error; if I don't change the code, I can train normally.
Yes, I changed the code for text detection. I converted the labels into COCO format and used the original CocoDataset, and no error occurred. But when I modified the code to add random scale and random crop, the error appeared.
Same problem; waiting for the author to deal with it. I have sent the code to him.
Thx, if I fix it, I will tell you.
Which data augmentation did you add? Could I add your QQ or WeChat? I only added random_rotate and it works fine.
@gittigxuy Sorry for the late response! I have just solved the problem. It was caused by a mismatch among the numbers of gt_bboxes, gt_labels and gt_masks: I filtered out some bboxes outside the cropping range when applying the crop operation, but forgot to filter the gt_labels and gt_masks accordingly. I guess your problem has the same cause?
Thanks, maybe I have the same problem, so could you share your augmentation code with me? My email is 1262485779@qq.com
No problem @gittigxuy
I have sent the code to your Gmail; waiting for your reply. Thanks.
import numpy as np

def clip_box(bbox, clip_box, alpha):
    # note: the parameter shadows the function name (kept as in the original)
    ar_ = bbox_area(bbox)
    x_min = np.maximum(bbox[:, 0], clip_box[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_box[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_box[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_box[3]).reshape(-1, 1)
    bbox = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))
    # fraction of each box's area removed by the clip
    delta_area = (ar_ - bbox_area(bbox)) / ar_
    # keep only boxes that lost less than (1 - alpha) of their area
    mask = (delta_area < (1 - alpha)).astype(int)
    bbox = bbox[mask == 1, :]
    return bbox
This is the clip_box function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels.
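One way to avoid this mismatch is to return the keep mask from the clipping function so that gt_labels (and gt_masks) can be filtered by exactly the same rows. A hedged sketch of that idea, not the repo's actual code; the function name and signature are made up for illustration:

```python
import numpy as np

def bbox_area(bbox):
    # area of (x1, y1, x2, y2) boxes
    return (bbox[:, 2] - bbox[:, 0]) * (bbox[:, 3] - bbox[:, 1])

def clip_box_with_labels(bboxes, labels, clip_region, alpha):
    """Clip boxes to clip_region and drop the SAME rows from labels.

    Returns (clipped_boxes, kept_labels, keep_mask). Apply keep_mask to
    gt_masks as well so all parallel arrays stay the same length.
    """
    ar_ = bbox_area(bboxes)
    x_min = np.maximum(bboxes[:, 0], clip_region[0])
    y_min = np.maximum(bboxes[:, 1], clip_region[1])
    x_max = np.minimum(bboxes[:, 2], clip_region[2])
    y_max = np.minimum(bboxes[:, 3], clip_region[3])
    clipped = np.stack([x_min, y_min, x_max, y_max], axis=1)
    # fraction of each box's area removed by the clip
    delta_area = (ar_ - bbox_area(clipped)) / ar_
    keep = delta_area < (1 - alpha)
    return clipped[keep], labels[keep], keep
```

Because the same boolean mask filters every parallel array, len(gt_bboxes) == len(gt_labels) == len(gt_masks) holds after cropping, which is what the loss computation assumes.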
If I change my code to match yours but still get the same problem, what should I do?
I meet the same problem. After I added some data to the dataset, I get this error:

```
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565287025495/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f5083808e37 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x12e14 (0x7f5083a40e14 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x165bf (0x7f5083a445bf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f50837f3fa4 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x140fc34 (0x7f50868b8c34 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: + 0x31a4bf0 (0x7f508864dbf0 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: + 0x3756d12 (0x7f5088bffd12 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f5088bffdbf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: + 0x37739b1 (0x7f5088c1c9b1 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f50837f3f50 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #10: + 0x1bb014 (0x7f50aece0014 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x40142b (0x7f50aef2642b in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x401461 (0x7f50aef26461 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #28: __libc_start_main + 0xf0 (0x7f50bdd19830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```

This is my annotations:
I have already wasted 3 days, but I can't solve the problem. Can anybody help me? Thank you very much.
Maybe you can print the labels to make sure the maximum value is consistent with num_classes.
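A small sanity check along these lines can be run over the whole dataset before training. A sketch under the convention this repo used at the time, where labels are 1-based and 0 is background, so valid gt labels are 1 .. num_classes - 1 (an assumption; adjust to your config):

```python
import numpy as np

def check_label_range(labels, num_classes):
    """Return the gt labels that do not fit the classifier head.

    Hypothetical helper: assumes 1-based labels with 0 reserved for
    background, so valid values are 1 .. num_classes - 1.
    """
    labels = np.asarray(labels)
    bad = labels[(labels < 1) | (labels >= num_classes)]
    return bad
```

Any label this returns will trigger exactly this kind of device-side assert in the loss, so the check localizes the bad annotation before the CUDA error hides it.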
Error traceback
tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR
Because it is too long, I will paste it at the end.