lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

Problem when training with custom dataset #99

Closed leonokida closed 9 months ago

leonokida commented 1 year ago

Hello, I'm trying to train the pytorch version with a custom dataset, but I'm having this problem:
```
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [24,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [25,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [30,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/home/coulombc/wheels_builder/tmp.29119/python-3.11/torch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Traceback (most recent call last):
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/train.py", line 48, in <module>
    main(args)
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/train.py", line 34, in main
    solver.fit()
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/../src/solver/det_solver.py", line 37, in fit
    train_stats = train_one_epoch(
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/../src/solver/det_engine.py", line 59, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/home/leonokid/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/../src/zoo/rtdetr/rtdetr_criterion.py", line 238, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/leonokid/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/leonokid/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/../src/zoo/rtdetr/matcher.py", line 99, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/lustre04/scratch/leonokid/rtdetr_pytorch/tools/../src/zoo/rtdetr/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
RuntimeError: CUDA error: device-side assert triggered
```

It seems related to these lines in `box_ops.py`:

```python
# degenerate boxes gives inf / nan results
# so do an early check
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
```

Any idea what's causing the problem and how to solve it? Thanks in advance!
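For context, that assert only holds for well-formed `xyxy` boxes (x2 ≥ x1 and y2 ≥ y1), so a box still in `cxcywh` or `xywh` layout that gets interpreted as `xyxy` easily trips it. A dependency-free sketch of the same check (the helper name is illustrative, not the repo's):

```python
def is_valid_xyxy(box):
    """True iff [x1, y1, x2, y2] describes a non-degenerate box."""
    x1, y1, x2, y2 = box
    return x2 >= x1 and y2 >= y1

print(is_valid_xyxy([0.2, 0.2, 0.6, 0.5]))  # True: proper xyxy corners
print(is_valid_xyxy([0.4, 0.4, 0.1, 0.1]))  # False: a cxcywh box read as xyxy
```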

lyuwenyu commented 1 year ago

You can set the device to `cpu` by default to identify the specific error:
https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetr_pytorch/src/core/config.py#L75
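The reason CPU-first debugging helps: a CUDA device-side assert surfaces as an opaque error somewhere downstream, while the same bad lookup on the host raises a readable `IndexError` at the faulty line. A dependency-free sketch (the shapes mirror the log later in this thread, and are assumptions here):

```python
# A plain list stands in for one row of the (num_queries, num_classes)
# logits tensor when the dataset has a single class:
out_prob_row = [0.7]        # valid column indices are only [0]
tgt_ids = [1, 1, 1, 1]      # labels exported starting at 1

error = None
try:
    picked = [out_prob_row[i] for i in tgt_ids]
except IndexError as e:
    # On CPU the failing index is reported directly; on CUDA you only
    # see "device-side assert triggered" from a later op.
    error = e

print("CPU run raises:", type(error).__name__)
```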

Borobo commented 1 year ago

I had the same issue; I thought it was because my bboxes were in xyxy format, while the COCO bbox format is xywh, so I changed it. But now I'm getting another error concerning `out_prob`.

lyuwenyu commented 1 year ago

> I had the same issue; I thought it was because my bboxes were in xyxy format, while the COCO bbox format is xywh, so I changed it. But now I'm getting another error concerning `out_prob`.

Yes. For bboxes, the data flow is: dataset -> normalized cxcywh -> model
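That flow can be sketched for one box, assuming a COCO-style `[x_min, y_min, w, h]` pixel box as input (the helper name is illustrative, not the repo's):

```python
def coco_xywh_to_normalized_cxcywh(box, img_w, img_h):
    """Convert a COCO [x_min, y_min, w, h] pixel box to the
    normalized [cx, cy, w, h] format the model consumes."""
    x, y, w, h = box
    return [(x + w / 2) / img_w,   # center x, as a fraction of image width
            (y + h / 2) / img_h,   # center y, as a fraction of image height
            w / img_w,             # width fraction
            h / img_h]             # height fraction

# A 100x50 box at (10, 20) in a 640x480 image:
print(coco_xywh_to_normalized_cxcywh([10, 20, 100, 50], 640, 480))
# -> [0.09375, 0.09375, 0.15625, 0.10416666666666667]
```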

Borobo commented 1 year ago

Thank you, it's working now for me !

leonokida commented 1 year ago

I set the device to cpu and now I get this:

```
    out_prob = out_prob[:, tgt_ids]
               ~~~~~~~~^^^^^^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
```

Could it have something to do with the dataset format? I'm exporting my dataset from Roboflow using the JSON COCO format.

lyuwenyu commented 1 year ago

You can add `print(out_prob.shape, tgt_ids)` before this line to get more information.

If your category ids start at 0, set `remap_mscoco_category: False` in the config.
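A quick way to check where an exported annotation file starts numbering its classes is to scan the `categories` section of the COCO JSON. A minimal sketch, assuming a COCO-style dict already loaded with `json.load` (the sample data is illustrative):

```python
def category_id_range(coco):
    """Return (min_id, max_id) over the 'categories' of a COCO-format dict."""
    ids = [c["id"] for c in coco["categories"]]
    return min(ids), max(ids)

# e.g. a one-class export that numbers its category from 1:
sample = {"categories": [{"id": 1, "name": "smoke", "supercategory": "none"}]}
print(category_id_range(sample))  # -> (1, 1): ids start at 1, not 0
```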

leonokida commented 1 year ago

The print outputs this:

```
torch.Size([1200, 1]) tensor([1, 1, 1, 1])
```

I have already set `remap_mscoco_category: False`, so I don't think that's the cause of the problem. Here is my config file, which lives in `configs/dataset/` (I renamed it to .txt so I could attach it here, but it's actually a yml): smoke.txt

lyuwenyu commented 1 year ago

If you have only 1 class, `tgt_ids` should be `tensor([0, 0, 0, 0])`.
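If the export numbers classes from 1 (as in the `tensor([1, 1, 1, 1])` above), one option besides the `remap_mscoco_category` flag is to shift the ids to a 0-based range before they reach the criterion. A hedged sketch with an illustrative helper name:

```python
def remap_to_zero_based(target_ids):
    """Map whatever category ids appear in the labels onto 0..N-1,
    preserving their relative order."""
    id_map = {orig: new for new, orig in enumerate(sorted(set(target_ids)))}
    return [id_map[i] for i in target_ids]

print(remap_to_zero_based([1, 1, 1, 1]))  # -> [0, 0, 0, 0]: single class
print(remap_to_zero_based([2, 5, 2]))     # -> [0, 1, 0]: two classes
```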

GiattiChen commented 4 months ago

> You can add `print(out_prob.shape, tgt_ids)` before this line to get more information.
>
> If your category ids start at 0, set `remap_mscoco_category: False` in the config.

That solved my problem, thanks a lot! I set `remap_mscoco_category` to True (because my indices start from 1), but my classes are quite different from COCO's. Will that influence the accuracy?