ultralytics / ultralytics

Ultralytics YOLO11 🚀
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Crashing with a large amount of background images #17234

Open eVen-gits opened 2 weeks ago

eVen-gits commented 2 weeks ago

Search before asking

Ultralytics YOLO Component

No response

Bug

I am working with large images, and I am using SAHI to slice them into tiles. This produces a COCO-style dataset.

This dataset is then converted to a YOLO dataset using the sahi coco yolov5 command, which produces a YOLO-style dataset that still needs some preprocessing/fixes. Here is the general procedure:

  1. Slice the dataset using sahi coco slice
  2. Transform the dataset format from COCO to YOLO using sahi coco yolov5
  3. Split the data into train/test/val using a custom script
  4. Fix the YOLO annotations using a custom script (subtract 1 from the class indices)
  5. Move files into the corresponding folders
  6. Update and rename data.yml to data.yaml and fix the paths
  7. Attempt to train.

The crash appears to be happening here:

[rank0]:   File "<path>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 113, in get_box_metrics
[rank0]:     bbox_scores[mask_gt] = pd_scores[ind[0], :, ind[1]][mask_gt]  # b, max_num_obj, h*w
Longer output (some repeated CUDA spam ommited) ``` >>> yolo train \ batch=64 device=0,1 imgsz=640 epochs=100 patience=100 \ model=yolo11l \ data=/data.yaml Ultralytics 8.3.24 πŸš€ Python-3.12.7 torch-2.4.1+cu121 CUDA:0 (NVIDIA RTX A6000, 48570MiB) CUDA:1 (NVIDIA RTX A6000, 48570MiB) engine/trainer: task=detect, mode=train, model=yolo11l, data=/runs/coco2yolov5/exp/data.yaml, epochs=100, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=False, device=(0, 1), workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/runs/detect/train2 Overriding model.yaml nc=80 with nc=3 from n params module arguments 0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2] 1 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2] 2 -1 2 173824 ultralytics.nn.modules.block.C3k2 [128, 256, 2, True, 0.25] 3 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2] 4 -1 2 691712 ultralytics.nn.modules.block.C3k2 [256, 512, 2, True, 0.25] 5 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2] 6 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True] 7 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2] 8 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True] 9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5] 10 -1 2 1455616 ultralytics.nn.modules.block.C2PSA [512, 512, 2] 11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest'] 12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1] 13 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True] 14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest'] 15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1] 16 -1 2 756736 ultralytics.nn.modules.block.C3k2 [1024, 256, 2, True] 17 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2] 18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1] 19 -1 2 2365440 ultralytics.nn.modules.block.C3k2 [768, 512, 2, True] 20 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2] 21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1] 22 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True] 23 [16, 19, 22] 1 1413337 ultralytics.nn.modules.head.Detect [3, [256, 512, 512]] YOLO11l summary: 631 layers, 25,312,793 
parameters, 25,312,777 gradients, 87.3 GFLOPs Transferred 1009/1015 items from pretrained weights DDP: debug command /usr/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 52895 /.config/Ultralytics/DDP/_temp_b9d5pmku125748658656832.py Ultralytics 8.3.24 πŸš€ Python-3.12.7 torch-2.4.1+cu121 CUDA:0 (NVIDIA RTX A6000, 48570MiB) CUDA:1 (NVIDIA RTX A6000, 48570MiB) Overriding model.yaml nc=80 with nc=3 Transferred 1009/1015 items from pretrained weights Freezing layer 'model.23.dfl.conv.weight' AMP: running Automatic Mixed Precision (AMP) checks... AMP: checks passed βœ… train: Scanning /runs/coco2yolov5/exp/train/labels.cache... 76000 images, 69060 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 76000/76000 [00:00/runs/coco2yolov5/exp/val/labels.cache... 9500 images, 8642 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9500/9500 [00:00/labels.jpg... optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0) Image sizes 640 train, 640 val Using 16 dataloader workers Logging results to /runs/detect/train2 Starting training for 100 epochs... Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size 1/100 20.9G 2.839 46.38 2.456 5 640: 4%|▍ | 47/1188 [00:48<18:51, 1.01it/s] ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [53,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [54,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [55,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [56,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [57,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [58,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [59,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [60,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [61,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [177,0,0], thread: [63,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. 
1/100 20.9G 2.311 21.17 2.053 15 640: 15%|β–ˆβ– | 174/1188 [02:55<17:02, 1.01s/it] [rank0]: Traceback (most recent call last): [rank0]: File ".config/Ultralytics/DDP/_temp_b9d5pmku125748658656832.py", line 13, in [rank0]: results = trainer.train() [rank0]: ^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 207, in train [rank0]: self._do_train(world_size) [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 385, in _do_train [rank0]: self.loss, self.loss_items = self.model(batch) [rank0]: ^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl [rank0]: return self._call_impl(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl [rank0]: return forward_call(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward [rank0]: else self._run_ddp_forward(*inputs, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward [rank0]: return self.module(*inputs, **kwargs) # type: ignore[index] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl [rank0]: return self._call_impl(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl [rank0]: return forward_call(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 111, in forward [rank0]: return self.loss(x, *args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 293, in loss [rank0]: return self.criterion(preds, batch) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/utils/loss.py", line 234, in __call__ [rank0]: _, target_bboxes, target_scores, fg_mask, _ = self.assigner( [rank0]: ^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl [rank0]: return self._call_impl(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl [rank0]: return forward_call(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context [rank0]: return func(*args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 72, in forward [rank0]: mask_pos, align_metric, overlaps = self.get_pos_mask( [rank0]: ^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 94, in get_pos_mask [rank0]: align_metric, overlaps = self.get_box_metrics(pd_scores, pd_bboxes, gt_labels, gt_bboxes, mask_in_gts * mask_gt) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File ".local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 
113, in get_box_metrics [rank0]: bbox_scores[mask_gt] = pd_scores[ind[0], :, ind[1]][mask_gt] # b, max_num_obj, h*w [rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^ [rank0]: RuntimeError: CUDA error: device-side assert triggered [rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1 [rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [67,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [68,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [69,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [70,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [71,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [72,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [73,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [74,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [75,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [76,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [77,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [78,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [79,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [80,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. 
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [81,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [82,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [83,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [84,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [85,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [86,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [87,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [88,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [89,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [90,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [91,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [92,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [176,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed. terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb0f2933f86 in .local/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb0f28e2d10 in .local/lib/python3.12/site-packages/torch/lib/libc10.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb0f2a0ef08 in .local/lib/python3.12/site-packages/torch/lib/libc10_cuda.so) frame #3: + 0x1da06 (0x7fb0f29d9a06 in .local/lib/python3.12/site-packages/torch/lib/libc10_cuda.so) frame #4: + 0x1f783 (0x7fb0f29db783 in .local/lib/python3.12/site-packages/torch/lib/libc10_cuda.so) frame #5: + 0x1fac2 (0x7fb0f29dbac2 in .local/lib/python3.12/site-packages/torch/lib/libc10_cuda.so) frame #6: + 0x5dd3d0 (0x7fb0f05dd3d0 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #7: + 0x6abdf (0x7fb0f2917bdf in .local/lib/python3.12/site-packages/torch/lib/libc10.so) frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fb0f2910c3b in .local/lib/python3.12/site-packages/torch/lib/libc10.so) frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb0f2910de9 in .local/lib/python3.12/site-packages/torch/lib/libc10.so) frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7fb0ddb263b4 in .local/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so) frame #11: std::_Sp_counted_ptr::_M_dispose() + 0x12 (0x7fb0f0d67c22 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fb0f04a3858 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #13: + 0xd72b91 (0x7fb0f0d72b91 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #14: + 0x4ae752 (0x7fb0f04ae752 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #15: + 0x4af751 (0x7fb0f04af751 in .local/lib/python3.12/site-packages/torch/lib/libtorch_python.so) frame #25: + 0x25e08 (0x7fb0f7034e08 in /usr/lib/libc.so.6) frame #26: __libc_start_main + 0x8c (0x7fb0f7034ecc in /usr/lib/libc.so.6) frame #27: _start + 0x25 (0x56daa4397045 in /usr/bin/python) W1029 10:54:04.754000 125546625919872 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 593969 closing signal SIGTERM E1029 10:54:05.319000 125546625919872 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 593968) of binary: /usr/bin/python Traceback (most recent call last): File "", line 198, in _run_module_as_main File "", line 88, in _run_code File ".local/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in main() File ".local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main run(args) File ".local/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run elastic_launch( File ".local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( 
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ .config/Ultralytics/DDP/_temp_b9d5pmku125748658656832.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-10-29_10:54:04 host : hostname rank : 0 (local_rank: 0) exitcode : -6 (pid: 593968) error_file: traceback : Signal 6 (SIGABRT) received by PID 593968 ============================================================ Traceback (most recent call last): File ".local/bin/yolo", line 8, in sys.exit(entrypoint()) ^^^^^^^^^^^^ File ".local/lib/python3.12/site-packages/ultralytics/cfg/__init__.py", line 826, in entrypoint getattr(model, mode)(**overrides) # default args from model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File ".local/lib/python3.12/site-packages/ultralytics/engine/model.py", line 802, in train self.trainer.train() File ".local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 202, in train raise e File ".local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 200, in train subprocess.run(cmd, check=True) File "/usr/lib/python3.12/subprocess.py", line 571, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '52895', '/home/cuda/.config/Ultralytics/DDP/_temp_b9d5pmku125748658656832.py']' returned non-zero exit status 1. Sentry is attempting to send 2 pending events Waiting up to 2 seconds Press Ctrl-C to quit ```

More information on the above procedures:

1) Slice the dataset

Using the command:

sahi coco slice \
--image_dir <data_path> \
--dataset_json_path <json_path> \
--slice_size 640 --overlap_ratio 0.2

2) Transform the dataset format from COCO to YOLO

sahi coco yolov5 \
--image_dir <slices_path> \
--dataset_json_path <sliced_coco_json_path> \
--train_split 1

3) Split the data in train/test/val


import os
import random
import shutil
import argparse
from tqdm import tqdm
import yaml

def split_data(dataset_dir, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    # compare with a tolerance; exact float equality can fail for otherwise valid ratios
    assert abs(train_ratio + val_ratio + test_ratio - 1) < 1e-6, "Ratios must add up to 1."

    # Get the list of all images in the directory
    images = [file for file in os.listdir(dataset_dir) if file.endswith('.jpg') or file.endswith('.png')]

    random.shuffle(images)  # shuffle the list

    # calculate the size of each subset
    train_size = int(train_ratio * len(images))
    val_size = int(val_ratio * len(images))

    # split the images list into three subsets
    train_images = images[:train_size]
    val_images = images[train_size:train_size+val_size]
    test_images = images[train_size+val_size:]

    # create directories for train, val, and test if not exist
    for subset in ["train", "val", "test"]:
        subset_dir = os.path.join(dataset_dir, subset)
        os.makedirs(subset_dir, exist_ok=True)

    # move the images to their respective directories
    for subset, images in zip(["train", "val", "test"], [train_images, val_images, test_images]):
        for image in tqdm(images, desc=f"Moving {subset} images"):  # use tqdm for progress tracking
            shutil.move(os.path.join(dataset_dir, image), os.path.join(dataset_dir, subset, image))
            if os.path.exists(os.path.join(dataset_dir, os.path.splitext(image)[0] + '.txt')):  # also move the annotation if exists
                shutil.move(os.path.join(dataset_dir, os.path.splitext(image)[0] + '.txt'), os.path.join(dataset_dir, subset, os.path.splitext(image)[0] + '.txt'))

if __name__ == "__main__":
    '''
    The script splits the dataset into train/val/test subsets.
    It takes the path to data.yaml file as an argument.
    '''
    parser = argparse.ArgumentParser(description='Split YOLO dataset into train/val/test.')
    parser.add_argument('data_yaml', type=str, help='Location of data.yaml')
    parser.add_argument('--train_ratio', type=float, default=0.8, help='Train set ratio (default: 0.8)')
    parser.add_argument('--val_ratio', type=float, default=0.1, help='Validation set ratio (default: 0.1)')
    parser.add_argument('--test_ratio', type=float, default=0.1, help='Test set ratio (default: 0.1)')

    args = parser.parse_args()

    with open(args.data_yaml, 'r') as file:
        data_yaml = yaml.safe_load(file)

    split_data(data_yaml['train'], args.train_ratio, args.val_ratio, args.test_ratio)
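
For reference, assuming the script above is saved as split_dataset.py (a placeholder filename), step 3 would be invoked along these lines:

python3 split_dataset.py <path>/data.yml --train_ratio 0.8 --val_ratio 0.1 --test_ratio 0.1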

4) Fix yolo annotations


from pathlib import Path
import argparse
from tqdm import tqdm

def modify_annotations(folder, add):
    folder_path = Path(folder)
    min_class = int(1e9)
    max_class = -int(1e9)
    n_files = 0
    for subdir in folder_path.iterdir():
        if subdir.is_dir():  # make sure it's a directory, not a file
            for file_path in subdir.rglob('*.txt'):  # go through all .txt files in this subdirectory
                lines = []
                with open(file_path, 'r') as file:
                    lines = file.readlines()

                for i, line in enumerate(lines):
                    elements = line.split()
                    val = int(elements[0])
                    elements[0] = str(val + add)
                    if val < min_class:
                        min_class = val
                    if val > max_class:
                        max_class = val
                    lines[i] = ' '.join(elements) + '\n'

                with open(file_path, 'w') as file:
                    file.writelines(lines)
                    n_files += 1
    print(f"Min class: {min_class}\nMax class, {max_class}\nNumber of files modified: {n_files}")

if __name__ == "__main__":
    '''
    The script recursively goes through folder and subfolders and modifies all .txt files in them.
    It adds a number to the class number in each line of the .txt files.
    '''
    parser = argparse.ArgumentParser(description='Modify class numbers in YOLO annotations.')
    parser.add_argument('folder', type=str, help='Parent folder location')
    parser.add_argument('add', type=int, nargs='?', default=-1, help='Number to add to class numbers (default: -1)')

    args = parser.parse_args()

    modify_annotations(args.folder, args.add)
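
Assuming the script is saved as fix_annotations.py (again a placeholder name), the class-index shift from step 4 would be applied with:

python3 fix_annotations.py <dataset_path> -1

i.e. subtracting 1 from every class index so the labels become 0-based.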

5) Move files into corresponding folders

This can be achieved with simple bash commands:

ulimit -s unlimited
mkdir test/images test/labels train/images train/labels val/images val/labels
mv test/*.jpg test/images
mv test/*.txt test/labels

mv train/*.jpg train/images
mv train/*.txt train/labels

mv val/*.jpg val/images
mv val/*.txt val/labels

6) Update and rename data.yml to data.yaml and fix paths

File contents should be something like this:

nc: 3
names:
    0: class_name1
    1: class_name2
    2: class_name3
train: <path>/train
test: <path>/test
val: <path>/val

The final structure should look something like this:

├── test
│   ├── images
│   └── labels
├── train
│   ├── images
│   └── labels
├── val
│   ├── images
│   └── labels
└── data.yaml

7) Attempt to train

yolo train \
batch=128 device=0,1 imgsz=640 epochs=100 patience=100 \
model=yolo11l \
data=<path>/data.yaml

This will produce the crash outlined above. Within the original sliced dataset, there are roughly this many images:

train: 76000 images, 69060 backgrounds, 0 corrupt: 100%|██████████| 76000/76000
val: 9500 images, 8642 backgrounds, 0 corrupt: 100%|██████████| 9500/9500

Now here is an interesting part:

If I add an additional preprocessing step that removes all background images from the sliced dataset, training works fine. This can be done with the cocojson library, like so:

python3 -m cocojson.run.remove_empty <annotations_json> --out <output_dataset_json_path>

This produces a significantly smaller subset, but one that contains no background images at all. I do not believe this is conceptually correct, since you also want negative samples in training to avoid a positive-sample bias, right?

The resulting dataset has this size:

train: 6915 images, 0 backgrounds, 0 corrupt: 100%|██████████| 6915/6915
val: 864 images, 0 backgrounds, 0 corrupt: 100%|██████████| 864/864

Please advise what can be done. Is this an issue with the size of the dataset, the ratio of positive/negative samples, or something else entirely?

Environment

Setup complete ✅ (128 CPUs, 503.5 GB RAM, 74.3/454.5 GB disk)

OS                  Linux-6.11.5-arch1-1-x86_64-with-glibc2.40
Environment         Linux
Python              3.12.7
Install             pip
RAM                 503.50 GB
Disk                74.3/454.5 GB
CPU                 AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU count           128
GPU                 NVIDIA RTX A6000, 48570MiB
GPU count           2
CUDA                12.1

numpy               ✅ 1.26.4>=1.23.0
matplotlib          ✅ 3.9.2>=3.3.0
opencv-python       ✅ 4.9.0.80>=4.6.0
pillow              ✅ 10.4.0>=7.1.2
pyyaml              ✅ 6.0.2>=5.3.1
requests            ✅ 2.32.3>=2.23.0
scipy               ✅ 1.14.1>=1.4.1
torch               ✅ 2.4.1>=1.8.0
torchvision         ✅ 0.19.1>=0.9.0
tqdm                ✅ 4.66.5>=4.64.0
psutil              ✅ 6.1.0
py-cpuinfo          ✅ 9.0.0
pandas              ✅ 2.2.2>=1.1.4
seaborn             ✅ 0.13.2>=0.11.0
ultralytics-thop    ✅ 2.0.5>=2.0.0
numpy               ✅ 1.26.4<2.0.0; sys_platform == "darwin"
torch               ✅ 2.4.1!=2.4.0,>=1.8.0; sys_platform == "win32"

Minimal Reproducible Example

Please see the code snippets above. It might be difficult to provide a sample dataset due to its size.

Additional

No response

Are you willing to submit a PR?

UltralyticsAssistant commented 2 weeks ago

👋 Hello @eVen-gits, thank you for your detailed report regarding the crash with a large amount of background images. We're excited to help you out! 🚀

For bug reports, it's very helpful if you can provide a minimum reproducible example. This will allow us to more effectively diagnose and resolve the issue.

To get started, please ensure you're using the latest version of the ultralytics package and all its requirements in a Python environment of version 3.8 or higher with PyTorch version 1.8 or higher:

pip install -U ultralytics

In case you are facing dataset-related issues with slicing or format transformations, please ensure your data preprocessing steps and outputs, like annotations, adhere closely to the expected formats after every stage. This can be key when errors appear unexpectedly during model training.

You can also try running your training in one of our recommended environments (with all dependencies preinstalled) to verify it's not an issue with your local setup.

Feel free to continue exploring the fantastic resources in our Docs, including specific guidance that might relate to your issue.

And remember, the Ultralytics community is here to support you! For real-time interaction, join us on Discord 🎧 or share your experience with others on Reddit.

This is an automated response, but an Ultralytics engineer will also look into your issue shortly. Meanwhile, additional insights you provide may speed up the process. Thanks again for reaching out! 😊

Y-T-G commented 2 weeks ago

Run it with CPU and post the error

eVen-gits commented 2 weeks ago

> Run it with CPU and post the error

That's a great idea; I hadn't thought of that.

It is worth noting that the GPU training command does not crash immediately; it takes some time. I am now running on the CPU, but it might take a while before it crashes, if it crashes at all, since CPU training is significantly slower.

Here's the launch command that I'm using:

yolo train \       
batch=64 device=cpu imgsz=640 epochs=100 patience=100 \
model=yolo11l \
data=<path>/data.yaml

Output:

Ultralytics 8.3.24 πŸš€ Python-3.12.7 torch-2.4.1+cu121 CPU (AMD Ryzen Threadripper PRO 3995WX 64-Cores)
engine/trainer: task=detect, mode=train, model=yolo11l, data=<path>/data.yaml, epochs=100, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=False, device=cpu, workers=8, project=None, name=train4, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=<path>/runs/detect/train4
Overriding model.yaml nc=80 with nc=3

                   from  n    params  module                                       arguments                     
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]                 
  1                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  2                  -1  2    173824  ultralytics.nn.modules.block.C3k2            [128, 256, 2, True, 0.25]     
  3                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
  4                  -1  2    691712  ultralytics.nn.modules.block.C3k2            [256, 512, 2, True, 0.25]     
  5                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
  6                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]           
  7                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
  8                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  2   1455616  ultralytics.nn.modules.block.C2PSA           [512, 512, 2]                 
 11                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 12             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 13                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]          
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 15             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 16                  -1  2    756736  ultralytics.nn.modules.block.C3k2            [1024, 256, 2, True]          
 17                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 18            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 19                  -1  2   2365440  ultralytics.nn.modules.block.C3k2            [768, 512, 2, True]           
 20                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
 21            [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 22                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]          
 23        [16, 19, 22]  1   1413337  ultralytics.nn.modules.head.Detect           [3, [256, 512, 512]]          
YOLO11l summary: 631 layers, 25,312,793 parameters, 25,312,777 gradients, 87.3 GFLOPs

Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
train: Scanning <path>/exp/train/labels.cache... 76000 images, 69060 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 76000/76000 [00:00<?, ?it/s]
val: Scanning<path>/exp/val/labels.cache... 9500 images, 8642 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9500/9500 [00:00<?, ?it/s]
Plotting labels to <path>/runs/detect/train4/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to <path>/runs/detect/train4
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100         0G      3.107      25.97      2.813          7        640:   1%|          | 6/1188 [03:41<12:06:52, 36.90s/it]

So far, it appears to be working. I will post the error in a separate comment, if it does eventually crash.

Y-T-G commented 2 weeks ago

I am guessing it crashes when a batch with no targets appears, i.e. all of the images in the batch are background.

eVen-gits commented 2 weeks ago

> I am guessing it crashes when a batch with no targets appears, i.e. all of the images in the batch are background.

Indeed, that could be the case.

I am now thinking that this could be tested by creating a dataset with only empty (background) images.
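
A rough sketch of how such a background-only test dataset could be assembled (all paths, counts and class names below are placeholders for my actual setup):

import random
import shutil
from pathlib import Path

import yaml

SRC = Path("<path>/exp/train")        # existing split with images/ and labels/ subfolders
DST = Path("<path>/background_only")  # new, tiny test dataset
N = 256                               # number of background tiles to copy

def is_background(img: Path) -> bool:
    # an image counts as background if its label file is missing or empty
    label = SRC / "labels" / (img.stem + ".txt")
    return not label.exists() or label.stat().st_size == 0

images = [p for p in (SRC / "images").glob("*.jpg") if is_background(p)]
random.shuffle(images)

for split in ("train", "val"):
    (DST / split / "images").mkdir(parents=True, exist_ok=True)
    (DST / split / "labels").mkdir(parents=True, exist_ok=True)

for img in images[:N]:
    split = "train" if random.random() < 0.9 else "val"  # most tiles to train, a few to val
    shutil.copy(img, DST / split / "images" / img.name)
    (DST / split / "labels" / (img.stem + ".txt")).touch()  # empty label file = background

# minimal data.yaml mirroring the 3-class setup of the full dataset
data = {
    "nc": 3,
    "names": {0: "class_name1", 1: "class_name2", 2: "class_name3"},
    "train": str(DST / "train"),
    "val": str(DST / "val"),
}
(DST / "data.yaml").write_text(yaml.safe_dump(data))

Training on that for a few iterations should show whether all-background batches alone are enough to trigger the crash.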

Still, in my case, I am not sure how this could be addressed. As mentioned above, there is a large number of images. I could write some scripts to limit the amount of background data, and maybe that would help, but I am only guessing here.

The thing is, I used this some time ago with similar datasets and didn't have issues. That's why this seems curious.

Additionally, if it all works fine on CPU, the issue might lie somewhere else.

I will see the results tomorrow once the CPU training has gone on for long enough and I'll report back.

glenn-jocher commented 2 weeks ago

It sounds like the issue might be related to batches with only background images. You could try reducing the number of background images to see if that resolves the problem. If CPU training works fine, it might indicate a GPU-specific issue. Let us know how it goes!

eVen-gits commented 1 week ago

> It sounds like the issue might be related to batches with only background images. You could try reducing the number of background images to see if that resolves the problem. If CPU training works fine, it might indicate a GPU-specific issue. Let us know how it goes!

Indeed. Is there a simple procedure to do this?

As I understand it, the ratio between labeled and background images should be comparable between the training set and the live data, right? I will try to limit the background images with a custom script to see if this is the problem.
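
Something along these lines is what I have in mind for that script (a rough sketch; the paths and the target ratio are placeholders):

import random
from pathlib import Path

SPLIT = Path("<path>/exp/train")  # a split folder with images/ and labels/ subfolders
MAX_BG_RATIO = 0.5                # target fraction of the split that may be background

def has_labels(img: Path) -> bool:
    label = SPLIT / "labels" / (img.stem + ".txt")
    return label.exists() and label.stat().st_size > 0

images = sorted((SPLIT / "images").glob("*.jpg"))
labeled = [p for p in images if has_labels(p)]
backgrounds = [p for p in images if not has_labels(p)]

# number of backgrounds to keep so that backgrounds / total <= MAX_BG_RATIO
keep = int(len(labeled) * MAX_BG_RATIO / (1 - MAX_BG_RATIO))
random.shuffle(backgrounds)

# move (rather than delete) the surplus background tiles so the step is reversible;
# the folder lives outside the split so the trainer does not pick the tiles up again
surplus_dir = SPLIT.parent / (SPLIT.name + "_removed_backgrounds")
surplus_dir.mkdir(exist_ok=True)
for img in backgrounds[keep:]:
    img.rename(surplus_dir / img.name)
    label = SPLIT / "labels" / (img.stem + ".txt")
    if label.exists():
        label.rename(surplus_dir / label.name)  # move the empty label file along with it

print(f"{len(labeled)} labeled images, kept {min(keep, len(backgrounds))} of {len(backgrounds)} backgrounds")

Deleting the labels.cache file for that split afterwards should make sure the new file list is picked up.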

I ran the CPU training overnight and, indeed, it also crashed, apparently at the same point in the code.

yolo train \       
batch=64 device=cpu imgsz=640 epochs=100 patience=100 \
model=yolo11l \
data=<paths>/runs/coco2yolov5/exp/data.yaml 

Ultralytics 8.3.24 πŸš€ Python-3.12.7 torch-2.4.1+cu121 CPU (AMD Ryzen Threadripper PRO 3995WX 64-Cores)
engine/trainer: task=detect, mode=train, model=yolo11l, data=<paths>//runs/coco2yolov5/exp/data.yaml, epochs=100, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=False, device=cpu, workers=8, project=None, name=train4, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=<paths>/runs/detect/train4
Overriding model.yaml nc=80 with nc=3

                   from  n    params  module                                       arguments                     
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]                 
  1                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  2                  -1  2    173824  ultralytics.nn.modules.block.C3k2            [128, 256, 2, True, 0.25]     
  3                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
  4                  -1  2    691712  ultralytics.nn.modules.block.C3k2            [256, 512, 2, True, 0.25]     
  5                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
  6                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]           
  7                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
  8                  -1  2   2234368  ultralytics.nn.modules.block.C3k2            [512, 512, 2, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  2   1455616  ultralytics.nn.modules.block.C2PSA           [512, 512, 2]                 
 11                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 12             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 13                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]          
 14                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 15             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 16                  -1  2    756736  ultralytics.nn.modules.block.C3k2            [1024, 256, 2, True]          
 17                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 18            [-1, 13]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 19                  -1  2   2365440  ultralytics.nn.modules.block.C3k2            [768, 512, 2, True]           
 20                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
 21            [-1, 10]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 22                  -1  2   2496512  ultralytics.nn.modules.block.C3k2            [1024, 512, 2, True]          
 23        [16, 19, 22]  1   1413337  ultralytics.nn.modules.head.Detect           [3, [256, 512, 512]]          
YOLO11l summary: 631 layers, 25,312,793 parameters, 25,312,777 gradients, 87.3 GFLOPs

Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
train: Scanning <paths>/runs/coco2yolov5/exp/train/labels.cache... 76000 images, 69060 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 76000/76000 [00:00<?, ?it/s]
val: Scanning <paths>/runs/coco2yolov5/exp/val/labels.cache... 9500 images, 8642 backgrounds, 0 corrupt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9500/9500 [00:00<?, ?it/s]
Plotting labels to <paths>/runs/detect/train4/labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to <paths>/runs/detect/train4
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100         0G      1.404      5.304      1.369          4        640:  57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 680/1188 [6:54:42<5:09:48, 36.59s/it] 
Traceback (most recent call last):
  File "<paths>/.local/bin/yolo", line 8, in <module>
    sys.exit(entrypoint())
             ^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/cfg/__init__.py", line 826, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/model.py", line 802, in train
    self.trainer.train()
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 207, in train
    self._do_train(world_size)
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 385, in _do_train
    self.loss, self.loss_items = self.model(batch)
                                 ^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 111, in forward
    return self.loss(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 293, in loss
    return self.criterion(preds, batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/loss.py", line 234, in __call__
    _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
                                                  ^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 72, in forward
    mask_pos, align_metric, overlaps = self.get_pos_mask(
                                       ^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 94, in get_pos_mask
    align_metric, overlaps = self.get_box_metrics(pd_scores, pd_bboxes, gt_labels, gt_bboxes, mask_in_gts * mask_gt)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 113, in get_box_metrics
    bbox_scores[mask_gt] = pd_scores[ind[0], :, ind[1]][mask_gt]  # b, max_num_obj, h*w
                           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: index 3 is out of bounds for dimension 1 with size 3
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
Y-T-G commented 1 week ago

This error usually occurs if you have some labels that use an invalid class index, for example class index 10 when your class indices only go from 0-5. Check all your txt files.

https://github.com/ultralytics/ultralytics/issues/472#issuecomment-1579145897
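
A quick way to check all the txt files could be something like this (a minimal sketch; the labels directory and class count are placeholders):

from pathlib import Path

LABELS_ROOT = Path("<path>/exp")  # scanned recursively for YOLO label .txt files
NC = 3                            # number of classes declared in data.yaml

bad = []
for txt in LABELS_ROOT.rglob("*.txt"):
    for line_no, line in enumerate(txt.read_text().splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        cls = int(float(parts[0]))
        if cls < 0 or cls >= NC:
            bad.append((txt, line_no, cls))

print(f"{len(bad)} out-of-range class indices found")
for txt, line_no, cls in bad[:20]:  # show the first few offenders
    print(f"{txt}:{line_no}: class {cls}")

With nc=3, any index outside 0-2 would be consistent with the "IndexError: index 3 is out of bounds for dimension 1 with size 3" from the CPU traceback above.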

eVen-gits commented 1 week ago

> This error usually occurs if you have some labels that use an invalid class index, for example class index 10 when your class indices only go from 0-5. Check all your txt files.

> #472 (comment)

Hey! Thanks for pointing it out.

That is correct. This is why I do some manual preprocessing; see step 4). The script reads all the annotations, adds/subtracts a value to/from the class index (depending on the launch argument), and outputs the min/max class ID at the end.

I have 3 classes in my dataset, my max class ID is 2, and my min ID is 0.

I am aware this is a known inconsistency between the COCO and YOLO dataset formats (or it could be SAHI causing it), but I have already checked this.

Y-T-G commented 1 week ago

Delete your labels.cache file and run again

eVen-gits commented 1 week ago

> Delete your labels.cache file and run again

Thanks for the recommendation. I have tried this multiple times, since I can (re)generate the dataset relatively easily. So far, I have two suspicions:

1) While the preprocessing step in YOLO recognizes foregrounds and backgrounds, it may not actually work well at runtime if the background ratio is too high (e.g. 90% backgrounds). In that case, a scenario can occur where an entire batch consists of background images only; the system then tries to find annotations, fails, and produces the error above.

2) Alternatively, the system works fine but does not play well with empty annotation files for background images. The end result might be the same, but it is possible it would behave better with no empty annotation files at all.

Currently, I have reduced the ratio of background files to 50% and the training is working as intended (~7000 training samples and ~7000 background images).

Later, I intend to try two additional scenarios:

1) Configure a very small batch (maybe even 1), so that a loaded batch is very likely to contain only background images, and see whether it crashes in that case (see the example command below).
2) Remove all empty annotation files and run again (with the full dataset, where the FG/BG ratio is still ~0.1).
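
For scenario 1, a quick check could be something like this (single GPU and a single epoch, just to see whether an all-background batch triggers the crash; the path is a placeholder):

yolo train batch=1 device=0 imgsz=640 epochs=1 model=yolo11l data=<path>/data.yaml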

glenn-jocher commented 1 week ago

It seems like reducing the background image ratio has helped. Testing with a small batch size and removing empty annotation files are good next steps. Let us know how it goes!