Open eVen-gits opened 2 weeks ago
👋 Hello @eVen-gits, thank you for your detailed report regarding the crash with a large number of background images. We're excited to help you out! 🚀
For bug reports, it's very helpful if you can provide a minimum reproducible example. This will allow us to more effectively diagnose and resolve the issue.
To get started, please ensure you're using the latest version of the ultralytics package and all its requirements in a Python>=3.8 environment with PyTorch>=1.8:
pip install -U ultralytics
In case you are facing dataset-related issues with slicing or format transformations, please ensure your data preprocessing steps and outputs, like annotations, adhere closely to the expected formats after every stage. This can be key when errors appear unexpectedly during model training.
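For example, a quick per-split summary like the minimal sketch below (the dataset root and the images/labels layout are assumptions, not taken from your setup) can be re-run after each preprocessing stage to confirm image/label counts and how many label files are empty:

```python
from pathlib import Path

# Summarize a YOLO-style dataset: images, label files and empty (background)
# labels per split. The root path and folder layout are assumptions.
root = Path("<path>/exp")  # placeholder dataset root
for split in ("train", "val", "test"):
    images = list((root / split / "images").glob("*"))
    labels = list((root / split / "labels").glob("*.txt"))
    empty = sum(1 for lbl in labels if lbl.stat().st_size == 0)
    print(f"{split}: {len(images)} images, {len(labels)} label files, {empty} empty (background) labels")
```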
You can also try running your training in one of our recommended environments to verify it's not an issue with your local setup. Here are some options with all dependencies preinstalled:
Feel free to continue exploring the fantastic resources in our Docs, including specific guidance that might relate to your issue.
And remember, the Ultralytics community is here to support you! For real-time interaction, join us on Discord 🎧 or share your experience with others on Reddit.
This is an automated response, but an Ultralytics engineer will also look into your issue shortly. Meanwhile, additional insights you provide may speed up the process. Thanks again for reaching out! 🚀
Run it with CPU and post the error
> Run it with CPU and post the error
That's a great idea; I hadn't thought of that.
It is worth noting that the GPU training command does not crash immediately; it takes some time. I am now running on CPU, but it might take a while before it crashes, if it crashes at all, since CPU training is significantly slower.
Here's the launch command that I'm using:
yolo train \
batch=64 device=cpu imgsz=640 epochs=100 patience=100 \
model=yolo11l \
data=<path>/data.yaml
Output:
Ultralytics 8.3.24 🚀 Python-3.12.7 torch-2.4.1+cu121 CPU (AMD Ryzen Threadripper PRO 3995WX 64-Cores)
engine/trainer: task=detect, mode=train, model=yolo11l, data=<path>/data.yaml, epochs=100, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=False, device=cpu, workers=8, project=None, name=train4, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=<path>/runs/detect/train4
Overriding model.yaml nc=80 with nc=3
from n params module arguments
0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2]
1 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
2 -1 2 173824 ultralytics.nn.modules.block.C3k2 [128, 256, 2, True, 0.25]
3 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
4 -1 2 691712 ultralytics.nn.modules.block.C3k2 [256, 512, 2, True, 0.25]
5 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
6 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
7 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
8 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 2 1455616 ultralytics.nn.modules.block.C2PSA [512, 512, 2]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 2 756736 ultralytics.nn.modules.block.C3k2 [1024, 256, 2, True]
17 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 2 2365440 ultralytics.nn.modules.block.C3k2 [768, 512, 2, True]
20 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
23 [16, 19, 22] 1 1413337 ultralytics.nn.modules.head.Detect [3, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,312,793 parameters, 25,312,777 gradients, 87.3 GFLOPs
Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
train: Scanning <path>/exp/train/labels.cache... 76000 images, 69060 backgrounds, 0 corrupt: 100%|██████████| 76000/76000 [00:00<?, ?it/s]
val: Scanning <path>/exp/val/labels.cache... 9500 images, 8642 backgrounds, 0 corrupt: 100%|██████████| 9500/9500 [00:00<?, ?it/s]
Plotting labels to <path>/runs/detect/train4/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to <path>/runs/detect/train4
Starting training for 100 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/100 0G 3.107 25.97 2.813 7 640: 1%| | 6/1188 [03:41<12:06:52, 36.90s/it]
So far, it appears to be working. I will post the error in a separate comment if it does eventually crash.
I am guessing it crashes when a batch with no targets appears, i.e. all of the images in the batch are background.
> I am guessing it crashes when a batch with no targets appears, i.e. all of the images in the batch are background.
Indeed, that could be the case.
I am now thinking that this could be tested by creating a dataset with only empty (background) images.
Still, in my case, I am not sure how this could be addressed. As mentioned above, there is a large number of images. I could write a script to limit the amount of background data, and maybe that would help, but I am only guessing here.
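As a rough back-of-envelope estimate (ignoring mosaic augmentation, which mixes several images per sample), with ~69060/76000 ≈ 91% background images and batch=64, an all-background batch should be rare but still expected a few times per epoch:

```python
# Back-of-envelope: probability that a randomly drawn batch is all background.
# Assumes independent sampling and ignores mosaic augmentation.
p_background = 69060 / 76000      # ~0.909 background fraction (train split)
batch_size = 64
batches_per_epoch = 1188          # from the training progress bar

p_all_background = p_background ** batch_size              # ~0.002
expected_per_epoch = p_all_background * batches_per_epoch  # ~2-3

print(f"P(all-background batch) ≈ {p_all_background:.4f}")
print(f"expected all-background batches per epoch ≈ {expected_per_epoch:.1f}")
```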
The thing is, I used this some time ago with similar datasets and didn't have any issues, which is why this seems curious.
Additionally, if it all works fine on CPU, the issue might lie somewhere else.
I will see the results tomorrow once the CPU training has gone on for long enough and I'll report back.
It sounds like the issue might be related to batches with only background images. You could try reducing the number of background images to see if that resolves the problem. If CPU training works fine, it might indicate a GPU-specific issue. Let us know how it goes!
> It sounds like the issue might be related to batches with only background images. You could try reducing the number of background images to see if that resolves the problem. If CPU training works fine, it might indicate a GPU-specific issue. Let us know how it goes!
Indeed. Is there a simple procedure to do this?
As I understand it, the ratio between labeled and background images should be comparable between the training set and the live data, right? I will try to limit the background images with a custom script to see if this is the problem.
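For reference, a minimal sketch of such a script (hypothetical; it assumes a YOLO-style split folder with images/ and labels/ subfolders, treats an image with a missing or empty .txt as background, and moves excess backgrounds aside to reach a target background:foreground ratio):

```python
import random
from pathlib import Path

# Subsample background images to a target background:foreground ratio
# by moving the excess out of the split. Paths and layout are assumptions.
split_dir = Path("<path>/exp/train")       # placeholder split folder
target_bg_ratio = 1.0                      # 1 background per labeled image
excess_dir = split_dir / "images_bg_excess"
excess_dir.mkdir(exist_ok=True)

foreground, background = [], []
for img in (split_dir / "images").glob("*"):
    lbl = split_dir / "labels" / (img.stem + ".txt")
    is_bg = not lbl.exists() or lbl.stat().st_size == 0
    (background if is_bg else foreground).append(img)

random.seed(0)
random.shuffle(background)
keep = int(len(foreground) * target_bg_ratio)
for img in background[keep:]:
    img.rename(excess_dir / img.name)      # park excess backgrounds outside images/

print(f"kept {len(foreground)} labeled + {min(keep, len(background))} background images")
```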
I ran the CPU training overnight and, indeed, it also crashed, apparently at the same point in the code.
yolo train \
batch=64 device=cpu imgsz=640 epochs=100 patience=100 \
model=yolo11l \
data=<paths>/runs/coco2yolov5/exp/data.yaml
Ultralytics 8.3.24 🚀 Python-3.12.7 torch-2.4.1+cu121 CPU (AMD Ryzen Threadripper PRO 3995WX 64-Cores)
engine/trainer: task=detect, mode=train, model=yolo11l, data=<paths>//runs/coco2yolov5/exp/data.yaml, epochs=100, time=None, patience=100, batch=64, imgsz=640, save=True, save_period=-1, cache=False, device=cpu, workers=8, project=None, name=train4, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=True, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, copy_paste_mode=flip, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=<paths>/runs/detect/train4
Overriding model.yaml nc=80 with nc=3
from n params module arguments
0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2]
1 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
2 -1 2 173824 ultralytics.nn.modules.block.C3k2 [128, 256, 2, True, 0.25]
3 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
4 -1 2 691712 ultralytics.nn.modules.block.C3k2 [256, 512, 2, True, 0.25]
5 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
6 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
7 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
8 -1 2 2234368 ultralytics.nn.modules.block.C3k2 [512, 512, 2, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 2 1455616 ultralytics.nn.modules.block.C2PSA [512, 512, 2]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 2 756736 ultralytics.nn.modules.block.C3k2 [1024, 256, 2, True]
17 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 2 2365440 ultralytics.nn.modules.block.C3k2 [768, 512, 2, True]
20 -1 1 2360320 ultralytics.nn.modules.conv.Conv [512, 512, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 2 2496512 ultralytics.nn.modules.block.C3k2 [1024, 512, 2, True]
23 [16, 19, 22] 1 1413337 ultralytics.nn.modules.head.Detect [3, [256, 512, 512]]
YOLO11l summary: 631 layers, 25,312,793 parameters, 25,312,777 gradients, 87.3 GFLOPs
Transferred 1009/1015 items from pretrained weights
Freezing layer 'model.23.dfl.conv.weight'
train: Scanning <paths>/runs/coco2yolov5/exp/train/labels.cache... 76000 images, 69060 backgrounds, 0 corrupt: 100%|██████████| 76000/76000 [00:00<?, ?it/s]
val: Scanning <paths>/runs/coco2yolov5/exp/val/labels.cache... 9500 images, 8642 backgrounds, 0 corrupt: 100%|██████████| 9500/9500 [00:00<?, ?it/s]
Plotting labels to <paths>/runs/detect/train4/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 167 weight(decay=0.0), 174 weight(decay=0.0005), 173 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to <paths>/runs/detect/train4
Starting training for 100 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/100 0G 1.404 5.304 1.369 4 640: 57%|██████ | 680/1188 [6:54:42<5:09:48, 36.59s/it]
Traceback (most recent call last):
File "<paths>/.local/bin/yolo", line 8, in <module>
sys.exit(entrypoint())
^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/cfg/__init__.py", line 826, in entrypoint
getattr(model, mode)(**overrides) # default args from model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/model.py", line 802, in train
self.trainer.train()
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 207, in train
self._do_train(world_size)
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/engine/trainer.py", line 385, in _do_train
self.loss, self.loss_items = self.model(batch)
^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 111, in forward
return self.loss(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/nn/tasks.py", line 293, in loss
return self.criterion(preds, batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/loss.py", line 234, in __call__
_, target_bboxes, target_scores, fg_mask, _ = self.assigner(
^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 72, in forward
mask_pos, align_metric, overlaps = self.get_pos_mask(
^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 94, in get_pos_mask
align_metric, overlaps = self.get_box_metrics(pd_scores, pd_bboxes, gt_labels, gt_bboxes, mask_in_gts * mask_gt)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<paths>/.local/lib/python3.12/site-packages/ultralytics/utils/tal.py", line 113, in get_box_metrics
bbox_scores[mask_gt] = pd_scores[ind[0], :, ind[1]][mask_gt] # b, max_num_obj, h*w
~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: index 3 is out of bounds for dimension 1 with size 3
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
This error usually occurs if you have some labels that are using an invalid class index. For example, using class index 10 when your class indices are from 0-5. Check all your txt files.
https://github.com/ultralytics/ultralytics/issues/472#issuecomment-1579145897
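A minimal sketch of such a check (assuming standard YOLO txt labels where the first field on each line is the class index; the labels path is a placeholder):

```python
from pathlib import Path

# Scan YOLO label files for class indices outside the valid range [0, nc-1].
nc = 3                                         # number of classes from data.yaml
labels_dir = Path("<path>/exp/train/labels")   # placeholder path

bad = []
for lbl in labels_dir.glob("*.txt"):
    for line_no, line in enumerate(lbl.read_text().splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines (background label files may be empty)
        cls = int(float(line.split()[0]))
        if not 0 <= cls < nc:
            bad.append((lbl.name, line_no, cls))

print(f"{len(bad)} out-of-range class indices found")
for name, line_no, cls in bad[:20]:
    print(f"  {name}:{line_no} -> class {cls}")
```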
> This error usually occurs if you have some labels that are using an invalid class index. For example, using class index 10 when your class indices are from 0-5. Check all your txt files.
Hey! Thanks for pointing it out.
That is correct. This is why I do some manual preprocessing, see step 4). The script reads all the annotations, adds to or subtracts from the class ID (depending on launch arguments), and outputs the min/max class ID at the end.
I have 3 classes in my dataset, my max class ID is 2, and my min ID is 0.
I am aware of this inconsistency between the COCO and YOLO datasets (or it could be SAHI causing it), but I have already checked this.
Delete your labels.cache file and run again
> Delete your labels.cache file and run again
Thanks for the recommendation. I have tried this multiple times, since I can (re)generate the dataset relatively easily. So far, I have two suspicions:
1) While the preprocessing step in YOLO might recognize foregrounds and backgrounds, it may not handle them well at runtime if the background ratio is too high (e.g. 90% backgrounds). A scenario can then arise where an entire batch consists of background images only; the system tries to find annotations, fails, and produces the above error.
2) Alternatively, the system works fine in general but doesn't play well with empty annotation files for background images. The end result might be the same, but it is possible it would behave better if there were no empty annotation files at all.
Currently, I have reduced the ratio of background files to 50% and the training is working as intended (~7000 training samples and ~7000 background images).
Later, I intend to try two additional scenarios: 1) Configure a very small batch (maybe even 1), so that when loading it is highly likely to get only a background image, and see if it crashes in that case. 2) Remove all empty annotation files and run again (with the full dataset, where the FG/BG ratio is still 0.1); a quick sketch for this is below.
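For scenario 2, a minimal sketch for removing the empty annotation files (hypothetical; it assumes a labels/ folder of YOLO .txt files and deletes the ones that are empty or whitespace-only):

```python
from pathlib import Path

# Delete empty (background) YOLO label files so background images
# have no annotation file at all. The labels path is a placeholder.
labels_dir = Path("<path>/exp/train/labels")

removed = 0
for lbl in labels_dir.glob("*.txt"):
    if not lbl.read_text().strip():  # zero-length or whitespace-only
        lbl.unlink()
        removed += 1
print(f"removed {removed} empty label files")
```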
It seems like reducing the background image ratio has helped. Testing with a small batch size and removing empty annotation files are good next steps. Let us know how it goes!
Search before asking
Ultralytics YOLO Component
No response
Bug
I am working with large images, and I am using sahi to slice these images into tiles. This produces a COCO style dataset.
This dataset is then transformed to a `yolo` dataset, using the `sahi coco yolov5` command. This produces a yolo-style dataset that still needs preprocessing/fixes. The general procedure (detailed step by step below) is: slice the dataset with sahi, convert it to YOLO format with `sahi coco yolov5`, split it into train/test/val, fix the yolo annotations, move the files into the corresponding folders, and update and rename `data.yml` to `data.yaml` and fix paths.
The crash appears to be happening here:
Longer output (some repeated CUDA spam omitted):
```
>>> yolo train \
    batch=64 device=0,1 imgsz=640 epochs=100 patience=100 \
    model=yolo11l \
    data=<path>/data.yaml
```
More information on above procedures:
1) Slice the dataset
Using the sahi slicing command.
2) Transform the dataset format from COCO to YOLO
Using the `sahi coco yolov5` command.
3) Split the data into train/test/val
4) Fix yolo annotations
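For reference, a sketch of the kind of annotation fix-up script used here (hypothetical; it assumes standard YOLO txt labels, shifts every class index by an offset given on the command line, and prints the resulting min/max class IDs):

```python
import sys
from pathlib import Path

# Shift every class index in the YOLO label files by an offset
# (e.g. -1 to go from 1-based to 0-based) and report the min/max class ID.
labels_dir = Path(sys.argv[1])   # e.g. <path>/exp/train/labels
offset = int(sys.argv[2])        # e.g. -1

seen = []
for lbl in labels_dir.glob("*.txt"):
    fixed = []
    for line in lbl.read_text().splitlines():
        if not line.strip():
            continue
        cls, *rest = line.split()
        cls = int(float(cls)) + offset
        seen.append(cls)
        fixed.append(" ".join([str(cls), *rest]))
    lbl.write_text("\n".join(fixed) + ("\n" if fixed else ""))

if seen:
    print(f"min class ID: {min(seen)}, max class ID: {max(seen)}")
```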
5) Move files into corresponding folders
This can be achieved with simple bash commands.
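For illustration, one way to do this in Python (a sketch only; it assumes per-split filename lists train.txt/val.txt/test.txt and flat images/ and labels/ source folders, which may not match the actual layout):

```python
import shutil
from pathlib import Path

# Move each split's images and matching YOLO label files into
# <split>/images and <split>/labels. Source layout is an assumption.
root = Path("<path>/exp")  # placeholder dataset root
for split in ("train", "val", "test"):
    names = (root / f"{split}.txt").read_text().split()  # image filenames for this split
    for sub in ("images", "labels"):
        (root / split / sub).mkdir(parents=True, exist_ok=True)
    for name in names:
        img = root / "images" / name
        lbl = root / "labels" / (Path(name).stem + ".txt")
        shutil.move(str(img), str(root / split / "images" / img.name))
        if lbl.exists():  # background images may have no label file
            shutil.move(str(lbl), str(root / split / "labels" / lbl.name))
```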
6) Update and rename `data.yml` to `data.yaml` and fix paths
File contents should be something like this:
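(A hypothetical example with placeholder class names; the dataset here has three classes, and the real paths and names will differ.)

```yaml
# Hypothetical data.yaml -- paths and class names are placeholders.
path: <path>/exp   # dataset root
train: train/images
val: val/images
test: test/images

nc: 3
names: ["class_0", "class_1", "class_2"]
```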
The final structure should look something like this:
├── test
│   ├── images
│   └── labels
├── train
│   ├── images
│   └── labels
├── val
│   ├── images
│   └── labels
└── data.yaml
7) Attempt to train
This will produce the crash outlined above. Within the original sliced dataset, there are roughly this many images:
train: 76000 images, 69060 backgrounds, 0 corrupt: 100%|██████████| 76000/76000
val: 9500 images, 8642 backgrounds, 0 corrupt: 100%|██████████| 9500/9500
Now here is an interesting part:
If I add an additional step in preprocessing, removing all background images from the sliced dataset, the training works fine. This can be done using the cocojson library, like so:
python3 -m cocojson.run.remove_empty <annotations_json> --out <output_dataset_json_path>
This will produce a significantly smaller subset, but it will also not contain any background images. I do not believe this to be conceptually correct, as you also want negative samples in training so as to avoid a positive-sample bias, right?
The resulting dataset has this size:
train: 6915 images, 0 backgrounds, 0 corrupt: 100%|██████████| 6915/6915
val: 864 images, 0 backgrounds, 0 corrupt: 100%|██████████| 864/864
Please advise what can be done. Is this an issue with the size of the dataset, the ratio of positive/negative samples, or something else entirely?
Environment
Minimal Reproducible Example
Please see the code snippets above. It might be difficult to provide a sample dataset due to its size.
Additional
No response
Are you willing to submit a PR?