I am training a YOLOv8 instance segmentation model with the following command:
!yolo task=segment mode=train batch=-1 model=yolov8l-seg.pt data=data.yaml epochs=150 imgsz=2176 save=true
Training proceeds normally until it reaches epochs 50 to 52.
It then fails with a FileNotFoundError because it cannot find a training image. I can confirm that the image in question DOES exist in the directory.
I have restarted training several times and it always fails around epoch 50, which leads me to believe it is a time-out issue.
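One way to test the time-out theory outside of training is to check whether every image under the training directory can actually be opened, retrying a few times per file. The following is a minimal sketch, not part of Ultralytics: the function name find_unreadable_images and its parameters (retries, delay_s, suffixes) are hypothetical, and it uses only the Python standard library. If a file appears in the result on one run and not on the next, that would point to intermittent reads from the mounted Drive rather than a missing file.

```python
import time
from pathlib import Path


def find_unreadable_images(image_dir, retries=3, delay_s=1.0,
                           suffixes=(".jpg", ".jpeg", ".png")):
    """Return paths under image_dir that could not be opened after `retries` attempts.

    Hypothetical helper for diagnosis only; names and defaults are assumptions.
    """
    failed = []
    for path in sorted(Path(image_dir).rglob("*")):
        if path.suffix.lower() not in suffixes:
            continue
        for attempt in range(retries):
            try:
                with open(path, "rb") as fh:
                    fh.read(1)  # force a real read, not just a directory lookup
                break  # the file opened, so stop retrying
            except OSError:
                # FileNotFoundError is a subclass of OSError
                if attempt < retries - 1:
                    time.sleep(delay_s)
        else:
            # the loop never hit `break`, so every attempt failed
            failed.append(path)
    return failed
```

Calling it on the images/train directory shown in the error, for example `find_unreadable_images(train_dir)`, returns a list of pathlib.Path objects; an empty list means every image opened on that pass.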
The training output follows; the error is the FileNotFoundError at the end.
Ultralytics YOLOv8.0.28 🚀 Python-3.10.12 torch-2.0.1+cu118 CUDA:0 (NVIDIA A100-SXM4-40GB, 40514MiB)
yolo/engine/trainer: task=segment, mode=train, model=yolov8l-seg.pt, data=data.yaml, epochs=150, patience=50, batch=-1, imgsz=2176, save=True, cache=False, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=True, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, save_dir=runs/segment/train9
Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
100% 755k/755k [00:00<00:00, 138MB/s]
2023-08-08 00:22:40.271747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Overriding model.yaml nc=80 with nc=3
0 -1 1 1856 ultralytics.nn.modules.Conv [3, 64, 3, 2]
1 -1 1 73984 ultralytics.nn.modules.Conv [64, 128, 3, 2]
2 -1 3 279808 ultralytics.nn.modules.C2f [128, 128, 3, True]
3 -1 1 295424 ultralytics.nn.modules.Conv [128, 256, 3, 2]
4 -1 6 2101248 ultralytics.nn.modules.C2f [256, 256, 6, True]
5 -1 1 1180672 ultralytics.nn.modules.Conv [256, 512, 3, 2]
6 -1 6 8396800 ultralytics.nn.modules.C2f [512, 512, 6, True]
7 -1 1 2360320 ultralytics.nn.modules.Conv [512, 512, 3, 2]
8 -1 3 4461568 ultralytics.nn.modules.C2f [512, 512, 3, True]
9 -1 1 656896 ultralytics.nn.modules.SPPF [512, 512, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.Concat [1]
12 -1 3 4723712 ultralytics.nn.modules.C2f [1024, 512, 3]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.Concat [1]
15 -1 3 1247744 ultralytics.nn.modules.C2f [768, 256, 3]
16 -1 1 590336 ultralytics.nn.modules.Conv [256, 256, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.Concat [1]
18 -1 3 4592640 ultralytics.nn.modules.C2f [768, 512, 3]
19 -1 1 2360320 ultralytics.nn.modules.Conv [512, 512, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.Concat [1]
21 -1 3 4723712 ultralytics.nn.modules.C2f [1024, 512, 3]
22 [15, 18, 21] 1 7891321 ultralytics.nn.modules.Segment [3, 32, 256, [256, 512, 512]]
YOLOv8l-seg summary: 401 layers, 45938361 parameters, 45938345 gradients, 220.8 GFLOPs
Transferred 651/657 items from pretrained weights
AutoBatch: Computing optimal batch size for imgsz=2176
AutoBatch: CUDA:0 (NVIDIA A100-SXM4-40GB) 39.56G total, 0.35G reserved, 0.35G allocated, 38.86G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
45938361 2552 14.506 69.34 nan (1, 3, 2176, 2176) list
45938361 5105 28.242 76.7 nan (2, 3, 2176, 2176) list
CUDA out of memory. Tried to allocate 578.00 MiB (GPU 0; 39.56 GiB total capacity; 35.69 GiB already allocated; 94.56 MiB free; 36.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
CUDA out of memory. Tried to allocate 1.13 GiB (GPU 0; 39.56 GiB total capacity; 35.16 GiB already allocated; 136.56 MiB free; 36.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
CUDA out of memory. Tried to allocate 578.00 MiB (GPU 0; 39.56 GiB total capacity; 34.86 GiB already allocated; 174.56 MiB free; 36.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
AutoBatch: Using batch-size 1 for CUDA:0 15.21G/39.56G (38%) ✅
optimizer: SGD(lr=0.01) with parameter groups 106 weight(decay=0.0), 117 weight(decay=0.0005), 116 bias
train: Scanning /content/drive/.shortcut-targets-by-id/1of8frlV3H1_GB4M8xYLa96K0MYP-Xjx5/Fire Behavior/Data/YOLOv6/7_20_2023/labels/train.cache... 376 images, 0 backgrounds, 0 corrupt: 100% 376/376 [00:00<?, ?it/s]
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
val: Scanning /content/drive/.shortcut-targets-by-id/1of8frlV3H1_GB4M8xYLa96K0MYP-Xjx5/Fire Behavior/Data/YOLOv6/7_20_2023/labels/valid.cache... 46 images, 0 backgrounds, 0 corrupt: 100% 46/46 [00:00<?, ?it/s]
Image sizes 2176 train, 2176 val
Using 0 dataloader workers
Logging results to runs/segment/train9
Starting training for 150 epochs...
Traceback (most recent call last):
File "/usr/local/bin/yolo", line 8, in <module>
sys.exit(entrypoint())
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/cfg/__init__.py", line 266, in entrypoint
getattr(model, mode)(vars(cfg))
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/engine/model.py", line 214, in train
self.trainer.train()
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/engine/trainer.py", line 182, in train
self._do_train(int(os.getenv("RANK", -1)), world_size)
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/engine/trainer.py", line 283, in _do_train
for i, batch in pbar:
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/data/base.py", line 181, in __getitem__
return self.transforms(self.get_label_info(index))
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/data/base.py", line 186, in get_label_info
label["img"], label["ori_shape"], label["resized_shape"] = self.load_image(index)
File "/usr/local/lib/python3.10/dist-packages/ultralytics/yolo/data/base.py", line 124, in load_image
raise FileNotFoundError(f"Image Not Found {f}")
FileNotFoundError: Image Not Found /content/drive/.shortcut-targets-by-id/1of8frlV3H1_GB4M8xYLa96K0MYP-Xjx5/Fire Behavior/Data/YOLOv6/7_20_2023/images/train/DJI_0013-00_00_29_19-Still026_jpg.rf.d032a549c22ad5601cf5e45ca7ef4af2.jpg