ultralytics / ultralytics

Ultralytics YOLO11 πŸš€
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv8 freezes during training #10158

Closed tjasmin111 closed 5 months ago

tjasmin111 commented 5 months ago

Search before asking

YOLOv8 Component

No response

Bug

I'm facing some weird behavior and I'm not sure why. I'm trying to train a YOLOv8 model on an A100, and during training it freezes at 71%. What is the issue?

Ultralytics YOLOv8.2.1 πŸš€ Python-3.8.10 torch-2.2.2+cu121 CPU (Intel Xeon Gold 6330 2.00GHz)
engine/trainer: task=classify, mode=train, model=yolov8s-cls.pt, data=/home/mydata, epochs=2, time=None, patience=100, batch=16, imgsz=320, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=train3, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/classify/train3
train: /home/mydata/train... found 310575 images in 13 classes βœ… 
val: /home/mydata/val... found 34515 images in 13 classes βœ… 
test: None...
Overriding model.yaml nc=1000 with nc=13
                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    674573  ultralytics.nn.modules.head.Classify         [512, 13]                     
YOLOv8s-cls summary: 99 layers, 5097389 parameters, 5097389 gradients, 12.6 GFLOPs
Transferred 156/158 items from pretrained weights
train: Scanning /home/mydata/train... 220553 images, 0 corrupt:  71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 220530/310575 [02:19<00:43, 2078.17it/s]

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

glenn-jocher commented 5 months ago

Hey there! πŸ‘‹ It seems like you're encountering a freeze during the training of your YOLOv8 model. This can occasionally happen due to various reasons, such as insufficient memory resources, issues with the dataset, or even bugs in the software version. Here are a couple of things you might want to check:

  1. Dataset Integrity: Ensure there are no corrupt files and that the data format is consistent. Given that the scan reaches 71%, a particular file around that mark may be causing the issue.
  2. Hardware Resources: The A100 GPU is pretty powerful, but it's always a good idea to monitor system resources during training to rule out memory issues.
  3. Software Version: Ensure your Ultralytics YOLO package, PyTorch, and CUDA are up to date; compatibility issues can sometimes cause unexpected behavior (a quick check snippet follows this list).
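
For point 3, a quick sanity check from Python (a minimal sketch that only prints environment info) could look like this:

import torch
import ultralytics

ultralytics.checks()                      # Ultralytics version plus environment summary
print(torch.__version__)                  # PyTorch version
print(torch.cuda.is_available())          # should be True if the A100 is visible to PyTorch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # confirm which GPU is being used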

If these checks don’t resolve the issue, could you try reducing the dataset size or batch size and see if the problem persists? This might help isolate the cause. For example:

yolo detect train data=coco128.yaml model=yolov8n.pt epochs=2 batch=8
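
If it's easier to experiment from Python, a minimal equivalent via the Python API (a sketch only; the classification weights and data path are taken from your log above, so adjust as needed) would be:

from ultralytics import YOLO

model = YOLO("yolov8s-cls.pt")  # classification weights from your log
model.train(data="/home/mydata", epochs=2, batch=8, imgsz=320)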

If the problem continues, providing more specifics about when and how it occurs will help us dig deeper. The community is here to help! πŸš€

tjasmin111 commented 5 months ago

I just tried with batch=8. Still the same behavior. Also, I believe there are no corrupt images. How can I find out whether an image might be causing problems via the YOLO CLI?

glenn-jocher commented 5 months ago

@tjasmin111 hey! πŸ‘‹ It sounds like reducing the batch size didn't clear up the freeze during training. If you're concerned that corrupt or otherwise problematic images could be causing it, one straightforward thing to try is a smaller imgsz value in the YOLO CLI, which can sometimes bypass issues related to large image sizes:

yolo detect train data=coco128.yaml model=yolov8n.pt epochs=2 batch=8 imgsz=320

Additionally, if you suspect specific images might be causing the issue but aren't sure which ones, manually inspecting files around the 71% mark of your dataset could give some clues.

For a more systematic check, consider removing or isolating sections of your dataset to see whether a specific subset is causing the problem. This approach is more time-consuming, but it can pinpoint problematic data if present (see the sketch below).
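
If you go the isolation route, here's a rough sketch of how you could build a half-sized copy of your train split to retrain on. The destination path and the 50% split are placeholders, and it assumes the usual classification layout of one subfolder per class:

from pathlib import Path
import shutil

src = Path("/home/mydata/train")        # original train split (class subfolders)
dst = Path("/home/mydata/train_half")   # reduced copy to train on (placeholder path)

for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
    images = sorted(class_dir.iterdir())
    keep = images[: len(images) // 2]   # first half; swap to the second half on the next run
    out_dir = dst / class_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in keep:
        shutil.copy2(img, out_dir / img.name)

If training on one half completes cleanly, repeat with the other half (and keep halving) to narrow down the problematic chunk.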

Let me know if this helps or if you have more questions! 🌟

tjasmin111 commented 5 months ago

I tried imgsz=320 and it still didn't work.

When it freezes, it shows file 220530/310575. How can I iterate through the files to pinpoint that file, the same way YOLOv8 does? Does it count files in alphabetical order? Can you share a script?

glenn-jocher commented 5 months ago

Hey there! 😊 Sorry to hear you're still facing issues. If the training process freezes at a specific file count, it's likely related to how the files are being read and processed.

Yes, YOLOv8 typically iterates through files in alphanumeric order. You can use a simple Python script to mimic this behavior and identify potentially problematic files. Here’s a quick example to help you check your images:

import glob
import os

from PIL import Image

# Classification datasets keep images inside class subfolders, so collect them
# recursively and sort to roughly match the order the dataset scanner uses.
image_dir = '/path/to/your/images'
files = sorted(glob.glob(os.path.join(image_dir, '**', '*.*'), recursive=True))

index = 220530  # adjust based on where the training freezes
file_path = files[index]
try:
    with Image.open(file_path) as img:  # try opening the image
        img.verify()                    # verify it's an intact image file
    print(f"File looks OK: {file_path}")
except Exception as e:
    print(f"Problem with file: {file_path}")
    print(e)

This script checks the image where your training process freezes. Make sure to replace /path/to/your/images with the path to your dataset images.

Let me know if this helps or if you need further assistance!

tjasmin111 commented 5 months ago

I ran it, and the file looks good. I tried again, and this time it freezes at file 220409! Is there a way to enable extended YOLO logging or something?

That said, I guess a freeze probably won't raise any errors anyway.

glenn-jocher commented 5 months ago

Hi there! Glad to hear you could check the file. πŸ‘ Since you encountered a freeze at a different point upon rerunning, it suggests the issue might not be tied to a specific file but could be related to system resources or an internal process.

You can enable more verbose output in YOLOv8 by adding the verbose=True argument in your training command, which might shed some light on what's happening around the freeze point:

yolo detect train data=coco128.yaml model=yolov8n.pt epochs=2 batch=8 verbose=True

Unfortunately, if the process is freezing without throwing an error, it might not provide much additional insight, but it's worth a shot. You could also keep an eye on system resources (like GPU and CPU utilization) using tools like htop for CPU and nvidia-smi for NVIDIA GPUs to see if something stands out.
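
If you'd rather log those numbers over time than watch them live, a small polling sketch like the one below can help spot a spike near the freeze point. It assumes psutil is installed (pip install psutil) and that nvidia-smi is on the PATH:

import subprocess
import time

import psutil

while True:
    # CPU/RAM utilization via psutil
    print(f"CPU {psutil.cpu_percent(interval=1):5.1f}%  RAM {psutil.virtual_memory().percent:5.1f}%")
    try:
        # GPU utilization and memory via nvidia-smi, if available
        gpu = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print("GPU:", gpu.stdout.strip())
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available")
    time.sleep(5)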

Hope this helps a bit! Let us know how it goes.