ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Default training parameters for yolov8n? #14530

Open Ahelsamahy opened 1 month ago

Ahelsamahy commented 1 month ago


Question

Hi, I'm trying to create a stripped-down yolov8n model that detects only humans, to run on a Jetson Nano. It will be used to keep track of a human in front of a small vehicle. I managed to create the single-class model using the provided .yaml file:

# Ultralytics YOLO , AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 1 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)
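
For reference, a minimal sketch of building and sanity-checking the model from this config (the filename here is assumed):

from ultralytics import YOLO

# Hypothetical filename pointing at the modified yaml above.
model = YOLO("yolov8n_1class.yaml")
model.info()  # prints layers / parameters / GFLOPs so the nc=1 build can be verified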

I want to train the model so its detection is as good as the default yolov8n model (or even better in different weather conditions). A previous answer mentioned the parameters it was trained with.

I trained my .yaml with them and got a different score from the one for yolov8n.pt.

Here are the test results for my model, yolov8n, and yolov10n from running model.val(data="coco.yaml"). The test was done on an RTX 3090.
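
Each run was roughly the following (a minimal sketch; save_json is assumed, since a predictions.json is saved below):

from ultralytics import YOLO

model = YOLO("./training_runs/yolov8_run_240717_T_0933/weights/last.pt")  # or yolov8n.pt / yolov10n.pt
metrics = model.val(data="coco.yaml", save_json=True)  # save_json writes predictions.json for pycocotools
print(metrics.box.map, metrics.box.map50, metrics.box.map75)  # mAP50-95, mAP50, mAP75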


# My stripped-down model
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 313/313 [00:27<00:00, 11.20it/s]
                   all       5000      36335    0.00887    0.00588    0.00696    0.00426
                person       2693      10777      0.709      0.471      0.557      0.341
Speed: 0.2ms preprocess, 1.5ms inference, 0.0ms loss, 0.8ms postprocess per image
Saving runs/detect/val3/predictions.json...

loading annotations into memory...
Done (t=0.31s)
creating index...
index created!
Loading and preparing results...
DONE (t=2.22s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=33.89s).
Accumulating evaluation results...
DONE (t=4.85s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.004
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.007
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.007
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.005
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.007
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.008
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.010
Results saved to runs/detect/val3
Results for model loaded from ./training_runs/yolov8_run_240717_T_0933/weights/last.pt:
mAP50-95: 0.004259269502738342
mAP50: 0.006957906997456405
mAP75: 0.00438899912665478
mAPs by category: [0.34074 for person; 0 for the remaining 79 COCO classes]

# YOLOv8n

WARNING ⚠️ updating to 'imgsz=640'. 'train' and 'val' imgsz must be an integer, while 'predict' and 'export' imgsz may be a [h, w] list or an integer, i.e. 'yolo export imgsz=640,480' or 'yolo export imgsz=640'
Ultralytics YOLOv8.2.54  Python-3.10.12 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
val: Scanning /mnt/sdb1/users-data/ahmedmahfouz/following/Inf-FOL/datasets/coco/labels/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 313/313 [00:53<00:00,  5.84it/s]
                   all       5000      36335      0.632      0.475      0.521      0.371
                person       2693      10777      0.753      0.673      0.745      0.514
Speed: 0.2ms preprocess, 0.9ms inference, 0.0ms loss, 0.7ms postprocess per image
Saving runs/detect/val4/predictions.json...

loading annotations into memory...
Done (t=0.33s)
creating index...
index created!
Loading and preparing results...
DONE (t=4.71s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=84.95s).
Accumulating evaluation results...
DONE (t=19.77s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.374
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.526
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.405
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.188
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.410
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.320
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.589
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.369
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.654
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.768
Results saved to runs/detect/val4
Results for model loaded from yolov8n.pt:
mAP50-95: 0.3709853852916213
mAP50: 0.5209078049243206
mAP75: 0.40311046304685466
mAPs by category: [    0.51427     0.26433     0.36388     0.41327     0.65279     0.62003     0.64572     0.29313     0.21026     0.21124     0.60849     0.63023     0.44099     0.19329     0.27781      0.6516     0.59118     0.52444     0.45966     0.48716     0.62998      0.6893     0.65905     0.68315     0.10033     0.35931
    0.084863      0.2684     0.34218      0.5841       0.188     0.26651     0.32832     0.37976      0.2157     0.30202     0.45131     0.30924     0.39737     0.29755     0.26955     0.35063     0.26368     0.10589    0.098828     0.39031     0.23341     0.15666     0.34894     0.28057     0.20968     0.18901
     0.36425      0.5018     0.40772     0.29238     0.25713     0.43462     0.22562     0.42644     0.29289     0.64187     0.55252     0.57991      0.5266     0.16011     0.48176     0.27999      0.5139     0.34916     0.31403     0.33706     0.51057    0.096766     0.45648     0.32102     0.27798     0.42001
   0.0037661     0.16513]

# YOLOv10n

Ultralytics YOLOv8.2.54  Python-3.10.12 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24260MiB)
YOLOv10n summary (fused): 285 layers, 2762608 parameters, 0 gradients, 8.6 GFLOPs
val: Scanning /mnt/sdb1/users-data/ahmedmahfouz/following/Inf-FOL/datasets/coco/labels/val2017.cache... 4952 images, 48 backgrounds, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 313/313 [00:51<00:00,  6.03it/s]
                   all       5000      36335      0.644      0.488      0.534      0.383
                person       2693      10777      0.766      0.655      0.744      0.519

Speed: 0.2ms preprocess, 1.4ms inference, 0.0ms loss, 0.1ms postprocess per image
Saving runs/detect/val5/predictions.json...

loading annotations into memory...
Done (t=0.30s)
creating index...
index created!
Loading and preparing results...
DONE (t=4.07s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=77.10s).
Accumulating evaluation results...
DONE (t=15.71s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.385
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.538
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.417
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.190
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.423
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.546
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.539
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.603
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.379
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.659
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.779
Results saved to runs/detect/val5
Results for model loaded from yolov10n.pt:
mAP50-95: 0.38328865898217773
mAP50: 0.5339398378888086
mAP75: 0.4160252082713042
mAPs by category: [    0.51903     0.27192     0.37043     0.43695     0.68256      0.6436     0.67419     0.32235     0.22041     0.22019       0.647     0.59678     0.48338      0.2146     0.28719     0.69614     0.62605     0.57865     0.49503     0.50215     0.64005     0.72744       0.682     0.69003     0.10932     0.38091
     0.09737     0.27565     0.35804     0.60554     0.19323     0.28987     0.33906     0.41467     0.26256     0.30475     0.48174     0.31947     0.40777     0.30268     0.27433     0.36249     0.27121     0.11984       0.107     0.38433     0.22398     0.15883     0.36528     0.28305     0.21529     0.17351
     0.34031       0.514     0.41937     0.34501     0.25927       0.454     0.23237     0.45435     0.29694     0.66075     0.57145     0.58748     0.52857     0.19731     0.48848     0.28449     0.46431     0.36347     0.23365     0.34155     0.56282    0.099099       0.451     0.33139     0.26728     0.45347
   0.0010717        0.15]

Additional

I initially trained my model for 500 epochs on the COCO dataset with lots of augmentation parameters, as here:

import datetime
from ultralytics import YOLO
from ultralytics.data.augment import Albumentations
from ultralytics.utils import LOGGER, colorstr
from ultralytics import settings

settings.update({"clearml": False, "tensorboard": False})

model = YOLO("./train/yolov8n_blank.yaml")

# Generate timestamps and names for directories
timestamp = datetime.datetime.now().strftime("%y%m%d_T_%H%M")
run_name = f"yolov8_run_{timestamp}"
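
# The function below is a drop-in replacement for ultralytics' Albumentations.__init__;
# it is monkeypatched onto the class after the definition (see the linked issue) so
# training picks up the custom transform list T instead of the library defaults.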

def __init__(self, p=1.0):
    """Initialize the transform object for YOLO bbox formatted params."""
    self.p = p
    self.transform = None
    prefix = colorstr("albumentations: ")
    try:
        import albumentations as A

        # List of possible spatial transforms
        spatial_transforms = {
            "Affine",
            "BBoxSafeRandomCrop",
            "CenterCrop",
            "CoarseDropout",
            "Crop",
            "CropAndPad",
            "CropNonEmptyMaskIfExists",
            "D4",
            "ElasticTransform",
            "Flip",
            "GridDistortion",
            "GridDropout",
            "HorizontalFlip",
            "Lambda",
            "LongestMaxSize",
            "MaskDropout",
            "MixUp",
            "Morphological",
            "NoOp",
            "OpticalDistortion",
            "PadIfNeeded",
            "Perspective",
            "PiecewiseAffine",
            "PixelDropout",
            "RandomCrop",
            "RandomCropFromBorders",
            "RandomGridShuffle",
            "RandomResizedCrop",
            "RandomRotate90",
            "RandomScale",
            "RandomSizedBBoxSafeCrop",
            "RandomSizedCrop",
            "Resize",
            "Rotate",
            "SafeRotate",
            "ShiftScaleRotate",
            "SmallestMaxSize",
            "Transpose",
            "VerticalFlip",
            "XYMasking",
        }  # from https://albumentations.ai/docs/getting_started/transforms_and_targets/#spatial-level-transforms

        # Transforms
        T = [
            # Rotates the image by a random degree between -10 and 10 degrees with a 50% chance.
            A.Rotate(limit=10, p=0.5),
            # Adds a blur effect to the image with a very low probability (1%).
            A.Blur(p=0.01),
            # Applies a median blur, which helps reduce noise in the image, also with a 1% probability.
            A.MedianBlur(p=0.01),
            # Converts the image to grayscale with a 1% probability, which can be useful for certain vision tasks.
            A.ToGray(p=0.01),
            # Applies Contrast Limited Adaptive Histogram Equalization (CLAHE) with a 1% chance to improve image contrast.
            A.CLAHE(p=0.01),
            # Randomly adjusts brightness and contrast with an 80% chance, enhancing visual features.
            A.RandomBrightnessContrast(p=0.8),
            # Alters gamma values to simulate different lighting conditions, applied with an 80% probability.
            A.RandomGamma(p=0.8),
            # Compresses the image quality to a lower bound of 75% quality, applied with a 60% chance to simulate compression artifacts.
            A.ImageCompression(quality_lower=75, p=0.6),
            # # Simulates rain on the image with a 20% probability, adjusting brightness to simulate wetness.
            # A.RandomRain(p=0.2, brightness_coefficient=0.9),
            # # Adds fog to the image, with adjustable intensity and transparency, applied with a 20% probability.
            # A.RandomFog(fog_coef_lower=0.3, fog_coef_upper=1, alpha_coef=0.08, p=0.2),
            # # Applies a complex elastic transformation with a 50% chance, distorting the image in a realistic way.
            # A.ElasticTransform(alpha=2, sigma=50, alpha_affine=50, p=0.5),
            # # Simulates a sun flare effect with a specified region of interest, applied with a 20% chance.
            # A.RandomSunFlare(flare_roi=(0, 0, 1, 0.5), angle_lower=0, p=0.2),
            # # Adds a snow effect, with control over snow intensity and brightness, with a 20% probability.
            # A.RandomSnow(snow_point_lower=0.1, snow_point_upper=0.3, brightness_coeff=2.5, p=0.2),
            # # Creates random holes or black spots in the image to simulate occlusion, with a 50% chance.
            # A.CoarseDropout(max_holes=10, max_height=8, max_width=8, min_holes=1, min_height=4, min_width=4, fill_value=0, p=0.5),
            # # Adds Gaussian noise to the image, simulating sensor noise, with a 50% probability.
            # A.GaussNoise(var_limit=(10.0, 50.0), p=0.5),
            # # Simulates noise associated with high ISO settings in cameras, with adjustable color shift and intensity, applied with a 50% chance.
            # A.ISONoise(color_shift=(0.01, 0.05), intensity=(0.1, 0.5), p=0.5),
        ]

        # Compose transforms
        self.contains_spatial = any(transform.__class__.__name__ in spatial_transforms for transform in T)
        self.transform = A.Compose(T, bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"])) if self.contains_spatial else A.Compose(T)
        LOGGER.info(prefix + ", ".join(f"{x}".replace("always_apply=False, ", "") for x in T if x.p))
    except ImportError:  # package not installed, skip
        pass
    except Exception as e:
        LOGGER.info(f"{prefix}{e}")

# https://github.com/ultralytics/ultralytics/issues/7291
Albumentations.__init__ = __init__

model.train(
    data="coco.yaml",
    classes=[0], # to train it only on human class
    epochs=500,
    imgsz=640,
    patience=60,
    device=0,
    workers=16, 
    verbose=True,
    project="training_runs",
    name=run_name,
    workspace=8,
    optimizer="SGD",
    # augmentation parameters
    momentum=0.937,  # Momentum helps accelerate SGD in the relevant direction and dampens oscillations
    augment=True,  # Enable data augmentation to improve model robustness
    perspective=0.001,  # Adds a small perspective transformation to simulate different camera angles
    shear=30,  # Shears images by x degrees to simulate objects being viewed from different angles
    scale=0.7,  # Scales images by a factor to simulate objects at varying distances
    translate=0.2,  # Translates images by % of the image size to handle partially visible objects
    degrees=100,  # Randomly rotates images within a range of +/- degrees
    hsv_h=0.8,  # Adjusts hue by +/- % of the color wheel to simulate different lighting conditions
    hsv_s=0.75,  # Adjusts saturation by +/- % to enhance or reduce the intensity of colors
    hsv_v=0.85,  # Adjusts image brightness by +/- % to handle different environmental light conditions
    fliplr=0.5,  # Flips images horizontally with a % probability to increase data variability
    flipud=0.5,  # Optionally flips images vertically with a % probability
    mixup=0.5,  # Applies mixup augmentation with a % probability; blends two images and their labels
    erasing=0.5,  # Randomly erases parts of the image with a % probability to simulate occlusion
    crop_fraction=0.5,  # Crops images to % of their size to focus more on central features
    auto_augment="randaugment",  # Applies RandAugment automatically to increase variability
)
Y-T-G commented 1 month ago

Removing classes hardly speeds up the model and usually makes the accuracy worse.

Ahelsamahy commented 1 month ago

@Y-T-G how does not detecting other classes make the detected classes worse?

Y-T-G commented 1 month ago

@Ahelsamahy Because models learn better when you force them to distinguish between classes, which is what happens when you have more classes.

Ahelsamahy commented 1 month ago

@Y-T-G and can't this be solved by training the model on a larger dataset and applying different augmentation parameters to the training process?

Y-T-G commented 1 month ago

@Ahelsamahy You won't beat the original score. Training a human detection model is harder than it looks. Not excluding any classes also means the model gets to learn what is NOT a human from that extra data, which reduces false positives.

You also used some augmentations that shouldn't be used here, like flipud and probably even degrees.
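
For example, you could retrain with those zeroed out (a minimal sketch of just the relevant arguments, reusing model from your script):

# Same training call, with the orientation-heavy augmentations disabled.
model.train(
    data="coco.yaml",
    classes=[0],
    epochs=500,
    imgsz=640,
    flipud=0.0,  # people are almost never upside down in a front-of-vehicle view
    degrees=0.0,  # +/-100 degree rotations badly distort person geometry
)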

Ahelsamahy commented 1 month ago

@Y-T-G so you are saying that if I added more classes to my model, it should be able to detect humans better? I don't think this is how it works. Have you tested it before?

You won't beat the original score

If the original model is trained on COCO and I test it on another dataset, one my custom model is trained on, then my model will outperform the original model. It is just a matter of what the model is trained on and which test split is used later.

Y-T-G commented 1 month ago

I don't think this is how it works. Have you tested it before?

Yeah, I have tested training a model for person detection without any other classes before, and it was always worse than the pretrained model. It makes sense to me. Models have been shown to perform better when they are trained to perform more tasks, because the extra tasks introduce regularization that prevents overfitting. It's the premise behind multi-task learning: https://en.m.wikipedia.org/wiki/Multi-task_learning

And what's your goal anyway? Because none of this will increase your model's speed noticeably.

If you want to test the performance, you should try running it on a video and compare them.
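
Something like this (a sketch; the video path is a placeholder):

from ultralytics import YOLO

# Run both models over the same clip and compare person detections per frame.
for weights in ["yolov8n.pt", "./training_runs/yolov8_run_240717_T_0933/weights/last.pt"]:
    model = YOLO(weights)
    for r in model.predict("test_clip.mp4", classes=[0], stream=True, verbose=False):
        print(weights, len(r.boxes))  # person detections in this frame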

Ahelsamahy commented 1 month ago

I'm trying to achieve a speed of 25 fps on the Jetson Nano. I made a discussion about it before and managed to get to a _blank model, but I had to train it myself from the beginning. Glenn said it would, which can streamline the computation and reduce the inference time.

I have tried other options, like setting the imgsz parameter and the floating-point precision, which did help with speed.

Can you explain more about the results you got from your training sessions?

Y-T-G commented 1 month ago

I had a use case where I had to improve person detection on CCTV views, which are different from the images in the COCO dataset. I tried retraining the model on the COCO images plus the added images, but the results were always worse. In the end, I just used two models: the pretrained one and the custom-trained one.

If I were given that task now, I would use this to add an extra head for the new images: https://y-t-g.github.io/tutorials/yolov8n-add-classes/

Have you tried converting to TensorRT with int8 quantization?
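
In recent ultralytics versions that is roughly (a sketch; INT8 calibrates on images from the dataset yaml, and it needs a TensorRT build that supports it on your device):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# INT8 TensorRT export; calibration images are drawn from the dataset yaml.
model.export(format="engine", device=0, int8=True, data="coco.yaml", imgsz=640)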

Ahelsamahy commented 1 month ago

Thanks for sharing the resources. I actually came across your blog when I was preparing for the project before. I think my use case is different from yours: I'm only trying to detect humans with high precision, while you were trying to add more classes?

I think it comes back to your point that "models have been shown to perform better when they are trained to perform more tasks", but I don't want to end up with a 50 MB .pt model if it reduces the model's speed while only improving detection a little (correct me if I'm wrong).

I have indeed exported my model to a .engine, but couldn't use int8=True due to limitations of the Nano's hardware:

self.pt_model = YOLO("yolov8n.pt") 
# Export the model to TensorRT format with FP16 quantization and (640,480) resolution
self.pt_model.export(format="engine", device="cuda", imgsz=(self.img_size[1], self.img_size[0]), half=True)
Y-T-G commented 1 month ago

Is this the old Nano or Orin Nano? Can you upgrade JetPack?

Y-T-G commented 1 month ago

What are the FLOPs of your one-class model vs. the FLOPs of the original YOLOv8n (model.info())? The reduction will be really insignificant in terms of speed.

To put it into perspective, you're only changing the size of the last few layers by reducing nc.
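
For a quick check (a sketch, using the config path from your script):

from ultralytics import YOLO

# Compare the reported GFLOPs of the one-class config vs. the stock model.
YOLO("./train/yolov8n_blank.yaml").info()
YOLO("yolov8n.pt").info()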