WZMIAOMIAO commented 2 years ago

Content

1. Model Structure
2. Data Augmentation
3. Training Strategies
4. Others

1. Model Structure

YOLOv5 (v6.0/6.1) consists of:

Backbone: New CSP-Darknet53
Neck: SPPF, New CSP-PAN
Head: YOLOv3 Head

Model structure (yolov5l.yaml):

[Figure: YOLOv5 model structure diagram]

Some minor changes compared to previous versions:

Replace the Focus structure with a 6x6 Conv2d (more efficient, refer to #4825)
Replace the SPP structure with SPPF (more than double the speed)

Test code:

```python
import time

import torch
import torch.nn as nn


class SPP(nn.Module):
    def __init__(self):
        super().__init__()
        # Three parallel max-pools with growing kernel sizes.
        self.maxpool1 = nn.MaxPool2d(5, 1, padding=2)
        self.maxpool2 = nn.MaxPool2d(9, 1, padding=4)
        self.maxpool3 = nn.MaxPool2d(13, 1, padding=6)

    def forward(self, x):
        o1 = self.maxpool1(x)
        o2 = self.maxpool2(x)
        o3 = self.maxpool3(x)
        return torch.cat([x, o1, o2, o3], dim=1)


class SPPF(nn.Module):
    def __init__(self):
        super().__init__()
        # One 5x5 max-pool applied three times in sequence.
        self.maxpool = nn.MaxPool2d(5, 1, padding=2)

    def forward(self, x):
        o1 = self.maxpool(x)
        o2 = self.maxpool(o1)
        o3 = self.maxpool(o2)
        return torch.cat([x, o1, o2, o3], dim=1)


def main():
    input_tensor = torch.rand(8, 32, 16, 16)
    spp = SPP()
    sppf = SPPF()
    output1 = spp(input_tensor)
    output2 = sppf(input_tensor)

    # Both modules should produce identical outputs.
    print(torch.equal(output1, output2))

    t_start = time.time()
    for _ in range(100):
        spp(input_tensor)
    print(f"spp time: {time.time() - t_start}")

    t_start = time.time()
    for _ in range(100):
        sppf(input_tensor)
    print(f"sppf time: {time.time() - t_start}")


if __name__ == '__main__':
    main()
```

result:

```
True
spp time: 0.5373051166534424
sppf time: 0.20780706405639648
```
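The Focus change listed above can be illustrated in a similar way. The following is only a shape-level sketch: it assumes the older Focus layer is a space-to-depth slicing followed by a 3x3 convolution, and the class name `FocusSketch`, the channel count of 64 and the 64x64 input are illustrative choices, not taken from the repository. It checks that a single 6x6 Conv2d with stride 2 and padding 2 yields the same output shape; it does not claim the two layers compute identical values.

```python
import torch
import torch.nn as nn


class FocusSketch(nn.Module):
    """Assumed form of the older stem: slice into 4 sub-images, then 3x3 conv."""

    def __init__(self, c_in=3, c_out=64):
        super().__init__()
        self.conv = nn.Conv2d(c_in * 4, c_out, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # Take every second pixel in four phase offsets and stack on channels.
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(patches)


# The replacement stem: one strided 6x6 convolution.
conv_stem = nn.Conv2d(3, 64, kernel_size=6, stride=2, padding=2)

x = torch.rand(1, 3, 64, 64)
focus_out = FocusSketch()(x)
conv_out = conv_stem(x)
print(focus_out.shape, conv_out.shape)
```

Both stems halve the spatial resolution and map 3 input channels to the chosen number of output channels.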

2. Data Augmentation

Mosaic
Copy paste
Random affine (Rotation, Scale, Translation and Shear)
MixUp
Albumentations
Augment HSV (Hue, Saturation, Value)
Random horizontal flip

3. Training Strategies

Multi-scale training (0.5~1.5x)
AutoAnchor (for training custom data)
Warmup and Cosine LR scheduler
EMA (Exponential Moving Average)
Mixed precision
Evolve hyper-parameters

4. Others

4.1 Compute Losses

The YOLOv5 loss consists of three parts:

Classes loss (BCE loss)
Objectness loss (BCE loss)
Location loss (CIoU loss)

$Loss = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}$

4.2 Balance Losses

The objectness losses of the three prediction layers (P3, P4, P5) are weighted differently. The balance weights are [4.0, 1.0, 0.4], respectively.

$L_{obj} = 4.0 \cdot L_{obj}^{P3} + 1.0 \cdot L_{obj}^{P4} + 0.4 \cdot L_{obj}^{P5}$
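As a minimal sketch of this weighting, the per-layer objectness losses can be combined as below. The helper name `balanced_obj_loss` and the example loss values are hypothetical; only the weights [4.0, 1.0, 0.4] come from the text above.

```python
# Balance weights quoted above for the P3, P4 and P5 prediction layers.
BALANCE = [4.0, 1.0, 0.4]


def balanced_obj_loss(per_layer_losses, balance=BALANCE):
    """Weighted sum of per-layer objectness losses (hypothetical helper)."""
    if len(per_layer_losses) != len(balance):
        raise ValueError("expected one loss value per prediction layer")
    return sum(w * loss for w, loss in zip(balance, per_layer_losses))


# Made-up per-layer losses, only to show how the weights are applied.
example = balanced_obj_loss([0.5, 0.5, 0.5])
print(example)
```

With equal raw losses, the P3 layer contributes ten times as much as the P5 layer.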

4.3 Eliminate Grid Sensitivity

In YOLOv2 and YOLOv3, the formula for calculating the predicted target information is:

$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w \cdot e^{t_w}$
$b_h = p_h \cdot e^{t_h}$

In YOLOv5, the formula is:

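$b_x = (2 \cdot \sigma(t_x) - 0.5) + c_x$
$b_y = (2 \cdot \sigma(t_y) - 0.5) + c_y$
$b_w = p_w \cdot (2 \cdot \sigma(t_w))^2$
$b_h = p_h \cdot (2 \cdot \sigma(t_h))^2$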

Compare the center point offset before and after scaling. The offset range is widened from (0, 1) to (-0.5, 1.5), so the offset can easily reach 0 or 1, values that a plain sigmoid only approaches asymptotically.

Compare the height and width scaling ratio (relative to the anchor) before and after adjustment. The original yolo/darknet box equations have a serious flaw: width and height are completely unbounded, as they are simply out=exp(in). This is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. With the squared, scaled sigmoid, the ratio is bounded to (0, 4). Refer to this issue.
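The difference between the two decodings can be sketched in plain Python. The function names `decode_v3` and `decode_v5` are hypothetical, and only the x coordinate and the width are shown, since y and height behave identically.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_v3(t_x, t_w, c_x, p_w):
    """YOLOv2/v3 style: offset in (0, 1), width ratio exp(t_w) is unbounded."""
    b_x = sigmoid(t_x) + c_x
    b_w = p_w * math.exp(t_w)
    return b_x, b_w


def decode_v5(t_x, t_w, c_x, p_w):
    """YOLOv5 style: offset in (-0.5, 1.5), width ratio bounded to (0, 4)."""
    b_x = 2.0 * sigmoid(t_x) - 0.5 + c_x
    b_w = p_w * (2.0 * sigmoid(t_w)) ** 2
    return b_x, b_w


# Sweep the raw network outputs to see the ranges each decoding covers.
for t in (-10.0, 0.0, 10.0):
    x3, w3 = decode_v3(t, t, c_x=0.0, p_w=1.0)
    x5, w5 = decode_v5(t, t, c_x=0.0, p_w=1.0)
    print(f"t={t:6.1f}  v3: x={x3:.3f} w={w3:.3f}  v5: x={x5:.3f} w={w5:.3f}")
```

For large raw outputs the v3-style width grows exponentially, while the v5-style width saturates just below four times the anchor width.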

4.4 Build Targets

Match positive samples:

Calculate the aspect ratio of GT and Anchor Templates

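$r_w = w_{gt} / w_{at}$

$r_h = h_{gt} / h_{at}$

$r_w^{max} = \max(r_w, 1 / r_w)$

$r_h^{max} = \max(r_h, 1 / r_h)$

$r^{max} = \max(r_w^{max}, r_h^{max})$

$r^{max} < \text{anchor\_t}$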

Assign the successfully matched Anchor Templates to the corresponding cells.
Because the center point offset range is adjusted from (0, 1) to (-0.5, 1.5), a GT box can be assigned to more anchors.
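The matching steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: the threshold value 4.0 for `anchor_t` is assumed, the function names are hypothetical, and the neighbour rule (also assign the nearest horizontal and vertical neighbour cell, depending on which half of the cell the GT center falls in) is a simplification that checks only the lower grid boundary.

```python
ANCHOR_T = 4.0  # assumed threshold value, not stated in the text above


def anchor_matches(gt_wh, anchor_wh, anchor_t=ANCHOR_T):
    """Return True if the GT box and anchor template differ by less than
    a factor of anchor_t in both width and height."""
    gt_w, gt_h = gt_wh
    at_w, at_h = anchor_wh
    r_w = gt_w / at_w
    r_h = gt_h / at_h
    r_w_max = max(r_w, 1.0 / r_w)
    r_h_max = max(r_h, 1.0 / r_h)
    r_max = max(r_w_max, r_h_max)
    return r_max < anchor_t


def assigned_cells(gt_x, gt_y):
    """Cells (in grid units) that receive a GT centered at (gt_x, gt_y).

    Simplified assumption: the cell containing the center, plus the nearest
    neighbour along each axis. Upper grid bounds are not checked here.
    """
    c_x, c_y = int(gt_x), int(gt_y)
    f_x, f_y = gt_x - c_x, gt_y - c_y
    cells = [(c_x, c_y)]
    if f_x < 0.5 and c_x > 0:
        cells.append((c_x - 1, c_y))  # center lies in the left half
    elif f_x > 0.5:
        cells.append((c_x + 1, c_y))  # center lies in the right half
    if f_y < 0.5 and c_y > 0:
        cells.append((c_x, c_y - 1))  # center lies in the upper half
    elif f_y > 0.5:
        cells.append((c_x, c_y + 1))  # center lies in the lower half
    return cells


print(anchor_matches((30, 40), (10, 13)))
print(anchor_matches((30, 60), (10, 13)))
print(assigned_cells(3.2, 5.7))
```

Because the decoded offset can range over (-0.5, 1.5), a neighbouring cell can still regress to a center that lies up to half a cell outside its own boundary, which is what makes the extra assignments possible.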

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

kadirnar commented 2 years ago

@engrjav FPN and PANet are just two head architectures. Earlier versions of YOLOv5 used FPN and newer versions use PANet. CSP is a type of repeating module which has evolved into the current C3 modules.

Hi @glenn-jocher, why did you choose PANet? Is there a comparison chart? Would you prefer the Light-BiFPN module for small models? Light-Yolov5: https://arxiv.org/pdf/2208.13422.pdf

glenn-jocher commented 2 years ago

@kadirnar BiFPN and PANet are nearly identical; in a P3-P5 output model the only difference is a single shortcut. There are versions of all 3 heads available here: https://github.com/ultralytics/yolov5/tree/master/models/hub

As always all design decisions are based on empirical results.

divided7 commented 2 years ago

Hello, can we get the results of the ablation experiments, such as SPP2SPPF and Focus2Conv mAP results on big datasets?

glenn-jocher commented 2 years ago

@divided-by-7 I'm sorry, we don't have this R&D saved in a presentable manner.

dlod-openvino commented 2 years ago

@WZMIAOMIAO Could you please summarize the YOLOv5 instance segmentation model structure, especially the meaning of the outputs output0 float32[1,25200,117] and output1 float32[1,32,160,160]? Thank you very much in advance!

ishakpacal commented 1 year ago

Dear @glenn-jocher @WZMIAOMIAO, the segmentation part is excellent. What has changed in the model architecture for it? Could you provide an example model architecture? Thanks in advance.

tayahiyukon commented 1 year ago

Hi! What do k, s, p, and c represent in the model structure, respectively?

XueZ-phd commented 1 year ago

> Hi! What do k, s, p, and c represent in the model structure, respectively?

This is a simple question. k = kernel size, s = stride, p = padding, c = channel dims

tayahiyukon commented 1 year ago

> > Hi! What do k, s, p, and c represent in the model structure, respectively?
>
> This is a simple question. k = kernel size, s = stride, p = padding, c = channel dims

Okay, thank you very much!

karl-gardner commented 1 year ago

Hello @glenn-jocher or anyone who knows the answer. I am trying to understand the build targets process a little more. When you say GTx%1>0.5 and GTy%1>0.5 is the % just the modulus? If it is the modulo operator, then why is this used?

Thanks,

Karl Gardner

scraus commented 1 year ago

@WZMIAOMIAO @glenn-jocher or anyone who knows. I am trying to understand more about the model structure. Is there an article that discusses and explains the YOLOv5 structure? Thanks!

gracesmrngkr commented 1 year ago

Hi @glenn-jocher, what is the formula by which a 640x640x3 input image becomes 320x320x64 with k=3, s=2, p=1?

glenn-jocher commented 1 year ago

@gracesmrngkr this transformation is governed by the following formula:

$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$

So in this case, with an input size of 640 and a kernel size of 3, a stride of 2, and padding of 1, the output size would be 320.
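A small sketch of this formula in Python is given below; the function name `conv_output_size` is hypothetical. Note that the formula only covers the spatial size: the 64 output channels come from the number of filters in the convolution, not from k, s or p.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# The case from the question: 640 input, k=3, s=2, p=1.
print(conv_output_size(640, kernel_size=3, stride=2, padding=1))

# The 6x6 stem convolution with s=2, p=2 also halves the resolution.
print(conv_output_size(640, kernel_size=6, stride=2, padding=2))
```

Here (640 - 3 + 2) / 2 = 319.5, which floors to 319, and adding 1 gives 320.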

ultralytics / yolov5

YOLOv5 (6.0/6.1) brief summary #6998