ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv5 (6.0/6.1) brief summary #6998

Open WZMIAOMIAO opened 2 years ago

WZMIAOMIAO commented 2 years ago

Content

1. Model Structure

YOLOv5 (v6.0/6.1) consists of:

Model structure (yolov5l.yaml):

[Figure: yolov5 model structure diagram]

Some minor changes compared to previous versions:

  1. Replace the Focus structure with a 6x6 Conv2d (more efficient, see #4825)
  2. Replace the SPP structure with SPPF (more than twice as fast)
Test code:

```python
import time
import torch
import torch.nn as nn


class SPP(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool1 = nn.MaxPool2d(5, 1, padding=2)
        self.maxpool2 = nn.MaxPool2d(9, 1, padding=4)
        self.maxpool3 = nn.MaxPool2d(13, 1, padding=6)

    def forward(self, x):
        o1 = self.maxpool1(x)
        o2 = self.maxpool2(x)
        o3 = self.maxpool3(x)
        return torch.cat([x, o1, o2, o3], dim=1)


class SPPF(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool = nn.MaxPool2d(5, 1, padding=2)

    def forward(self, x):
        o1 = self.maxpool(x)
        o2 = self.maxpool(o1)
        o3 = self.maxpool(o2)
        return torch.cat([x, o1, o2, o3], dim=1)


def main():
    input_tensor = torch.rand(8, 32, 16, 16)
    spp = SPP()
    sppf = SPPF()
    output1 = spp(input_tensor)
    output2 = sppf(input_tensor)

    print(torch.equal(output1, output2))

    t_start = time.time()
    for _ in range(100):
        spp(input_tensor)
    print(f"spp time: {time.time() - t_start}")

    t_start = time.time()
    for _ in range(100):
        sppf(input_tensor)
    print(f"sppf time: {time.time() - t_start}")


if __name__ == '__main__':
    main()
```

Result:

```
True
spp time: 0.5373051166534424
sppf time: 0.20780706405639648
```
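The `True` in the result reflects a general property of stride-1 max pooling: chaining two 5-wide pools covers the same window as a single 9-wide pool, and chaining three covers a 13-wide window. The following is a minimal pure-Python sketch of that property in one dimension. The helper `maxpool1d` and the sample signal are illustrative and not part of YOLOv5; borders are clipped, which has the same effect as padding with negative infinity.

```python
def maxpool1d(values, k):
    """Stride-1 max pool over a window of width k, clipped at the borders."""
    r = k // 2
    n = len(values)
    return [max(values[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


# Arbitrary sample signal, chosen only for illustration.
signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

twice_5 = maxpool1d(maxpool1d(signal, 5), 5)   # SPPF-style: pool the pooled output
thrice_5 = maxpool1d(twice_5, 5)

print(twice_5 == maxpool1d(signal, 9))     # two 5-pools vs one 9-pool
print(thrice_5 == maxpool1d(signal, 13))   # three 5-pools vs one 13-pool
```

Because SPPF reuses each pooled result instead of running three independent pools with large kernels, it does less work for the same output, which is consistent with the timing above.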

2. Data Augmentation

3. Training Strategies

4. Others

4.1 Compute Losses

The YOLOv5 loss consists of three parts:

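$$Loss = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}$$

where $L_{cls}$ is the classes loss, $L_{obj}$ the objectness loss and $L_{loc}$ the location loss.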

4.2 Balance Losses

The objectness losses of the three prediction layers (P3, P4, P5) are weighted differently. The balance weights are [4.0, 1.0, 0.4], respectively.

$$L_{obj} = 4.0 \cdot L_{obj}^{P3} + 1.0 \cdot L_{obj}^{P4} + 0.4 \cdot L_{obj}^{P5}$$
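As a rough illustration of this weighting, the sketch below combines three per-layer objectness losses with the balance weights quoted above. The function name `balanced_obj_loss` and the example loss values are hypothetical; this is not the code from the repository's loss module.

```python
# Balance weights from the text, ordered P3, P4, P5.
BALANCE = [4.0, 1.0, 0.4]


def balanced_obj_loss(per_layer_losses, balance=BALANCE):
    """Weighted sum of per-layer objectness losses."""
    if len(per_layer_losses) != len(balance):
        raise ValueError("expected one loss value per prediction layer")
    return sum(w * loss for w, loss in zip(balance, per_layer_losses))


# Hypothetical per-layer losses, chosen only for illustration.
total = balanced_obj_loss([0.5, 0.5, 0.5])
print(total)
```

With equal per-layer losses, the P3 term contributes ten times as much as the P5 term.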

4.3 Eliminate Grid Sensitivity

In YOLOv2 and YOLOv3, the formula for calculating the predicted target information is:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w \cdot e^{t_w}$$
$$b_h = p_h \cdot e^{t_h}$$

In YOLOv5, the formula is:

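$$b_x = (2 \cdot \sigma(t_x) - 0.5) + c_x$$
$$b_y = (2 \cdot \sigma(t_y) - 0.5) + c_y$$
$$b_w = p_w \cdot (2 \cdot \sigma(t_w))^2$$
$$b_h = p_h \cdot (2 \cdot \sigma(t_h))^2$$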

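Compare the center point offset before and after scaling. The center point offset range is extended from (0, 1) to (-0.5, 1.5), so the offset can easily reach 0 or 1. With a plain sigmoid, those two values are only approached as the raw prediction goes to negative or positive infinity.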
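The effect on the center offset can be checked numerically. The sketch below compares a plain sigmoid offset with the scaled form 2·σ(t) − 0.5; the function names `offset_v3` and `offset_v5` are illustrative labels, not identifiers from the codebase.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def offset_v3(t):
    """YOLOv2/v3-style center offset within the cell: range (0, 1)."""
    return sigmoid(t)


def offset_v5(t):
    """YOLOv5-style center offset: range (-0.5, 1.5)."""
    return 2.0 * sigmoid(t) - 0.5


# Raw predictions; +/- ln(3) is where the scaled form hits exactly 1 and 0.
for t in (-6.0, -math.log(3), 0.0, math.log(3), 6.0):
    print(f"t={t:+.3f}  v3={offset_v3(t):.4f}  v5={offset_v5(t):.4f}")
```

The scaled form reaches an offset of 0 at t = −ln 3 and an offset of 1 at t = ln 3, both finite, whereas the plain sigmoid never reaches either value.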

Compare the height and width scaling ratio (relative to the anchor) before and after adjustment. The original YOLO/Darknet box equations have a serious flaw: width and height are completely unbounded because they are simply out=exp(in). This is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. See this issue.
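To make the difference concrete, the sketch below evaluates both size factors for a few raw predictions. It assumes the YOLOv5 factor is (2·σ(t))², which is bounded in (0, 4); the function names are illustrative only.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def scale_v3(t):
    """YOLOv2/v3-style size factor relative to the anchor: unbounded."""
    return math.exp(t)


def scale_v5(t):
    """YOLOv5-style size factor relative to the anchor: bounded in (0, 4)."""
    return (2.0 * sigmoid(t)) ** 2


for t in (0.0, 2.0, 5.0, 10.0):
    print(f"t={t:4.1f}  exp={scale_v3(t):10.2f}  v5={scale_v5(t):.4f}")
```

Both forms give a factor of 1 at t = 0, but the exponential grows without limit while the squared-sigmoid form saturates below 4.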

4.4 Build Targets

Match positive samples:

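$$r_w = w_{gt} / w_{at}$$

$$r_h = h_{gt} / h_{at}$$

$$r_w^{max} = \max(r_w, 1 / r_w)$$

$$r_h^{max} = \max(r_h, 1 / r_h)$$

$$r^{max} = \max(r_w^{max}, r_h^{max})$$

$$r^{max} < \text{anchor\_t}$$

Here $gt$ denotes the ground-truth box and $at$ the anchor template.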
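A minimal sketch of shape-based matching, assuming a ground-truth box matches an anchor template when neither its width ratio nor its height ratio (nor their inverses) reaches a threshold `anchor_t`. The default of 4.0, the function names and the anchor sizes below are assumptions made for illustration, not values read from this thread.

```python
def max_ratio(gt_wh, anchor_wh):
    """Largest of the width/height ratios and their inverses."""
    r_w = gt_wh[0] / anchor_wh[0]
    r_h = gt_wh[1] / anchor_wh[1]
    return max(r_w, 1.0 / r_w, r_h, 1.0 / r_h)


def match_anchors(gt_wh, anchors, anchor_t=4.0):
    """Indices of anchor templates whose shape ratio stays below anchor_t."""
    return [i for i, a in enumerate(anchors) if max_ratio(gt_wh, a) < anchor_t]


# Hypothetical anchor templates as (width, height) in pixels.
anchors = [(10, 13), (16, 30), (33, 23)]

print(match_anchors((60, 30), anchors))    # first anchor is too narrow: 60 / 10 = 6
print(match_anchors((200, 40), anchors))   # far wider than every anchor
```

Taking the maximum of each ratio and its inverse treats "four times larger" and "four times smaller" symmetrically, so a single threshold covers both cases.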

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

kadirnar commented 2 years ago

> @engrjav FPN and PANet are just two head architectures. Earlier versions of YOLOv5 used FPN and newer versions use PANet. CSP is a type of repeating module which has evolved into the current C3 modules.

Hi @glenn-jocher, why did you choose PANet? Is there a comparison chart? Would you prefer the Light-BiFPN module for small models? Light-Yolov5: https://arxiv.org/pdf/2208.13422.pdf

glenn-jocher commented 2 years ago

@kadirnar BiFPN and PANet are nearly identical; in a P3-P5 output model the only difference is a single shortcut. Versions of all 3 heads are available here: https://github.com/ultralytics/yolov5/tree/master/models/hub

As always, all design decisions are based on empirical results.

divided7 commented 2 years ago

Hello, can we get the results of the ablation experiments, such as mAP results for SPP vs. SPPF and Focus vs. Conv on big datasets?

glenn-jocher commented 2 years ago

@divided-by-7 I'm sorry, we don't have this R&D saved in a presentable manner.

dlod-openvino commented 2 years ago

@WZMIAOMIAO Could you please summarize the YOLOv5 instance segmentation model structure, especially the meaning of the outputs output0 float32[1,25200,117] and output1 float32[1,32,160,160]? Thank you very much in advance!

ishakpacal commented 1 year ago

Dear @glenn-jocher @WZMIAOMIAO, the segmentation part is excellent. What has changed in the model architecture to support it? Could you provide an example model architecture? Thanks in advance.

tayahiyukon commented 1 year ago

Hi! What do k, s, p, and c represent in the model structure, respectively?

XueZ-phd commented 1 year ago

> Hi! What do k, s, p, and c represent in the model structure, respectively?

This is a simple question. k = kernel size, s = stride, p = padding, c = channel dims

tayahiyukon commented 1 year ago

> > Hi! What do k, s, p, and c represent in the model structure, respectively?
>
> This is a simple question. k = kernel size, s = stride, p = padding, c = channel dims

Okay, thank you very much!

karl-gardner commented 1 year ago

Hello @glenn-jocher or anyone who knows the answer. I am trying to understand the build targets process a little more. When you say GTx%1>0.5 and GTy%1>0.5 is the % just the modulus? If it is the modulo operator, then why is this used?

Thanks,

Karl Gardner

scraus commented 1 year ago

@WZMIAOMIAO @glenn-jocher or anyone who knows. I am trying to understand more about the model structure. Is there an article that discusses and explains the YOLOv5 structure? Thanks!

gracesmrngkr commented 1 year ago

Hi @glenn-jocher, can I know the formula by which a 640x640x3 input image becomes 320x320x64 with k=3, s=2, p=1?

glenn-jocher commented 1 year ago

@gracesmrngkr this transformation is governed by the following formula:

$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$

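So in this case, with an input size of 640, a kernel size of 3, a stride of 2, and padding of 1, the output size is floor((640 - 3 + 2) / 2) + 1 = 319 + 1 = 320. The 64 output channels are set by the number of convolution filters, not by this formula.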
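The same arithmetic can be written as a small helper. The function name `conv_output_size` is illustrative; the second call assumes the 6x6 convolution mentioned in the summary uses a stride of 2 and padding of 2, which is not stated in this thread.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# The case from the question: k=3, s=2, p=1 on a 640-pixel side.
print(conv_output_size(640, kernel_size=3, stride=2, padding=1))

# Assumed settings for a 6x6 stem convolution: k=6, s=2, p=2.
print(conv_output_size(640, kernel_size=6, stride=2, padding=2))
```

Integer division implements the floor in the formula, so both settings halve a 640-pixel side to 320.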