ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Overview of the YOLOv5 model structure #280

Closed seekFire closed 4 years ago

seekFire commented 4 years ago

In order to understand the structure of YOLOv5 and implement it in other frameworks, I tried to create an overview, as shown below. If there are any errors, please point them out.

yolov5

github-actions[bot] commented 4 years ago

Hello @seekFire, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook Open In Colab, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

@seekFire yes looks correct!

TaoXieSZ commented 4 years ago

@seekFire That looks pretty and clean. What kind of drawing tool do you use?

seekFire commented 4 years ago

@ChristopherSTAN Just PowerPoint

seekFire commented 4 years ago

@glenn-jocher Thank you for your confirmation!

bretagne-peiqi commented 4 years ago

yolov5

Hello, I also made one; if there is any error, please help me point it out : )

glenn-jocher commented 4 years ago

@bretagne-peiqi yes this looks correct, except that with the v2.0 release the 3 output Conv2d() boxes (red in your diagram) are now inside the Detect() stage:

    (24): Detect(
      (m): ModuleList(
        (0): Conv2d(128, 255, kernel_size=(1, 1), stride=(1, 1))
        (1): Conv2d(256, 255, kernel_size=(1, 1), stride=(1, 1))
        (2): Conv2d(512, 255, kernel_size=(1, 1), stride=(1, 1))
      )
    )

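
As a quick sanity check (an illustrative sketch, not from the original thread): the 255 output channels in each Conv2d() above come from 3 anchors per scale, each predicting 80 COCO classes plus 4 box coordinates and 1 objectness score:

```python
# Each Detect() head predicts, per anchor: 80 class scores + 4 box coords + 1 objectness
num_anchors = 3   # anchors per detection scale
num_classes = 80  # COCO classes
out_channels = num_anchors * (num_classes + 5)
print(out_channels)  # 255, matching the Conv2d() output channels above
```
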
glenn-jocher commented 4 years ago

@bretagne-peiqi ah, also you have an FPN head here, whereas the more recent YOLOv5 models have PANet heads. See https://github.com/ultralytics/yolov5/blob/master/models/yolov5s.yaml

bretagne-peiqi commented 4 years ago

@glenn-jocher many thanks.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gaobaorong commented 4 years ago

Good!

pravastacaraka commented 3 years ago

@seekFire @bretagne-peiqi @glenn-jocher do you guys have an overview diagram for YOLOv5 v4.0?

zhiqwang commented 3 years ago

Hi @pravastacaraka, here is an overview of YOLOv5 v4.0; actually it looks very similar to the previous version. Here is the v3.1 version.

image

I copied this diagram from here; it is written in Chinese.

Copyright statement: This article is the original article of the blogger and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement for reprinting. Link to this article: https://blog.csdn.net/Q1u1NG/article/details/107511465

pravastacaraka commented 3 years ago

@zhiqwang thank you so much for your kind help

ehdrndd commented 3 years ago

Well, I updated the architecture image.

data4pass commented 3 years ago

My apologies if this question is too beginner-level, but I would like to ask: what operation exactly is used to "combine" the three predictions that we get from the detection layers?

glenn-jocher commented 3 years ago

@data4pass all detection heads concatenate together (along dimension 1) into a single output in the YOLOv5 Detect() layer: https://github.com/ultralytics/yolov5/blob/a820b43aca3816c9552e9beaf14a77955742b0ec/models/yolo.py#L73
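
A minimal NumPy sketch of that concatenation (illustrative only; the prediction counts assume a 640x640 input, i.e. 80x80, 40x40 and 20x20 grids with 3 anchors each):

```python
import numpy as np

# Stand-in inference outputs from the three detection scales, shaped (batch, predictions, 85)
p3 = np.zeros((1, 3 * 80 * 80, 85))  # stride 8
p4 = np.zeros((1, 3 * 40 * 40, 85))  # stride 16
p5 = np.zeros((1, 3 * 20 * 20, 85))  # stride 32

# Detect() concatenates along dimension 1 into a single output tensor
out = np.concatenate([p3, p4, p5], axis=1)
print(out.shape)  # (1, 25200, 85)
```
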

data4pass commented 3 years ago

Understood, but don't the three resulting tensors have different shapes? Don't we have to reshape the tensors somehow so that they can be concatenated?

glenn-jocher commented 3 years ago

@data4pass see Detect() layer for reshape ops: https://github.com/ultralytics/yolov5/blob/ba99092304a2ee715b6fb954b437b2d081203794/models/yolo.py#L36
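
Roughly, the reshape goes from the raw (batch, 255, ny, nx) convolution output to (batch, anchors, ny, nx, 85), and for inference each per-scale result is flattened to (batch, -1, 85) so the scales can be concatenated. A NumPy sketch under those assumptions:

```python
import numpy as np

bs, na, no, ny, nx = 1, 3, 85, 20, 20  # batch, anchors, outputs per anchor, grid size
raw = np.zeros((bs, na * no, ny, nx))  # raw Conv2d output: (1, 255, 20, 20)

# Split the channel axis into (anchors, outputs) and move outputs to the last axis
x = raw.reshape(bs, na, no, ny, nx).transpose(0, 1, 3, 4, 2)  # (1, 3, 20, 20, 85)

# Flatten anchors and grid cells together for concatenation with the other scales
flat = x.reshape(bs, -1, no)  # (1, 1200, 85)
print(x.shape, flat.shape)
```
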

Zengyf-CVer commented 2 years ago

Well, I updated the architecture image.

Hello @ehdrndd , what software did you use to make this picture?

yyccR commented 2 years ago

image

The latest structure looks clean and simple

glenn-jocher commented 2 years ago

@yyccR very nice!

zhiqwang commented 2 years ago

@yyccR, it's awesome! The structure of the latest YOLOv5 v6.0 is symmetrical (especially the PAN module), and this visualization demonstrates that; it also works for P6.

myasser63 commented 2 years ago

@glenn-jocher Shouldn't the Concat layer for P4 be between Conv and Conv, not C3, as in the figure?

joangog commented 2 years ago

@glenn-jocher Shouldn't the Concat layer for P4 be between Conv and Conv, not C3, as in the figure?

I also have the same question; could you answer it, please? @glenn-jocher

joangog commented 2 years ago

image

The latest structure looks clean and simple

Why is the third dimension of the outputs 85? @yyccR

yyccR commented 2 years ago

@joangog 85 = 80 (number of categories in my dataset) + 4 (x, y, w, h) + 1 (objectness)

Liqq1 commented 2 years ago

In order to understand the structure of YOLOv5 and implement it in other frameworks, I tried to create an overview, as shown below. If there are any errors, please point them out.

Hi, thanks for the great structure drawing! May I use it in my undergraduate thesis as a theoretical introduction to YOLOv5, please? (I will imitate it and redraw one myself using PowerPoint.)

glenn-jocher commented 2 years ago

@Liqq1 I'm not the original author of the diagram but I would say yes!

seekFire commented 2 years ago

@Liqq1 Absolutely yes! help yourself~^_^

Liqq1 commented 2 years ago

@Liqq1 I'm not the original author of the diagram but I would say yes!

haha~thanks!😁

Liqq1 commented 2 years ago

@Liqq1 Absolutely yes! help yourself~^_^

👏👏😻 thanks~!

wwdok commented 2 years ago

This is my understanding of the YOLOv5s v6.1 model structure: yolov5

zhiqwang commented 2 years ago

Hi @wwdok, it's great! And I think it would be better if the input image channels were adjusted to the RGB order that YOLOv5 currently uses.

wwdok commented 2 years ago

@zhiqwang Do you mean my illustrated image channel order is BGR? 😄 I know it is RGB; it is really a little ambiguous and depends on the reading order.

zhiqwang commented 2 years ago

Hi @wwdok , Got it!

imdadulhaque1 commented 2 years ago

@seekFire Thanks a lot for making a clear diagram of YOLOv5.

jeannot-github commented 2 years ago

Regarding the original structure posted by @seekFire, could somebody clarify why there are 5 BottleneckCSP modules in the PANet part? In the documentation there are only 4. Based on the documentation, it seems to me that the BottleneckCSP module in the bottom-left of the PANet should be removed and the arrow from SPP should be connected to the Conv (1x1) module. Most likely I am wrong, but could somebody clarify why?

glenn-jocher commented 2 years ago

@jeannot-github 👋 Hello! Thanks for asking about YOLOv5 🚀 architecture visualization. We've made visualizing YOLOv5 🚀 architectures easy; there are 3 main ways below. To answer your question though: P5 heads (i.e. YOLOv5s) contain 4 C3 layers and P6 heads (i.e. YOLOv5s6) contain 6 C3 layers. You can compare the model yamls here:

https://github.com/ultralytics/yolov5/blob/master/models/yolov5s.yaml https://github.com/ultralytics/yolov5/blob/master/models/hub/yolov5s6.yaml

model.yaml

Each model has a corresponding yaml file that displays the model architecture. Here is YOLOv5s, defined by yolov5s.yaml: https://github.com/ultralytics/yolov5/blob/1a3ecb8b386115fd22129eaf0760157b161efac7/models/yolov5s.yaml#L12-L48

TensorBoard Graph

Simply start training a model, and then view the TensorBoard Graph for an interactive view of the model architecture. This example shows YOLOv5s viewed in our Notebook.

    # Tensorboard
    %load_ext tensorboard
    %tensorboard --logdir runs/train

    # Train YOLOv5s on COCO128 for 3 epochs
    python train.py --weights yolov5s.pt --epochs 3

Netron viewer

Use https://netron.app to view exported ONNX models:

    python export.py --weights yolov5s.pt --include onnx --simplify

Good luck 🍀 and let us know if you have any other questions!

jeannot-github commented 2 years ago

Hi @glenn-jocher, my question remains the same. In yolov5s.yaml, the backbone ends with an SPPF module, and the head starts with a Conv module, which retrieves its input from the previous module. Thus, I would expect the SPPF output to serve as input for the first Conv module in the head. In these graphs, I observe something different. Also, I expect 4 C3 modules in the head based on yolov5s.yaml, but I observe 5 in these graphs. Could you explain where this difference comes from?

yolov5/models/yolov5s.yaml

glenn-jocher commented 2 years ago

@jeannot-github yes SPPF feeds directly to head first Conv layer.

Graphs above are community contributions and may not be correct (or may be outdated). The 3 sources I mentioned above are the best sources of truth.

jeannot-github commented 2 years ago

@glenn-jocher, in your graph I observe exactly the same behavior: SPPF is followed by a C3 module.

glenn-jocher commented 2 years ago

@jeannot-github sorry, the TensorBoard screenshot is outdated (YOLOv5 has many releases, currently on v6.1). v6.1 TensorBoard is here. You can use the commands above to reproduce and introspect yourself.

jeannot-github commented 2 years ago

Great, thanks a lot for your help and clarification!

joonjeon commented 2 years ago

Hi seekFire!

I am planning to write a paper on the YOLOv5x6 performance in embedded devices, and I am wondering if I could make use of the explanatory diagram as well. Thanks in advance!

sweetygupta17 commented 1 year ago

@glenn-jocher @zhiqwang @yyccR

  1. Can you clarify how Concat works in the PANet between C3 (P4) and the upsampled SPPF output, and what shape it returns? Does the concatenation add channels, or is it element-wise addition?

  2. How does C3 work after the 3rd Concat when the input and output channels are different?

Symbadian commented 1 year ago

image

The latest structure looks clean and simple

Why is the third dimension of the outputs 85? @yyccR

Hi @joangog @glenn-jocher, thank you for your image; however, I was hoping to understand some of the details in it, especially the first layer and the output layer. Can you please assist me here? Thanks in advance.

I tried researching this on Google and in the relevant communities; I am sure I missed the explanation somewhere. Please help me understand; I am unable to grasp these concepts.

  1. I have an image of size 416 x 416 x 3 channels and I am trying to follow your details to draw my own diagram for my report. I am almost there; however, I am missing an understanding of the image dimensionality after each layer.

  2. For example, in the first layer you have an input of 3 channels that outputs 64 feature maps? Can you elaborate a little further here, please? I am not sure what K = 6 and S = 2 mean. And C3: is this a different type of convolution?

  3. Additionally, what do the 1/2, 1/4, 1/8, 1/16, 1/32 mean? Is that the image size as in your diagram, and if yes, what would it be for my input size?

  4. In the output I saw the explanation of the 85 reflecting the number of class categories within the dataset; does the 3 reflect the RGB channels?

Can you guide me through 1, 2, 3 & 4, please? Your assistance would give me the opportunity to move forward.

Thanks a lot in advance!


QiangYanHuang commented 1 year ago

① K: kernel_size; S: stride. ② 1/2 means that after passing through this layer, the output feature map is 1/2 the size of the original image; "1/4, 1/8, 1/16" have the same interpretation. s=2 makes the feature map 1/2 the size of the previous layer. You can search for "upsampling" and "downsampling" to understand this. ③ 85: 80 classes + 1 confidence + 4 box regression values in the COCO dataset. @Symbadian
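
To make point ② concrete for a 416x416 input (an illustrative sketch, assuming five stride-2 stages as in the standard backbone):

```python
# Each stride-2 convolution halves the spatial size of the feature map
size = 416
sizes = []
for _ in range(5):  # five stride-2 stages -> 1/2, 1/4, 1/8, 1/16, 1/32
    size //= 2
    sizes.append(size)
print(sizes)  # [208, 104, 52, 26, 13]
```
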

Symbadian commented 1 year ago

① K: kernel_size; S: stride. ② 1/2 means that after passing through this layer, the output feature map is 1/2 the size of the original image; "1/4, 1/8, 1/16" have the same interpretation. s=2 makes the feature map 1/2 the size of the previous layer. You can search for "upsampling" and "downsampling" to understand this. ③ 85: 80 classes + 1 confidence + 4 box regression values in the COCO dataset. @Symbadian

thank you for your swift reply, I will make the necessary adjustments. thanx once more

Symbadian commented 1 year ago

① K: kernel_size; S: stride. ② 1/2 means that after passing through this layer, the output feature map is 1/2 the size of the original image; "1/4, 1/8, 1/16" have the same interpretation. s=2 makes the feature map 1/2 the size of the previous layer. You can search for "upsampling" and "downsampling" to understand this. ③ 85: 80 classes + 1 confidence + 4 box regression values in the COCO dataset. @Symbadian

Hey @QiangYanHuang, forgive my meticulousness. Can you guide me on what (n = block) refers to, please? Is this the size of the output block?

Thanks in advance for your input! I really appreciate it!