jl749 commented 2 years ago

YOLOv3 makes predictions at 3 different scales (13x13, 26x26, 52x52 for a 416x416 input).

The detection layer makes predictions on feature maps of three different sizes, with strides 32, 16 and 8:

416/32 = 13
416/16 = 26
416/8 = 52

In total it predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes (3 boxes per grid cell at each scale).
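The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the strides and the 3 boxes per cell are taken from the text above.

```python
# Grid sizes and total box count for a square input.
# Strides (32, 16, 8) and 3 boxes per cell follow the notes above.

def grid_sizes(input_size, strides=(32, 16, 8)):
    """Return the side length of the feature map at each stride."""
    if any(input_size % s for s in strides):
        raise ValueError("input_size must be divisible by every stride")
    return [input_size // s for s in strides]


def total_boxes(input_size, anchors_per_cell=3):
    """Count the boxes predicted across all scales."""
    return sum(g * g for g in grid_sizes(input_size)) * anchors_per_cell


print(grid_sizes(416))   # [13, 26, 52]
print(total_boxes(416))  # 10647
```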

Detection is done by applying a 1x1 kernel to the feature maps.
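A 1x1 kernel is just the same linear map applied to the channel vector of every grid cell, so the grid size stays the same and only the number of channels changes. The pure-Python sketch below illustrates this on a toy grid. `conv1x1` and `head_channels` are hypothetical helper names, and the channel formula assumes the usual layout of 4 box offsets + 1 objectness score + class scores per box.

```python
def conv1x1(feature_map, weights, bias):
    """feature_map: [H][W][C_in]; weights: [C_out][C_in]; bias: [C_out]."""
    return [
        [
            [sum(w * x for w, x in zip(row_w, cell)) + b
             for row_w, b in zip(weights, bias)]
            for cell in row
        ]
        for row in feature_map
    ]


def head_channels(num_classes, anchors_per_cell=3):
    """Per box: 4 offsets + 1 objectness score + one score per class."""
    return anchors_per_cell * (5 + num_classes)


# Toy example: 2x2 grid, 2 input channels, 3 output channels.
fmap = [[[1.0, 2.0], [0.0, 1.0]],
        [[3.0, 0.0], [1.0, 1.0]]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
out = conv1x1(fmap, w, b)
print(out[0][0])          # [1.0, 2.0, 3.5]
print(head_channels(80))  # 255 for the 80 COCO classes
```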

YOLOv3 uses independent logistic classifiers in place of the softmax function to determine the class of each predicted box. For the class predictions it also replaces the mean squared error with the binary cross-entropy loss. In simpler terms, both the objectness score (the probability that a box contains an object) and the class predictions are made using logistic regression.
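To see why independent logistic classifiers differ from softmax, here is a small sketch using only the standard library. The logits and the class names in the comment are made up. The point is that sigmoid scores do not compete with each other, so a box can carry several labels at once.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over classes, each class treated as its own yes/no problem."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


logits = [2.0, 1.5, -3.0]   # hypothetical scores for "person", "woman", "car"
targets = [1, 1, 0]         # multi-label target: two classes are both correct

independent = [sigmoid(v) for v in logits]
coupled = softmax(logits)
print(coupled)      # softmax forces the classes to share a total probability of 1
print(independent)  # sigmoid scores are independent and may sum to more than 1
print(binary_cross_entropy(independent, targets))
```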

What are the anchor boxes in YOLOv3?

Clustering the ground-truth labels with k-means shows that most bounding boxes fall into a few typical height-width ratios. So instead of predicting a bounding box directly, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with those height-width ratios; this predetermined set of boxes is the anchor boxes. Rather than predicting bounding boxes from scratch, the network adjusts the w, h of the clustered anchor boxes, which is easier and faster (see darknet/issues/568).
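A rough sketch of how such a clustering could look, assuming the distance d = 1 - IoU between box sizes that YOLOv2 describes. The box sizes are made up, the deterministic initialisation is chosen only to keep the example reproducible, and using the cluster mean as the new centroid is one common choice among several. This is not the darknet implementation.

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (w, h), both placed at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (w, h) pairs with distance 1 - IoU and return k anchors."""
    # Deterministic start: pick boxes spread evenly over the area-sorted list.
    by_area = sorted(boxes, key=lambda b: b[0] * b[1])
    n = len(by_area)
    centroids = [by_area[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:              # keep an empty cluster's centroid as-is
                new_centroids.append(centroids[i])
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Made-up ground-truth box sizes in pixels: small, wide, and tall objects.
boxes = [(10, 12), (12, 10), (11, 11),
         (60, 30), (64, 28), (58, 33),
         (30, 90), (28, 95), (33, 88)]
for w, h in kmeans_anchors(boxes, k=3):
    print(round(w, 1), round(h, 1))
# 11.0 11.0
# 60.7 30.3
# 30.3 91.0
```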

tx, ty, tw, th = logits (raw output of the NN)
Pw, Ph = width, height of the pre-clustered anchor boxes

A sigmoid is applied to tx, ty to put them between 0 and 1.

https://github.com/jl749/YOLOv3/blob/bc36f282c841d6c23940c1bb32d09f88d7c6aa86/yolov3/utils/functions.py#L112-L117
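For reference, a minimal sketch of the decoding step using the formulas from the YOLOv3 paper (bx = sigmoid(tx) + cx, bw = Pw * exp(tw), and likewise for y and h). It is not a copy of the linked functions.py. The name `decode_box`, the pixel-unit anchors, and the multiplication by the stride are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Turn raw outputs into (centre x, centre y, width, height) in pixels.

    cx, cy: column and row index of the grid cell that made the prediction.
    pw, ph: anchor width and height in pixels.
    """
    bx = (sigmoid(tx) + cx) * stride   # sigmoid keeps the centre inside its cell
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)             # anchor size scaled by exp(offset)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Zero logits put the centre in the middle of the cell and keep the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=116, ph=90, stride=32))
# (208.0, 208.0, 116.0, 90.0)
```

Because the sigmoid output lies between 0 and 1, the predicted centre can never leave the grid cell responsible for it.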

jl749 commented 2 years ago

torch.nn.Upsample()
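torch.nn.Upsample() defaults to mode='nearest'. With scale_factor=2 it repeats every value twice along the height and width of an (N, C, H, W) tensor, which is how the coarser 13x13 map can be brought up to 26x26 before it is combined with a finer one. The sketch below reproduces that behaviour in plain Python on a single 2D grid, so it runs without PyTorch. `upsample_nearest` is a made-up name.

```python
def upsample_nearest(feature_map, scale=2):
    """Repeat every value `scale` times along both axes of a 2D grid."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(scale)]   # stretch horizontally
        out.extend(list(wide) for _ in range(scale))    # stretch vertically
    return out


small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```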

jl749 commented 2 years ago

YOLOv3 vs v4

jl749 commented 2 years ago

FPN (Feature Pyramid Network)

FPN is not an object detector by itself. It is a feature extractor that works together with an object detector.

GOAL: feature integration

FPN is a general purpose architecture which is independent of the backbone

FPN consists of three components

Bottom-up
Top-down
Lateral connection

YOLOv1 uses (b)
YOLOv3, v4 and v5 use (d)

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. An FPN can detect objects of different sizes (small, medium, big) and is less memory-intensive than computing features separately on every level of an image pyramid.

Problem with the conventional way (b)

A CNN computes a feature hierarchy layer by layer. However, there are large semantic gaps between the layers.

solution using FPN

FPN naturally leverages the pyramidal shape of a ConvNet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales
It combines semantically strong (low-resolution) features with semantically weak (high-resolution) features via a top-down pathway and lateral connections (d)
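The top-down pathway with lateral connections can be sketched on toy single-channel grids. This is a simplification under stated assumptions: the FPN paper merges by element-wise addition after a 1x1 convolution on the lateral branch, and the convolutions are left out here. YOLOv3 concatenates the upsampled map with the finer one instead of adding. All function names are invented for the example.

```python
def upsample_nearest(fmap, scale=2):
    """Repeat every value `scale` times along both axes of a 2D grid."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def merge(top, lateral):
    """Add the upsampled coarse map to the finer map element by element."""
    up = upsample_nearest(top)
    return [[a + b for a, b in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down(bottom_up_maps):
    """bottom_up_maps: finest first, each level half the size of the previous.

    Returns the merged maps in the same order.
    """
    merged = [bottom_up_maps[-1]]                 # coarsest level starts the pathway
    for lateral in reversed(bottom_up_maps[:-1]):
        merged.append(merge(merged[-1], lateral))
    return merged[::-1]


c_fine = [[1] * 4 for _ in range(4)]    # 4x4, semantically weak
c_mid = [[10] * 2 for _ in range(2)]    # 2x2
c_coarse = [[100]]                      # 1x1, semantically strong
p_fine, p_mid, p_coarse = top_down([c_fine, c_mid, c_coarse])
print(p_mid)      # [[110, 110], [110, 110]]
print(p_fine[0])  # [111, 111, 111, 111]
```

Every output level now carries information from the coarsest map, which is the sense in which the pyramid has strong semantics at all scales.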

jl749 / YOLOv3

YOLOv3 architecture #1
