jl749 / YOLOv3

yolov3 implementation in pytorch (https://arxiv.org/pdf/1804.02767.pdf)

YOLOv3 architecture #1

Open jl749 opened 2 years ago

jl749 commented 2 years ago


YOLOv3 makes predictions at 3 different scales (13x13, 26x26 and 52x52 for a 416x416 input).

The detection layers make predictions on feature maps of three different sizes, with strides of 32, 16 and 8 relative to the input.

416/32 = 13
416/16 = 26
416/8 = 52

In total it predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes, since every grid cell predicts 3 boxes (one per anchor).
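The arithmetic above can be checked with a few lines of Python. This is only a sketch: the names `grid_sizes` and `total_boxes` are made up for illustration and are not taken from this repository.

```python
# Sketch: grid sizes and total box count for a square input image.
INPUT_SIZE = 416        # input resolution used in the notes above
STRIDES = (32, 16, 8)   # strides of the three detection scales
BOXES_PER_CELL = 3      # each grid cell predicts 3 boxes (one per anchor)


def grid_sizes(input_size, strides=STRIDES):
    """Return the side length of the feature map at each stride."""
    return [input_size // s for s in strides]


def total_boxes(input_size, strides=STRIDES, boxes_per_cell=BOXES_PER_CELL):
    """Count every box predicted across all scales."""
    cells = sum(g * g for g in grid_sizes(input_size, strides))
    return cells * boxes_per_cell


print(grid_sizes(INPUT_SIZE))   # [13, 26, 52]
print(total_boxes(INPUT_SIZE))  # 10647
```

The same functions work for any input size that is a multiple of 32, for example 608.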

Detection is done by applying a 1x1 convolution kernel to each of these feature maps.
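A 1x1 convolution is just a per-cell linear map over the channels, so the grid size stays the same and only the depth changes. The pure-Python sketch below illustrates this on nested lists; `conv1x1` and `detection_channels` are hypothetical helper names, not functions from this repository. In YOLOv3 the output depth is 3 x (5 + number of classes): 4 box coordinates, 1 objectness score and the class scores for each of the 3 boxes.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution.

    feature_map: nested list shaped [C_in][H][W]
    weights:     nested list shaped [C_out][C_in]
    bias:        list of length C_out
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        # Each output value only looks at the same (y, x) cell in every input channel.
        channel = [
            [
                b + sum(w_row[c] * feature_map[c][y][x] for c in range(c_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        out.append(channel)
    return out


def detection_channels(num_classes, boxes_per_cell=3):
    """Output depth of a detection layer: per box, 4 coords + 1 objectness + classes."""
    return boxes_per_cell * (5 + num_classes)


# Tiny example: 2 input channels on a 2x2 grid, mapped to 1 output channel.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[10.0, 20.0], [30.0, 40.0]]]
out = conv1x1(fmap, weights=[[1.0, 0.5]], bias=[0.0])
print(out)                     # [[[6.0, 12.0], [18.0, 24.0]]]
print(detection_channels(80))  # 255 for the 80 COCO classes
```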

YOLOv3 uses independent logistic classifiers in place of the softmax function to predict the classes of each bounding box, so one box can carry several labels at once. For the class predictions it also replaces the mean squared error with the binary cross-entropy loss. In simpler terms, both the objectness score (the probability that a box contains an object) and the class predictions are made using logistic regression.
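To make the difference concrete, here is a small stdlib-only sketch comparing a softmax with independent sigmoids on the same logits, plus the binary cross-entropy summed over classes. The logits and targets are invented for illustration, and none of these helpers come from this repository.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(prob, target, eps=1e-7):
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


class_logits = [2.0, 1.5, -3.0]  # hypothetical raw scores for 3 classes
targets = [1.0, 1.0, 0.0]        # multi-label target: the first two classes both apply

competing = softmax(class_logits)                    # scores are forced to sum to 1
independent = [sigmoid(v) for v in class_logits]     # each class is scored on its own
loss = sum(binary_cross_entropy(p, t) for p, t in zip(independent, targets))

print([round(p, 3) for p in competing])
print([round(p, 3) for p in independent])
print(round(loss, 3))
```

With the softmax the two true classes have to share the probability mass, while with sigmoids both can be close to 1, which is what overlapping labels such as "person" and "woman" need.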


jl749 commented 2 years ago

What are the anchor boxes in YOLOv3?

After running k-means clustering on the ground-truth labels, it turns out that most bounding boxes have certain height-width ratios. So instead of directly predicting a bounding box, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with particular height-width ratios; these predetermined boxes are the anchor boxes. Rather than predicting bounding boxes from scratch, the network adjusts the w and h of the clustered anchor boxes, which is easier and faster. -darknet/issues/568-

tx, ty, tw, th = logits (raw output of the NN)

Pw, Ph = width, height of the pre-clustered anchor boxes

A sigmoid is applied to tx, ty to squash them into the range 0 to 1, so the predicted box center stays inside the grid cell that predicts it.

https://github.com/jl749/YOLOv3/blob/bc36f282c841d6c23940c1bb32d09f88d7c6aa86/yolov3/utils/functions.py#L112-L117
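The linked lines are not reproduced here. As a rough sketch of the decoding described in the YOLOv3 paper (bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = Pw * exp(tw), bh = Ph * exp(th)), the following stdlib-only function converts raw outputs into a box in pixels. The name `decode_box`, the convention that anchors are given in pixels, and the multiplication by the stride are assumptions made for this example and may differ from functions.py.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Turn raw outputs into (center x, center y, width, height) in pixels.

    cx, cy: column and row index of the grid cell making the prediction
    pw, ph: anchor width and height, assumed here to be in pixels
    stride: pixels per grid cell (32, 16 or 8)
    """
    bx = (sigmoid(tx) + cx) * stride  # sigmoid keeps the center inside cell (cx, cy)
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)            # exp rescales the anchor and is always positive
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero logits give the center of the cell and the unchanged anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=6, pw=116, ph=90, stride=32))
# (208.0, 208.0, 116.0, 90.0)
```

Because the network only nudges an anchor that already has a plausible shape, tw and th can stay close to zero for typical objects.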

jl749 commented 2 years ago

torch.nn.Upsample()
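YOLOv3 upsamples a coarse feature map by 2x before concatenating it with an earlier, finer one. The pure-Python sketch below imitates what `torch.nn.Upsample(scale_factor=2, mode="nearest")` does to a single channel; `upsample_nearest` is a hypothetical helper, and the choice of nearest-neighbour mode (the PyTorch default) is an assumption about this repository.

```python
def upsample_nearest(feature_map, scale_factor=2):
    """Nearest-neighbour upsampling of one channel given as a nested list [H][W]."""
    out = []
    for row in feature_map:
        # Repeat every value horizontally, then repeat the widened row vertically.
        wide_row = [value for value in row for _ in range(scale_factor)]
        out.extend(list(wide_row) for _ in range(scale_factor))
    return out


small = [[1, 2],
         [3, 4]]
print(upsample_nearest(small))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Applied to a 13x13 map this gives 26x26, which matches the next detection scale.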


jl749 commented 2 years ago

YOLOv3 vs v4

jl749 commented 2 years ago

FPN (Feature Pyramid Network)

FPN is not an object detector by itself. It is a feature extractor that works together with an object detector.

GOAL: feature integration

FPN is a general-purpose architecture that is independent of the backbone network.

FPN consists of three components: a bottom-up pathway, a top-down pathway, and lateral connections.

image

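Feature pyramids are a basic component in recognition systems for detecting objects at different scales. An FPN can detect objects of different sizes (small, medium, large) and is less memory-intensive than computing features on an image pyramid.

Problem with the conventional approach (b)

A CNN computes a feature hierarchy layer by layer. However, there are large semantic gaps between layers: the shallow, high-resolution feature maps carry weaker semantic features than the deep, low-resolution ones.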

Solution using FPN