Open jl749 opened 2 years ago
After running k-means clustering on the ground-truth labels, it turns out that most bounding boxes fall into a few characteristic height-width ratios. So instead of predicting a bounding box directly, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with particular height-width ratios; this predetermined set of boxes is the anchor boxes. Rather than predicting bounding boxes from scratch, the network only adjusts the w, h of the clustered anchor boxes, which is easier and faster (see darknet/issues/568).
tx, ty, tw, th = logits (raw output of the NN)
Pw, Ph = width, height of the pre-clustered anchor boxes
A sigmoid is applied to tx, ty to keep them between 0 and 1
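The decoding step can be sketched as below. This is a minimal sketch assuming the box parameterization from the YOLOv2/v3 papers; `cx, cy` (the top-left offset of the grid cell, in grid units) and the function name `decode_box` are not in the notes above and are introduced here for illustration.

```python
import math

def sigmoid(x):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw logits into a box, in grid-cell units.

    cx, cy: offset of the grid cell (assumed name, not from the notes)
    pw, ph: width and height of the pre-clustered anchor box
    """
    bx = sigmoid(tx) + cx      # centre x stays inside the cell
    by = sigmoid(ty) + cy      # centre y stays inside the cell
    bw = pw * math.exp(tw)     # anchor width is scaled, never negative
    bh = ph * math.exp(th)     # anchor height is scaled, never negative
    return bx, by, bw, bh
```

With all logits at zero the prediction is simply the anchor box centred in its cell, which is why adjusting anchors is an easier starting point than regressing a box from scratch.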
FPN is not an object detector by itself. It is a feature extractor that works with an object detector.
GOAL: feature integration
FPN is a general purpose architecture which is independent of the backbone
FPN consists of three components: a bottom-up pathway, a top-down pathway, and lateral connections
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. An FPN can detect objects of different sizes (small, medium, large) and is less memory intensive than running the network on an image pyramid.
A CNN computes a feature hierarchy layer by layer. However, there are large semantic gaps between layers.
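One way FPN narrows that gap is by merging a coarse, semantically strong map into a finer one. The snippet below is a rough NumPy sketch of a single merge step, assuming the usual formulation (2x nearest-neighbour upsampling plus a 1x1-projected lateral feature, combined by addition). The function names and shapes are illustrative only and do not come from any particular library.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    """1x1 convolution: mixes channels per pixel. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def top_down_merge(coarse, fine, w_lateral):
    """Combine a coarse map with the next finer backbone map.

    coarse:    (C, H, W)        already at the pyramid channel width
    fine:      (C_in, 2H, 2W)   backbone feature at twice the resolution
    w_lateral: (C, C_in)        weights of the lateral 1x1 convolution
    """
    return upsample2x(coarse) + conv1x1(fine, w_lateral)
```

For example, a 13x13 map merged with a 26x26 backbone map gives a 26x26 map that carries both fine spatial detail and higher-level semantics.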
YOLOv3 makes predictions at 3 different scales (13x13, 26x26, 52x52 for a 416x416 input).
The detection layer is used to make predictions on feature maps of three different sizes, with strides 32, 16, 8
In total it predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes
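The count can be checked with a few lines of Python. The function name `num_predictions` is made up for this check; the input size, strides, and 3 anchors per cell are the values stated above.

```python
def num_predictions(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Total boxes predicted over all scales for a square input."""
    # Each stride s gives a grid of (input_size // s) x (input_size // s) cells.
    cells = sum((input_size // s) ** 2 for s in strides)
    return cells * anchors_per_cell

print(num_predictions())  # 10647
```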
Detection is done by applying a 1x1 kernel to the feature maps
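A 1x1 convolution is just a per-cell linear map over channels, so the detection step can be sketched in NumPy as below. This is only an illustration: it assumes each anchor gets 5 + num_classes outputs (4 box logits, 1 objectness, class scores), which gives 3 x 85 = 255 channels for 80 classes, and it assumes an anchor-major channel layout. `detection_head` is a hypothetical name.

```python
import numpy as np

def detection_head(features, weights, bias, num_anchors=3):
    """Apply a 1x1 conv and split the channels per anchor.

    features: (C_in, H, W) feature map at one scale
    weights:  (num_anchors * (5 + num_classes), C_in)
    bias:     (num_anchors * (5 + num_classes),)
    returns:  (num_anchors, 5 + num_classes, H, W)
    """
    c_out = weights.shape[0]
    # Same weights applied independently at every grid cell.
    out = np.einsum("oc,chw->ohw", weights, features) + bias[:, None, None]
    _, h, w = features.shape
    # Assumed layout: all outputs of anchor 0, then anchor 1, and so on.
    return out.reshape(num_anchors, c_out // num_anchors, h, w)
```

Because the kernel is 1x1, the prediction at a cell depends only on the feature vector at that cell; the spatial size of the output equals that of the input feature map.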
YOLOv3 uses independent logistic classifiers in place of the softmax function to determine the class of each predicted box. For the class predictions it also replaces the mean squared error with the binary cross-entropy loss. In simpler terms, the objectness score and the class predictions are both produced by logistic regression.
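The difference between the two choices can be seen in a small NumPy sketch. It is not the darknet implementation; the helper names and the example logits are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    """Independent logistic score per class, each in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Scores that compete with each other and sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean BCE over classes; targets may contain several 1s."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# Hypothetical class logits for one box, e.g. ("person", "woman", "car").
logits = np.array([4.0, 4.0, -4.0])
print(sigmoid(logits))   # both of the first two classes score high
print(softmax(logits))   # the first two classes must share the probability mass
```

With independent sigmoids a box can carry overlapping labels at the same time, which softmax rules out because its outputs must sum to 1.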