One of the fastest; the other option is SSD (Single Shot MultiBox Detector)
TODO been put in its own page
R-CNN (R is for "region")
Fast R-CNN
Jargon and basic concepts
bb: bounding box
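anchors: reference boxes; predictions are expressed as offsets to them > TODO: expand
IoU: intersection over union (overlap area of two boxes divided by the area of their union)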
feature map
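A minimal sketch of IoU for two axis-aligned boxes, assuming the corner format (x_min, y_min, x_max, y_max). The function name `iou` and the box format are illustrative choices, not taken from any of the references.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 4 + 4 - 1 = 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means identical boxes, 0 means no overlap at all.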
this is all from 1
Obj detection: a usual CNN ending in a dense layer won't work, as the length of the output is not known: the number of objs to detect is not fixed. Naive approach with CNNs: split the img into regions and use a CNN to classify each of them; but objects may overlap, and have different aspect ratios and locations. This means a huge number of regions to process, which is computationally expensive.
So instead of plain per-region classification you use dedicated methods like YOLO, SSD, R-CNN.
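To get a feel for why the naive approach is expensive, here is a rough count of sliding-window regions. The image size, window sizes, aspect ratios and stride are made-up illustrative values, not taken from any paper.

```python
def count_windows(img_w, img_h, sizes, aspect_ratios, stride):
    """Count sliding windows that fit in the image for every size / aspect-ratio pair."""
    total = 0
    for size in sizes:
        for ratio in aspect_ratios:
            # Keep the window area close to size**2 while changing its shape (ratio = w / h)
            win_w = int(round(size * ratio ** 0.5))
            win_h = int(round(size / ratio ** 0.5))
            if win_w > img_w or win_h > img_h:
                continue  # this window does not fit in the image
            n_x = (img_w - win_w) // stride + 1
            n_y = (img_h - win_h) // stride + 1
            total += n_x * n_y
    return total


n = count_windows(640, 480, sizes=[32, 64, 128, 256],
                  aspect_ratios=[0.5, 1.0, 2.0], stride=8)
print(n)  # each of these windows would need its own CNN forward pass
```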
the pre-history: Overfeat
though I think it's roughly contemporary with R-CNN
R-CNN
Use selective search to extract 2000 regions (region proposals): generate many small candidate regions initially, then greedily combine them into larger ones
TODO have to read original paper
these 2000 regions are resized (warped) to a fixed-size square and fed into a CNN (used as feat extractor) to get a 4096-dimensional feature vector
This vector is passed to an SVM classifier to score the presence of the object in that candidate region; 4 offset values for the bb are also predicted (these help when an object has been detected but is only half inside the bb)
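A small sketch of how 4 predicted offsets can refine a proposed box. It assumes the centre/size parametrisation described in the R-CNN paper (shifts relative to the box size, log-scale factors for width and height); the function name `apply_offsets` is hypothetical.

```python
import math


def apply_offsets(box, offsets):
    """Refine a (cx, cy, w, h) box with predicted (dx, dy, dw, dh) offsets."""
    cx, cy, w, h = box
    dx, dy, dw, dh = offsets
    new_cx = cx + w * dx      # shift the centre by a fraction of the box width
    new_cy = cy + h * dy      # shift the centre by a fraction of the box height
    new_w = w * math.exp(dw)  # rescale the width (exp keeps it positive)
    new_h = h * math.exp(dh)  # rescale the height
    return new_cx, new_cy, new_w, new_h


# Proposal covers only the left half of an object: move right and widen
print(apply_offsets((50.0, 50.0, 20.0, 40.0), (0.5, 0.0, math.log(2.0), 0.0)))
```

All-zero offsets leave the proposal unchanged, which is a convenient starting point for the regressor.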
Drawbacks
slow: the CNN has to be run on each of the 2000 region proposals, for every image
selective search is a fixed algorithm with no learning, so there is no control over the quality of the region proposals
Fast R-CNN
Same author as R-CNN, meant to improve on speed
Instead of feeding regions to a CNN, the whole input image is fed to the CNN to generate a convolutional feature map
Region proposals still come from selective search, and are mapped onto the conv feat map
RoI pooling layer reshapes each region into a fixed size so it can be fed into a dense layer
From the resulting feature vector, a softmax layer predicts the class and a parallel regression output predicts the bb offset values
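A minimal sketch of the RoI pooling idea on a single-channel feature map stored as nested lists: whatever the size of the region, the output is always an out_size x out_size grid. The region format (row_start, col_start, row_end, col_end), with end exclusive, and the function name `roi_max_pool` are assumptions for illustration; it also assumes the region is at least out_size cells on each side.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a 2-D feature map into an out_size x out_size grid."""
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Row range of bin i (integer edges, bins of roughly equal size)
        rs = r0 + (i * height) // out_size
        re = r0 + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            cs = c0 + (j * width) // out_size
            ce = c0 + ((j + 1) * width) // out_size
            # Keep only the strongest activation in this bin
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fmap = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6x6 feature map
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # 4x4 region -> 2x2 output
print(roi_max_pool(fmap, (1, 2, 6, 5)))  # 5x3 region -> still 2x2 output
```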
What it solves, and remaining drawbacks
No need to feed 2000 regions to the CNN: convolution is done only once per image
Still uses selective search, which is slow and fixed; this makes it unusable for real-time applications
Faster R-CNN
By other people (though the R-CNN author is a co-author)
Same as fast R-CNN, passes full image to CNN for conv feat map
Does not use selective search to get regions; uses a region proposal network (RPN) to predict them from the conv feat map
Predicted regions are reshaped via a RoI pooling layer
The RoI-pooled features are then used to classify the image within each proposed region and predict offset values for the bbs
Solves
Not using selective search makes it fast enough for real-time applications
this from 9
SSD
Single-shot Multibox detector.
Nov 2016
record-fast when it came out
Single-shot: one object localisation and classification pass
Multibox (see below) is the name of the method for bounding box regression, by some of the same authors
architecture is built on VGG-16 but removes the dense layers, replacing them with conv layers (to extract feats at multiple scales)
The VGG-16 phase is the most time-consuming one
in SSD, every feat map cell is associated with a set of default bounding boxes of different dimensions and aspect ratios, unlike multibox >TODO ???
priors are manually chosen, without the pre-training phase for priors
Uses smooth L1-Norm as location loss
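A sketch of how default boxes could be laid out over one square feature map: one box per aspect ratio, centred on every cell. The scale, ratios and feature-map size below are illustrative values, and `default_boxes` is a hypothetical helper, not SSD's actual code.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios):
    """Default boxes (cx, cy, w, h), in [0, 1] image coordinates, for a square feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Centre of the cell, normalised by the feature-map size
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ratio in aspect_ratios:
                # Same area for every ratio, different shape (ratio = w / h)
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 4 * 4 cells * 3 ratios = 48 boxes
```

Running this for several feature maps of decreasing size, with increasing scale, gives boxes that cover small and large objects alike.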
Multibox
fast method for candidate bb
uses inception-like CNN
Uses categorical cross-entropy loss for confidence, i.e. how sure the network is that a detected box contains an object
Plus uses L2-Norm for the localisation loss, i.e. how far the detected boxes are from the ground truth ones
the two losses are combined
it starts from priors (aka anchors) and uses the IoU metric to select the boxes that overlap enough with the ground truth
note that Multibox does not do object classification
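A sketch of how the two Multibox losses could be combined for a single predicted box. Since Multibox does not classify, confidence is treated here as one object / no-object probability; the weighting factor `alpha` and all function names are assumptions for illustration.

```python
import math


def confidence_loss(p_object, is_object):
    """Cross-entropy for one box; p_object is the predicted probability of containing an object."""
    eps = 1e-12  # avoids log(0)
    p = min(max(p_object, eps), 1.0 - eps)
    return -math.log(p) if is_object else -math.log(1.0 - p)


def l2_location_loss(pred_box, gt_box):
    """Half the squared L2 distance between predicted and ground-truth box coordinates."""
    return 0.5 * sum((p - g) ** 2 for p, g in zip(pred_box, gt_box))


def combined_loss(p_object, pred_box, gt_box, alpha=1.0):
    """Confidence loss plus alpha times location loss.

    gt_box is None when the box is not matched to any ground-truth object,
    in which case only the confidence term is used.
    """
    is_object = gt_box is not None
    loss = confidence_loss(p_object, is_object)
    if is_object:
        loss += alpha * l2_location_loss(pred_box, gt_box)
    return loss


print(combined_loss(0.9, (0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.6)))  # matched box
print(combined_loss(0.9, (0.1, 0.1, 0.5, 0.5), None))                  # unmatched box
```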
TODO the fact that SSD and Multibox are two different algorithms isn't clear? Also if they're different, how is it that SSD includes Multibox in the name?
Hard negative mining
Most detections at training time won't be good (low IoU), and are treated as negative training samples. They're needed to teach the model what a bad detection is, but there are far too many of them, so the ratio of negatives to positives is capped, set at 3:1.
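A sketch of picking negatives at a 3:1 ratio. Keeping the negatives with the highest loss (the "hardest" ones) is the usual choice, but treat it as an assumption here, since the note above only gives the ratio; `hard_negative_mining` is a hypothetical helper.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return indices of samples to keep: all positives plus the hardest negatives."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones the model gets most wrong (highest loss)
    negatives.sort(key=lambda i: losses[i], reverse=True)
    n_keep = neg_pos_ratio * len(positives)
    return sorted(positives + negatives[:n_keep])


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.9, 0.3]
is_positive = [True, False, False, False, False, False, False]
# 1 positive, so at most 3 negatives are kept: those with loss 2.0, 1.5 and 0.9
print(hard_negative_mining(losses, is_positive))
```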
Non-maximum suppression
technique to prune the many boxes generated at inference time, to reduce noise
boxes with confidence score less than x are dropped; of the rest, a box is pruned if its IoU with a higher-scoring box is greater than y
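A minimal sketch of greedy non-maximum suppression, assuming boxes in (x_min, y_min, x_max, y_max) format. The two thresholds play the role of x and y above; their default values are only examples, and the helper names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_thresh=0.01, iou_thresh=0.45):
    """Return indices of the kept boxes, highest score first."""
    # 1. Drop low-confidence boxes
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # 2. Visit the remaining boxes from most to least confident
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with an already kept one
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.005]
# Box 1 overlaps box 0 too much, box 3 is below the score threshold
print(non_max_suppression(boxes, scores))
```

In a multi-class detector this would typically be run once per class.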
TODOs
YOLO
TODO see what to put here and how to separate in the other notebook
From the notebook I had on this:
Object detection
Algorithms
TODOs
YOLO
R-FCN
YOLO not implemented in Tensorflow
References
References for YOLO