Open Arkkienkeli opened 6 years ago
If you look at how a CNN works, it tries to capture hierarchical features of images, but it cannot tell you exactly what those features are. E.g. the 1st layer of a CNN usually captures edges.
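To make the "1st layer captures edges" point concrete, here is a minimal NumPy sketch: a hand-made vertical-edge filter (similar to what a first conv layer often ends up learning, though a real CNN learns its filters from data) applied to a toy image with a single vertical edge. The image, filter, and `conv2d_valid` helper are all illustrative, not part of Mask R-CNN.

```python
import numpy as np

# A 5x5 toy image: left columns dark (0), right columns bright (1),
# so there is a vertical edge between columns 2 and 3.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

# A hand-made vertical-edge filter, similar to filters the 1st conv
# layer of a CNN often learns (a real layer learns these from data).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

def conv2d_valid(x, k):
    """Plain 'valid' 2D cross-correlation, like a conv layer without padding."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

response = conv2d_valid(img, kernel)
print(response)  # strongest response where the window straddles the edge
```

The response is zero over the flat dark region and peaks where the filter window straddles the edge, which is exactly the kind of low-level feature the early layers pick up.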
As for the background question: the mask branch is designed to learn a binary mask for each RoI, so it really depends on how positive pixels are labeled in the training procedure.
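A small NumPy sketch of what "a binary mask for each RoI" means in practice: per-pixel logits are squashed with a sigmoid and thresholded at 0.5 at inference, while training compares them to the ground-truth positive-pixel labeling with per-pixel binary cross-entropy. The logits and ground truth below are invented for illustration (a real mask head outputs something like a 28x28 grid per RoI).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy per-pixel logits from a hypothetical mask head for one RoI
# (values invented for illustration).
logits = np.array([[-3.0, -1.0,  2.0],
                   [-2.0,  0.5,  3.0],
                   [-4.0, -0.5,  1.5]])

# Inference: sigmoid, then threshold at 0.5 to get the binary mask.
probs = sigmoid(logits)
binary_mask = (probs >= 0.5).astype(np.uint8)
print(binary_mask)

# Training: per-pixel binary cross-entropy against the ground-truth
# labeling of positive (object) pixels -- this is why the result
# depends on how positive pixels were labeled.
gt = np.array([[0, 0, 1],
               [0, 1, 1],
               [0, 0, 1]], dtype=np.float64)
eps = 1e-9
bce = -np.mean(gt * np.log(probs + eps) + (1 - gt) * np.log(1 - probs + eps))
print(bce)
```

Whatever pixels the annotators marked as positive is what this loss pulls the mask branch toward, so background/foreground boundaries are learned from the labels, not defined by the network itself.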
Mask R-CNN consists of different stages:
Stage 1: the ResNet backbone, which is a stack of CNN layers that learns basic and advanced features from the COCO images.
Stage 2: the Region Proposal Network (RPN), which creates region proposals by sliding over the backbone's feature map and scoring anchor boxes (this learned network replaces the older "Selective Search" method used in the original R-CNN). Every time the model starts a new iteration, the CNN features are updated by the backbone, and new region proposals are created from the updated features.
Stage 3: the Fully Convolutional Network (FCN) mask head, which creates a mask for each region proposal. The FCN is applied to each region proposal as if it were a separate small image by itself; it can be seen as fine-tuning the features created by the backbone.
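To illustrate Stage 2, here is a minimal sketch of how an RPN tiles anchor boxes over the backbone feature map before scoring them. The stride, scales, and aspect ratios below are illustrative defaults, not the exact Mask R-CNN configuration.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Tile (scale, ratio) anchor boxes at every feature-map cell.

    Returns an array of shape
    (feat_h * feat_w * len(scales) * len(ratios), 4)
    of (x1, y1, x2, y2) boxes in input-image coordinates.
    Stride/scales/ratios here are illustrative, not the exact
    Mask R-CNN settings.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell, projected into the image.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Width/height chosen so the box area stays ~s*s
                    # while the aspect ratio varies.
                    w = s * np.sqrt(r)
                    h = s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

boxes = generate_anchors(4, 4)
print(boxes.shape)  # (96, 4): 4*4 cells x 2 scales x 3 ratios
```

The RPN then predicts an objectness score and a box refinement for every one of these anchors from the shared CNN features, which is why the proposals improve as the backbone features improve during training.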
Generally speaking, Mask R-CNN learns features starting from edges and shapes up to advanced features like cars, people, etc.
Hi. I have a question: what does Mask R-CNN actually learn? Does it learn shapes? Does it learn transitions from background to object? Both? Or something else? Thank you.