DeepCO3: Deep Instance Co-segmentation by Co-peak Search and Co-saliency Detection

What is instance co-segmentation?

Given a set of images jointly covering object instances of a specific category, instance co-segmentation aims to identify all of these instances and segment each of them, i.e. generating one mask for each instance.
Unlike semantic or instance segmentation, no annotated masks are available, and the number of instances in each image is unknown.
Previous methods for instance segmentation rely on annotated data to learn their models. Although effective and efficient at test time, the learned models are not applicable to unseen object categories.
The proposed method does not need any pre-training procedure on additional data annotations.
However, it simplifies the task by assuming that the given set of images covers only one specific category.

Method

Variables in the data flow

Input: A pair of images I_n (W, H, c) and I_m (W, H, c) containing object instances of a specific category.
Two feature maps output from g (pretrained VGG-16; not fixed): F_n (w, h, d); F_m (w, h, d).
(Inter-image) correlation tensor T_{nm}: (w, h, w, h), where T_{nm}[i,j,s,t] = normalized inner product of F_n[i,j] and F_m[s,t].
(Intra-image) saliency map \widetilde{S_k} (w, h) (k ∈ {n, m}); S_k (W, H) is the high-resolution co-saliency map.
To jointly consider the two sources of information above, a saliency-guided correlation tensor T_{nm}^{s} (w, h, w, h) is constructed with its elements defined as: T_{nm}^{s}[i,j,s,t] = \widetilde{S_n}[i,j] \widetilde{S_m}[s,t] T_{nm}[i,j,s,t].
Instance-aware heatmaps {O_n^i}.
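
The two tensors defined above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: "normalized inner product" is read here as cosine similarity, the random arrays stand in for the VGG-16 features and saliency maps, and the function names and the small `eps` constant are hypothetical.

```python
import numpy as np

def correlation_tensor(F_n, F_m, eps=1e-8):
    """T[i, j, s, t] = cosine similarity between F_n[i, j] and F_m[s, t]."""
    # L2-normalize every d-dimensional feature vector (eps avoids division by zero).
    F_n = F_n / (np.linalg.norm(F_n, axis=-1, keepdims=True) + eps)
    F_m = F_m / (np.linalg.norm(F_m, axis=-1, keepdims=True) + eps)
    # Inner product over the feature dimension for every pair of locations.
    return np.einsum("ijd,std->ijst", F_n, F_m)

def saliency_guided_tensor(T, S_n, S_m):
    """T^s[i, j, s, t] = S_n[i, j] * S_m[s, t] * T[i, j, s, t]."""
    return S_n[:, :, None, None] * S_m[None, None, :, :] * T

# Random stand-ins for the feature maps (w, h, d) and saliency maps (w, h).
rng = np.random.default_rng(0)
w, h, d = 4, 5, 8
F_n = rng.standard_normal((w, h, d))
F_m = rng.standard_normal((w, h, d))
S_n = rng.random((w, h))
S_m = rng.random((w, h))

T = correlation_tensor(F_n, F_m)
T_s = saliency_guided_tensor(T, S_n, S_m)
print(T.shape, T_s.shape)
```

Broadcasting keeps the saliency weighting free of explicit loops; each entry of T is only kept high in T^s if both locations are salient.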

Divide the task into Co-peak Search and Instance Mask Segmentation

Co-peak search: Develop a CNN to detect the co-peaks and co-saliency maps for a pair of images.
Instance mask segmentation: Takes the detected co-peaks and co-saliency maps and selects object proposals to produce the final results.

Co-peak search

(Figure credit: Zhou et al.)

Loss overview:

Inspired by Zhou et al.: object instances often cover the peaks in a response map of a classifier.
Propose a co-peak loss to detect the co-peaks in two images.
Propose an affinity loss and a saliency loss to complement the co-peak loss, avoiding false positives and false negatives.
Affinity loss: Separating foreground and background features.
Saliency loss: Estimates saliency maps to localize the co-salient objects in an image, making model focus on co-peak search in co-saliency regions.

Co-peak loss:

A "co-peak" is defined as a local maximum in T_{nm}^{s} within a 4D local window of size 3x3x3x3 (so there will be a set of co-peaks).
Suppose T_{nm}^{s}[p,q] is a co-peak (p=[i,j]; q=[s,t]), which means:
- (1) F_n[p] and F_m[q] are salient and are the most similar to each other (may reside in salient object instances).
- (2) F_n[p] and F_m[q] are also likely to belong to the same object category.
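
The co-peak definition above (a local maximum within a 3x3x3x3 window) can be sketched as follows. This is an illustrative NumPy version only: the `threshold` parameter and the tie handling (plateau entries all count as peaks) are assumptions that the notes do not specify.

```python
import itertools
import numpy as np

def find_co_peaks(T_s, threshold=0.0):
    """Return the (i, j, s, t) indices that are maxima of their 3x3x3x3 window."""
    T_s = np.asarray(T_s, dtype=float)
    # Pad with -inf so that border entries only compete with real neighbors.
    padded = np.pad(T_s, 1, mode="constant", constant_values=-np.inf)
    w, h, w2, h2 = T_s.shape
    # Hypothetical threshold: ignore weak responses.
    is_peak = T_s > threshold
    for di, dj, ds, dt in itertools.product(range(3), repeat=4):
        if (di, dj, ds, dt) == (1, 1, 1, 1):
            continue  # the center of the window is the entry itself
        neighbor = padded[di:di + w, dj:dj + h, ds:ds + w2, dt:dt + h2]
        is_peak &= T_s >= neighbor
    return np.argwhere(is_peak)

# Toy tensor: two isolated responses plus one weaker response next to the first.
T_s = np.zeros((4, 4, 4, 4))
T_s[1, 2, 3, 0] = 0.9
T_s[1, 2, 3, 1] = 0.5  # adjacent to the 0.9 entry, so it is suppressed
T_s[2, 1, 0, 3] = 0.7
peaks = find_co_peaks(T_s)
print(peaks.tolist())
```

Each returned row pairs a location p = [i, j] in I_n with a location q = [s, t] in I_m.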
Finally, loss is defined as (maximizing co-peaks):

Affinity loss:

The affinity loss aims to make the features in salient regions similar to each other while being distinct from those in the background.

(Note that this equation is corrected by the author, different from the original paper.)

In the first term, if p and q are salient pixels, the product of their saliency values (\hat{S_n}(p) \times \hat{S_m}(q)) should be high. Therefore, a high affinity between p and q should be enforced, and thus we want to make T_{nm}(p,q) higher. However, this is a minimization problem, so we minimize (1-T_{nm}(p,q)) instead.
In the second term, if p is a salient pixel and q is a background pixel, the difference of their saliency values (\hat{S_n}(p) - \hat{S_m}(q)) should be high. Therefore, a low affinity between p and q should be enforced, and thus we want to make T_{nm}(p,q) lower.
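
The behavior of the two terms described above can be checked on a toy example. The sketch below is not the paper's equation: averaging over all pixel pairs and taking the absolute saliency difference are assumptions made for illustration, and `affinity_terms` is a hypothetical name.

```python
import numpy as np

def affinity_terms(T, S_hat_n, S_hat_m):
    """Return (pull, push): the two penalties described in the notes.

    pull: salient-salient pairs are penalized by (1 - T), so T is pushed up.
    push: salient-background pairs are penalized by T, so T is pushed down.
    """
    prod = S_hat_n[:, :, None, None] * S_hat_m[None, None, :, :]
    # Assumption: absolute difference, so the order of p and q does not matter.
    diff = np.abs(S_hat_n[:, :, None, None] - S_hat_m[None, None, :, :])
    pull = float(np.mean(prod * (1.0 - T)))
    push = float(np.mean(diff * T))
    return pull, push

# Each image has one salient pixel (index 0) and one background pixel (index 1).
S_hat_n = np.array([[1.0, 0.0]])
S_hat_m = np.array([[1.0, 0.0]])

# Desired affinities: high only between the two salient pixels.
T_good = np.zeros((1, 2, 1, 2))
T_good[0, 0, 0, 0] = 1.0
# Opposite pattern: low between salient pixels, high everywhere else.
T_bad = 1.0 - T_good

good = affinity_terms(T_good, S_hat_n, S_hat_m)
bad = affinity_terms(T_bad, S_hat_n, S_hat_m)
print(good, bad)
```

Background-background pairs have zero product and zero difference, so under this reading they do not contribute to either term.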

The proposed affinity loss generalizes eq (4) to consider both inter-image and intra-image affinities:

Saliency loss (follows Hsu et al.):

Use the off-the-shelf unsupervised saliency detection method, SVFSal (Zhang et al.), which produces the saliency map \hat{S_n} for image I_n.
S_n is the saliency map predicted by the model for I_n, and \rho_{n}(p) is a weight representing the importance of pixel p, which deals with the imbalance between the salient and non-salient pixels.
\rho_{n}(p) is set to (1-ε) if p is salient and ε otherwise, where ε is the ratio of the salient area to the whole image.
The mean value of \hat{S_n} is used as the threshold to divide \hat{S_n} into the salient and non-salient regions. In this way, the salient and non-salient regions contribute equally.
Except for deconv layer, our model produces maps {S_n} derived by the three losses jointly (thus called co-saliency maps).
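
The weighting scheme above can be verified numerically. This is a minimal sketch: the weighted squared error is an assumed form of the saliency loss (the equation is not reproduced in these notes), the strict `>` comparison against the mean is an assumption, and the function names are hypothetical.

```python
import numpy as np

def saliency_weights(S_hat):
    """Weights rho: (1 - eps) on salient pixels, eps elsewhere."""
    # The mean of S_hat is the threshold between salient and non-salient.
    salient = S_hat > S_hat.mean()
    # eps is the fraction of the image that is salient.
    eps = float(salient.mean())
    rho = np.where(salient, 1.0 - eps, eps)
    return rho, salient, eps

def weighted_saliency_error(S_pred, S_hat):
    """Assumed loss form: sum over pixels of rho(p) * (S(p) - S_hat(p))^2."""
    rho, _, _ = saliency_weights(S_hat)
    return float(np.sum(rho * (S_pred - S_hat) ** 2))

# Toy map: 4 salient pixels (0.9) and 12 non-salient pixels (0.1).
S_hat = np.full((4, 4), 0.1)
S_hat[:2, :2] = 0.9
rho, salient, eps = saliency_weights(S_hat)

# Total weight of each region: eps * (1 - eps) * N for both.
print(eps, rho[salient].sum(), rho[~salient].sum())
```

The salient region has eps * N pixels weighted by (1 - eps), and the non-salient region has (1 - eps) * N pixels weighted by eps, which is why the two totals match.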

Instance mask segmentation

For each peak p_n^i in image I_n, run peak back-propagation (Zhou et al.) to produce instance-aware heatmap O_n^i.
Use an unsupervised method, multi-scale combinatorial grouping (MCG; Jordi et al.), to produce a set of instance proposals for image I_n.
Extend the ranking function in Zhou et al. and select the top-ranked proposal as the mask for each detected peak. (\hat{P} is the contour of the proposal P and the operator * denotes the Frobenius inner product.)
Perform non-maximum suppression (NMS) in the end to remove redundancies.
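
The final NMS step can be sketched as a greedy procedure over binary masks. This is a generic mask NMS, not necessarily the exact variant used by the authors; the IoU threshold of 0.5 and the toy scores are assumptions.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mask_nms(masks, scores, iou_threshold=0.5):
    """Keep masks in descending score order, dropping those that overlap a kept one."""
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) <= iou_threshold for k in keep):
            keep.append(int(idx))
    return keep

# Three toy masks on a 6x6 grid: the second heavily overlaps the first.
a = np.zeros((6, 6), dtype=bool)
a[0:3, 0:3] = True
b = np.zeros((6, 6), dtype=bool)
b[0:3, 0:4] = True
c = np.zeros((6, 6), dtype=bool)
c[3:6, 3:6] = True

keep = mask_nms([a, b, c], scores=[0.9, 0.8, 0.7])
print(keep)
```

Here the second mask has IoU 9/12 = 0.75 with the first and is removed, while the disjoint third mask survives.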

Evaluation

General preprocessing

Remove the images in which objects of more than one category are present.
Discard categories that contain fewer than 10 images.
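
The two preprocessing rules above amount to a simple filter. The sketch below assumes a hypothetical input format (a dict from image id to the set of category names present in that image); the actual dataset tooling is not described in these notes.

```python
from collections import defaultdict

def filter_dataset(image_categories, min_images=10):
    """Group single-category images by category and drop small categories."""
    by_category = defaultdict(list)
    for image_id, categories in image_categories.items():
        # Rule 1: keep only images whose objects all share one category.
        if len(categories) == 1:
            by_category[next(iter(categories))].append(image_id)
    # Rule 2: discard categories with fewer than min_images images.
    return {c: ids for c, ids in by_category.items() if len(ids) >= min_images}

# Toy annotations: 10 cat images, 3 dog images, 1 image with both.
annotations = {f"cat_{i}": {"cat"} for i in range(10)}
annotations.update({f"dog_{i}": {"dog"} for i in range(3)})
annotations["mixed_0"] = {"cat", "dog"}

kept = filter_dataset(annotations)
print({c: len(ids) for c, ids in kept.items()})
```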

Datasets

MS COCO 2017:
- Remove images that do not contain at least two instances; 44 categories remain.
- To compare with methods trained on VOC, divide COCO into COCO-VOC (12 categories covered by VOC) and COCO-NONVOC (remaining 32 categories).
PASCAL VOC 2012: 18 categories remain after general preprocessing.
SOC: A dataset for saliency detection (contains image-level labels and instance-aware annotations). 5 categories remain.

Performance

Since this is a new task, they propose to compare with methods of object co-localization, class-agnostic saliency segmentation, and weakly supervised instance segmentation.
Converting a bounding box to an instance segment: Apply MCG to that image to generate a set of instance proposals and retrieve the proposal with the highest IOU with the bounding box to represent it.
Converting an instance segment to a bounding box: Simply use the bounding box of that instance segment.
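
Both conversions can be sketched in a few lines. The notes do not say how the IoU between a proposal and a bounding box is measured; the version below compares the box with the tight box of each proposal, which is an assumption. Boxes are (x_min, y_min, x_max, y_max) with inclusive pixel coordinates, masks are assumed non-empty, and the proposal list stands in for MCG output.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tight inclusive box (x_min, y_min, x_max, y_max) around a non-empty mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def box_area(box):
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def box_iou(a, b):
    """IoU of two inclusive pixel boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    return inter / (box_area(a) + box_area(b) - inter)

def bbox_to_segment(box, proposals):
    """Return the proposal whose tight box has the highest IoU with `box`."""
    ious = [box_iou(box, mask_to_bbox(p)) for p in proposals]
    return proposals[int(np.argmax(ious))]

# Two toy proposals on an 8x8 grid.
p0 = np.zeros((8, 8), dtype=bool)
p0[1:4, 1:4] = True
p1 = np.zeros((8, 8), dtype=bool)
p1[2:7, 2:7] = True

box = (2, 2, 6, 5)
chosen = bbox_to_segment(box, [p0, p1])
print(mask_to_bbox(chosen), box_iou(box, mask_to_bbox(p1)))
```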

Ablation studies

Qualitative results

How is the result computed given 3 images? After optimizing Eq. (1), we simply use the detected peaks on the estimated co-saliency maps as the final co-peaks, because detecting the co-peaks on all possible image pairs is complicated.

How are the (local) peaks sampled from the predicted co-saliency map for an image? They use a 3 x 3 local window to sample peaks on the co-saliency map, and then apply the peak back-propagation proposed in PRM (Zhou et al.) to the sampled peaks.
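
Sampling peaks with a 3 x 3 local window can be sketched as below. The `threshold` on the saliency value is a hypothetical addition (the notes do not state one), and ties inside a window all count as peaks in this version.

```python
import numpy as np

def sample_peaks(S, threshold=0.5):
    """Return (row, col) positions that are maxima of their 3x3 neighborhood."""
    S = np.asarray(S, dtype=float)
    # Pad with -inf so border pixels are compared with real neighbors only.
    padded = np.pad(S, 1, mode="constant", constant_values=-np.inf)
    H, W = S.shape
    is_peak = S > threshold
    for dy in range(3):
        for dx in range(3):
            is_peak &= S >= padded[dy:dy + H, dx:dx + W]
    return [tuple(int(v) for v in idx) for idx in np.argwhere(is_peak)]

# Toy co-saliency map with two objects; (1, 2) is a weaker neighbor of (1, 1).
S = np.zeros((5, 5))
S[1, 1] = 0.9
S[1, 2] = 0.6
S[3, 3] = 0.8
peaks = sample_peaks(S)
print(peaks)
```

Each sampled peak would then be the starting point of one peak back-propagation pass.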

Related Work

Weakly supervised instance segmentation using class peak response by Zhou et al. CVPR 2018 (Spotlight).
Supervision by fusion: Towards unsupervised learning of deep salient object detector by Zhang et al. ICCV 2017.
Unsupervised CNN-based cosaliency detection with graphical optimization by Hsu et al. (the same group as this work). ECCV 2018.
Multiscale combinatorial grouping for image segmentation and object proposal generation by Jordi et al. TPAMI 2017.
