MAttNet: Modular Attention Network for Referring Expression Comprehension

Original paper: MAttNet: Modular Attention Network for Referring Expression Comprehension

Journal/Conference: CVPR 2018

Authors: Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg

Objective

This work proposes a modular network for referring expression comprehension - Modular Attention Network (MAttNet) - that takes a natural language expression as input and softly decomposes it into three phrase embeddings (for subject, location, and relationship comprehension). These embeddings are used to trigger three separate visual modules, which are finally combined into an overall region score based on the module weights.

There are three main novelties in MAttNet:

Related works

Dataset

For "RefCOCO" and "RefCOCO+" there two different sets:

For "RefCOCOg" there are two different sets too:

Most experiments are run on the RefCOCOg val split.

Methodology

Model

Language Attention Network

We first embed each word $u_t$ into a vector $e_t$ using a one-hot word embedding, then a bidirectional LSTM-RNN is applied to encode the whole expression.

$$ \begin{aligned} e_{t} &=\operatorname{embedding}\left(u_{t}\right) \\ \overrightarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overrightarrow{h}_{t-1}\right) \\ \overleftarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overleftarrow{h}_{t+1}\right) \\ h_{t} &=\left[\overrightarrow{h}_{t}, \overleftarrow{h}_{t}\right] \end{aligned} $$
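
A minimal PyTorch sketch of this expression encoder; the vocabulary size, word dimension, and hidden dimension below are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    def __init__(self, vocab_size=2000, word_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, word_dim)   # e_t = embedding(u_t)
        self.lstm = nn.LSTM(word_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, words):            # words: (batch, T) word indices
        e = self.embedding(words)        # (batch, T, word_dim)
        h, _ = self.lstm(e)              # h_t = [fwd_h_t, bwd_h_t], (batch, T, 2*hidden_dim)
        return e, h

# usage with dummy data
enc = ExpressionEncoder()
u = torch.randint(0, 2000, (4, 10))      # a batch of 4 expressions of length 10
e, H = enc(u)
```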

Given $H=\left\{h_{t}\right\}_{t=1}^{T}$, we apply three trainable vectors $f_m$, where $m \in \{\operatorname{subj}, \operatorname{loc}, \operatorname{rel}\}$, computing the attention on each word for each module:

$$ a_{m, t}=\frac{\exp \left(f_{m}^{T} h_{t}\right)}{\sum_{k=1}^{T} \exp \left(f_{m}^{T} h_{k}\right)} $$

The weighted sum of word embeddings is used as the modular phrase embedding:

$$ q^{m}=\sum_{t=1}^{T} a_{m, t} e_{t} $$
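
Continuing the sketch, one module's word attention and phrase embedding could look like the following; the class name, dimensions, and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Computes a_{m,t} and the phrase embedding q^m for a single module m."""
    def __init__(self, hidden_dim=1024, word_dim=512):
        super().__init__()
        self.f_m = nn.Parameter(torch.randn(hidden_dim))  # trainable vector f_m

    def forward(self, e, h):
        # e: (batch, T, word_dim) word embeddings, h: (batch, T, hidden_dim) bi-LSTM states
        scores = h.matmul(self.f_m)                # f_m^T h_t, (batch, T)
        a = F.softmax(scores, dim=1)               # attention over words
        q_m = (a.unsqueeze(-1) * e).sum(dim=1)     # weighted sum of word embeddings
        return q_m, a
```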

We compute 3 module weights for the expression, weighting how much each module contributes to the expression-object score.

We concatenate the first and last hidden vectors from $H$, which memorize both the structure and the semantics of the whole expression, then use another fully-connected (FC) layer to transform the concatenation into 3 module weights:

$$ \left[w_{subj}, w_{loc}, w_{rel}\right]=\operatorname{softmax}\left(W_{m}^{T}\left[h_{0}, h_{T}\right]+b_{m}\right) $$
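
The module weights can be sketched as a single FC layer over the concatenated first and last hidden states; again, the class name and dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModuleWeights(nn.Module):
    """[w_subj, w_loc, w_rel] = softmax(W_m^T [h_0, h_T] + b_m)."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.fc = nn.Linear(2 * hidden_dim, 3)

    def forward(self, h):                        # h: (batch, T, hidden_dim)
        h0, hT = h[:, 0, :], h[:, -1, :]         # first and last hidden states
        w = F.softmax(self.fc(torch.cat([h0, hT], dim=1)), dim=1)
        return w                                 # (batch, 3) module weights
```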

Visual Modules

We compute the matching score for each candidate object $o_i$ given each modular phrase embedding, i.e., $S(o_i|q^{subj})$, $S(o_i|q^{loc})$, and $S(o_i|q^{rel})$.

Subject Module

The subject module handles two tasks: attribute prediction and phrase-guided attentional pooling.

Attribute Prediction

Attributes are frequently used in referring expressions to differentiate between objects of the same category.

While preparing the attribute labels in the training set, we first run a template parser [13] to obtain color and generic attribute words.

A binary cross-entropy loss is used for multi-attribute classification:

$$ L_{subj}^{attr}=\lambda_{attr} \sum_{i} \sum_{j} w_{j}^{attr}\left[y_{ij} \log \left(p_{ij}\right)+\left(1-y_{ij}\right) \log \left(1-p_{ij}\right)\right] $$ where $w_{j}^{attr}=1 / \sqrt{\mathrm{freq}_{attr}}$.

Note: $w_j^{attr}$ is used to balance the attribute labels.
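
A sketch of this frequency-weighted binary cross-entropy, written in the standard minimized form (with the $y_{ij}\log p_{ij}$ term and an overall minus sign); the tensor names and the `1e-8` stabilizer are my own:

```python
import torch

def attribute_loss(p, y, freq, lambda_attr=1.0):
    """Frequency-weighted BCE over attributes.
    p: predicted probabilities (batch, num_attrs); y: binary labels; freq: attribute counts."""
    w = 1.0 / freq.sqrt()                              # w_j^attr = 1 / sqrt(freq_j)
    bce = y * torch.log(p + 1e-8) + (1 - y) * torch.log(1 - p + 1e-8)
    return -lambda_attr * (w * bce).sum()

# usage with fake data
p = torch.sigmoid(torch.randn(4, 50))
y = (torch.rand(4, 50) > 0.9).float()
freq = torch.randint(1, 1000, (50,)).float()
loss = attribute_loss(p, y, freq)
```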

Phrase-guided Attentional Pooling

We allow our subject module to localize relevant regions within a bounding box through “in-box” attention.

We use a 1×1 convolution to fuse the attribute blob and C4 into a subject blob $V \in \mathbb{R}^{d \times G}$, where $G = 14 \times 14$. Given the subject phrase embedding $q^{subj}$, we compute its attention on each grid location: $$ \begin{aligned} H_{a} &=\tanh \left(W_{v} V+W_{q} q^{subj}\right) \\ a^{v} &=\operatorname{softmax}\left(w_{h, a}^{T} H_{a}\right) \end{aligned} $$

The final subject visual representation for the candidate region $o_i$ is the attention-weighted sum over the grid: $$ \widetilde{v}_{i}^{subj}=\sum_{i=1}^{G} a_{i}^{v} v_{i} $$
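
A sketch of this phrase-guided attentional pooling; the layer sizes and the assumption that $V$ arrives as a (batch, G, d) tensor are mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseGuidedPooling(nn.Module):
    """Attention over the G=14x14 grid of the subject blob, guided by q^subj."""
    def __init__(self, vis_dim=512, phrase_dim=512, att_dim=512):
        super().__init__()
        self.W_v = nn.Linear(vis_dim, att_dim)
        self.W_q = nn.Linear(phrase_dim, att_dim)
        self.w_ha = nn.Linear(att_dim, 1)

    def forward(self, V, q_subj):
        # V: (batch, G, vis_dim) subject blob, q_subj: (batch, phrase_dim)
        H_a = torch.tanh(self.W_v(V) + self.W_q(q_subj).unsqueeze(1))  # (batch, G, att_dim)
        a_v = F.softmax(self.w_ha(H_a).squeeze(-1), dim=1)             # (batch, G)
        v_subj = (a_v.unsqueeze(-1) * V).sum(dim=1)                    # weighted sum over grid
        return v_subj, a_v
```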

Matching Function

We measure the similarity $S\left(o_{i} | q^{subj}\right) = F\left(\widetilde{v}_{i}^{subj}, q^{subj}\right)$ between the subject representation $\widetilde{v}_{i}^{subj}$ and the phrase embedding $q^{subj}$ using a matching function $F$, as shown in Fig. 3. The same matching function is used to compute the location score $S\left(o_{i} | q^{loc}\right)$ and the relationship score $S\left(o_{i} | q^{rel}\right)$.
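
Fig. 3 is not reproduced here; my reading of the paper and the released code is that $F$ passes each input through an MLP plus L2 normalization and then takes their inner product. The sketch below reflects that reading, with assumed layer sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingFunction(nn.Module):
    """S(o_i | q^m) = F(v_i^m, q^m): MLP + L2 normalization on each side, then inner product."""
    def __init__(self, vis_dim=512, phrase_dim=512, joint_dim=512):
        super().__init__()
        self.vis_mlp = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                     nn.Linear(joint_dim, joint_dim))
        self.lang_mlp = nn.Sequential(nn.Linear(phrase_dim, joint_dim), nn.ReLU(),
                                      nn.Linear(joint_dim, joint_dim))

    def forward(self, v, q):
        v = F.normalize(self.vis_mlp(v), p=2, dim=1)
        q = F.normalize(self.lang_mlp(q), p=2, dim=1)
        return (v * q).sum(dim=1)                # (batch,) matching scores
```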

Location Module

Here $l_i$ encodes the absolute position and size of $o_i$, and $\delta l_i$ its offsets relative to up to five surrounding objects of the same category. Together: $$ \widetilde{l}_{i}^{loc}=W_{l}\left[l_{i} ; \delta l_{i}\right]+b_{l} $$

Location matching score: $$ S\left(o_{i} | q^{loc}\right)=F\left(\widetilde{l}_{i}^{loc}, q^{loc}\right) $$
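
A sketch of the location representation, assuming a 5-dimensional $l_i$ and 5-dimensional offsets to up to five same-category neighbors; the class name and output dimension are illustrative:

```python
import torch
import torch.nn as nn

class LocationModule(nn.Module):
    """Builds the location representation l~_i = W_l [l_i; dl_i] + b_l."""
    def __init__(self, out_dim=512, num_neighbors=5):
        super().__init__()
        # l_i: 5 dims; dl_i: 5 dims per same-category neighbor (assumption)
        self.fc = nn.Linear(5 + 5 * num_neighbors, out_dim)

    def forward(self, l_i, dl_i):
        # l_i: (batch, 5) absolute position/size, dl_i: (batch, 25) relative offsets
        return self.fc(torch.cat([l_i, dl_i], dim=1))
```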

Relationship Module

While the subject module deals with "in-box" details about the target object, some other expressions may involve its relationship with other "out-of-box" objects, e.g., "cat on chaise lounge". The relationship module is used to address these cases.

As in Fig. 5, given a candidate object $o_i$, we first look for its surrounding (up to five) objects $o_{ij}$, regardless of their categories.

We use the average-pooled C4 feature as the appearance feature $v_{ij}$ of each supporting object.

Question: how is $v_{ij}$ computed? After reading the code, it turns out to simply be the visual embeddings of the five surrounding objects; see line 93 of joint_match.py. Pairwise relations are not considered, i.e., $\forall i,j,k \in \{1,2,\dots,n\},\ v_{ij} = v_{kj}$, where $n$ is the number of surrounding objects.
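
A sketch of how the relationship score could be computed from the (up to five) surrounding objects, taking the maximum per-neighbor matching score; the feature dimensions and the reuse of the matching function above are assumptions:

```python
import torch
import torch.nn as nn

class RelationshipModule(nn.Module):
    """Scores a candidate against q^rel via its surrounding objects,
    taking the max over per-neighbor matching scores."""
    def __init__(self, vis_dim=1024, loc_dim=5, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(vis_dim + loc_dim, out_dim)

    def forward(self, v_ij, dl_ij, q_rel, matcher):
        # v_ij: (batch, 5, vis_dim) avg-pooled C4 features of surrounding objects
        # dl_ij: (batch, 5, loc_dim) their relative location offsets
        # matcher: a MatchingFunction instance used as F(., q^rel)
        b, n, _ = v_ij.shape
        v_rel = self.fc(torch.cat([v_ij, dl_ij], dim=-1))            # (batch, 5, out_dim)
        scores = matcher(v_rel.view(b * n, -1),
                         q_rel.unsqueeze(1).expand(-1, n, -1).reshape(b * n, -1))
        return scores.view(b, n).max(dim=1).values                   # max over neighbors
```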

Loss Function

The overall weighted matching score for candidate object $o_i$ and expression $r$ is: $$ S\left(o_{i} | r\right)=w_{subj} S\left(o_{i} | q^{subj}\right)+w_{loc} S\left(o_{i} | q^{loc}\right)+w_{rel} S\left(o_{i} | q^{rel}\right) $$

We randomly sample two negative pairs $(o_i, r_j)$ and $(o_k, r_i)$, where $r_j$ is an expression describing some other object and $o_k$ is some other object in the same image. $$ \begin{aligned} L_{rank}=\sum_{i} &\left[\lambda_{1} \max \left(0, \Delta+S\left(o_{i} | r_{j}\right)-S\left(o_{i} | r_{i}\right)\right)\right. \\ &\left.+\lambda_{2} \max \left(0, \Delta+S\left(o_{k} | r_{i}\right)-S\left(o_{i} | r_{i}\right)\right)\right] \end{aligned} $$

Together: $$ L=L_{subj}^{attr}+L_{rank} $$
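
A sketch of the combined score and the ranking loss; the margin and $\lambda$ values are placeholders:

```python
import torch

def overall_score(w, s_subj, s_loc, s_rel):
    """S(o_i | r) = w_subj*S_subj + w_loc*S_loc + w_rel*S_rel, with w: (batch, 3)."""
    return w[:, 0] * s_subj + w[:, 1] * s_loc + w[:, 2] * s_rel

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1, lam1=1.0, lam2=1.0):
    """Hinge losses over the two sampled negative pairs (o_i, r_j) and (o_k, r_i)."""
    l1 = torch.clamp(margin + s_neg_expr - s_pos, min=0)   # wrong expression, same object
    l2 = torch.clamp(margin + s_neg_obj - s_pos, min=0)    # wrong object, same expression
    return (lam1 * l1 + lam2 * l2).sum()
```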

Result

Referring Expression Comprehension

Segmentation from Referring Expression

Conclusion

Our modular attention network addresses variance in referring expressions by attending to both relevant words and visual regions in a modular framework, and by dynamically computing an overall matching score. We demonstrate the model's effectiveness on bounding-box-level and pixel-level comprehension, significantly outperforming state-of-the-art approaches.

Thoughts


Link for code/model/dataset

Demo site:

http://vision2.cs.unc.edu/refer/comprehension

Source code for training/evaluation:

https://github.com/lichengunc/MAttNet

Datasets (RefCOCO, RefCOCO+, RefCOCOg): https://github.com/lichengunc/refer


References

[1] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. In ICML, 2017.

[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In NAACL, 2016.

[3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.

[4] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.

[8] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.

[10] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.

[11] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.

[12] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.

[13] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.

[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[16] J. Liu, L. Wang, and M.-H. Yang. Referring expression generation and comprehension via attributes. In ICCV, 2017.

[18] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.

[19] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.

[20] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.

[22] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.

[26] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.

[32] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.


HackMD: https://hackmd.io/@a7LSlLVKSweNTyYDH7dYRw/BJjf2BfLB