MAttNet: Modular Attention Network for Referring Expression Comprehension
Original paper: MAttNet: Modular Attention Network for Referring Expression Comprehension
Journal/Conference: CVPR 2018
Authors: Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg
Objective
This work proposes a modular network for referring expression comprehension, the Modular Attention Network (MAttNet), which takes a natural language expression as input and softly decomposes it into three phrase embeddings (for subject, location, and relationship comprehension). These embeddings trigger three separate visual modules whose scores are finally combined into an overall region score using the module weights.
There are three main novelties in MAttNet:
It is designed for general referring expressions.
It learns to parse expressions through a soft attention-based mechanism.
It applies different visual attention techniques in the subject and relationship modules to attend to the relevant image portions.
Related works
Referring Expression Comprehension
CNN-LSTM: [11, 18, 19, 20, 32]
Joint embedding model: [4, 16, 22, 26]
Modular Networks
VQA: [3]
Visual reasoning: [8, 12]
QA: [2]
Relationship modeling: [10]
Multitask reinforcement learning: [1]
Need external parser: [2, 3, 12]
End-to-end: [8, 10]
Most related work: [10]
Dataset
RefCOCO, RefCOCO+: [13]
RefCOCOg: [19]
MSCOCO: [14]
For "RefCOCO" and "RefCOCO+" there are two test sets:
testA: images containing multiple people
testB: images containing multiple objects
For "RefCOCOg" there are two different sets too:
val*: data is split by objects, so the same image can appear in both training and validation.
val: data is split by images.
Most experiments are run on RefCOCOg-val.
Methodology
Model
Language Attention Network
We first embed each word $u_t$ into a vector $e_t$ using a one-hot word embedding, then apply a bidirectional LSTM to encode the whole expression:
$$
\begin{aligned}
e_{t} &=\operatorname{embedding}\left(u_{t}\right) \\
\overrightarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overrightarrow{h}_{t-1}\right) \\
\overleftarrow{h}_{t} &=\operatorname{LSTM}\left(e_{t}, \overleftarrow{h}_{t+1}\right) \\
h_{t} &=\left[\overrightarrow{h}_{t}, \overleftarrow{h}_{t}\right]
\end{aligned}
$$
Given $H=\{h_{t}\}_{t=1}^{T}$, we apply three trainable vectors $f_m$, where $m \in \{\text{subj}, \text{loc}, \text{rel}\}$, to compute the attention on each word for each module:
$$
a_{m, t}=\frac{\exp \left(f_{m}^{T} h_{t}\right)}{\sum_{k=1}^{T} \exp \left(f_{m}^{T} h_{k}\right)}
$$
The weighted sum of word embeddings is used as the modular phrase embedding:
$$
q^{m}=\sum_{t=1}^{T} a_{m, t} e_{t}
$$
We also compute three module weights for the expression, weighting how much each module contributes to the expression-object score.
We concatenate the first and last hidden vectors from $H$, which memorize both the structure and the semantics of the whole expression, then use another fully-connected (FC) layer to transform the concatenation into the three module weights:
$$
\left[w_{subj}, w_{loc}, w_{rel}\right]=\operatorname{softmax}\left(W_{m}^{T}\left[h_{0}, h_{T}\right]+b_{m}\right)
$$
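The sketch below shows one way to implement this language attention network in PyTorch; the class name, dimensions, and padding handling (omitted for brevity) are my assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Encodes an expression and produces per-module phrase embeddings q^m
    plus the three module weights [w_subj, w_loc, w_rel]."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # One trainable attention vector f_m per module (subj, loc, rel).
        self.f = nn.Parameter(torch.randn(3, 2 * hidden_dim))
        # FC layer mapping the concatenated [h_0, h_T] to 3 module weights.
        self.w_fc = nn.Linear(4 * hidden_dim, 3)

    def forward(self, words):                    # words: (B, T) token ids
        e = self.embed(words)                    # (B, T, emb_dim)
        h, _ = self.bilstm(e)                    # (B, T, 2*hidden_dim)
        # a_{m,t} = softmax_t(f_m^T h_t)
        att = torch.softmax(torch.einsum('md,btd->bmt', self.f, h), dim=2)
        # q^m = sum_t a_{m,t} e_t  -> (B, 3, emb_dim)
        q = torch.einsum('bmt,btd->bmd', att, e)
        # Module weights from the first and last hidden states.
        w = torch.softmax(self.w_fc(torch.cat([h[:, 0], h[:, -1]], dim=1)), dim=1)
        return q, w
```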
Visual Modules
Region proposals: Faster R-CNN
Backbone: ResNet / VGG
Features: C3 and C4 features
(optional) Segmentation: Mask R-CNN
We compute a matching score for each candidate object $o_i$ given each modular phrase embedding, i.e., $S(o_i | q^{subj})$, $S(o_i | q^{loc})$, and $S(o_i | q^{rel})$.
Subject Module
Two tasks:
Attribute prediction
Phrase-guided attentional pooling
Attribute Prediction
Attributes are frequently used in referring expressions to differentiate between objects of the same category.
While preparing the attribute labels in the training set, we first run a template parser [13] to obtain color and generic attribute words.
A binary cross-entropy loss is used for multi-attribute classification:
$$
L_{subj}^{attr}=\lambda_{attr} \sum_{i} \sum_{j} w_{j}^{attr}\left[y_{ij} \log \left(p_{ij}\right)+\left(1-y_{ij}\right) \log \left(1-p_{ij}\right)\right]
$$
where $w_{j}^{attr}=1 / \sqrt{\text{freq}_{attr}}$.
Note: $w_j^{attr}$ is used to balance the attribute labels.
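A minimal sketch of this frequency-weighted multi-label BCE, assuming attribute frequencies are available as a tensor (function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def attribute_loss(logits, labels, attr_freq, lambda_attr=1.0):
    """Frequency-weighted binary cross-entropy over attributes.

    logits:    (N, A) raw attribute scores for N regions, A attributes
    labels:    (N, A) binary ground-truth attribute labels
    attr_freq: (A,)   how often each attribute occurs in the training set
    """
    # w_j = 1 / sqrt(freq_j): down-weight frequent attributes.
    w = 1.0 / torch.sqrt(attr_freq.clamp(min=1.0))
    loss = F.binary_cross_entropy_with_logits(
        logits, labels.float(), weight=w.unsqueeze(0), reduction='sum')
    return lambda_attr * loss
```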
Phrase-guided Attentional Pooling
We allow our subject module to localize relevant regions within a bounding box through “in-box” attention.
We use a 1×1 convolution to fuse the attribute blob and the C4 blob into a subject blob $V \in \mathbb{R}^{d \times G}$, where $G = 14 \times 14$. Given the subject phrase embedding $q^{subj}$, we compute its attention on each grid location:
$$
\begin{aligned}
H_{a} &=\tanh \left(W_{v} V+W_{q} q^{subj}\right) \\
a^{v} &=\operatorname{softmax}\left(w_{h, a}^{T} H_{a}\right)
\end{aligned}
$$
The final subject visual representation for the candidate region $o_i$ is the attention-weighted sum over the grid features:
$$
\widetilde{v}_{i}^{subj}=\sum_{k=1}^{G} a_{k}^{v} v_{k}
$$
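A sketch of this phrase-guided in-box attention, assuming the 14×14 grid has been flattened to $G$ cells (module and dimension names are assumptions):

```python
import torch
import torch.nn as nn

class PhraseGuidedPooling(nn.Module):
    """Attention-pools a grid of subject features using q^subj."""
    def __init__(self, d, q_dim, att_dim=256):
        super().__init__()
        self.W_v = nn.Linear(d, att_dim, bias=False)
        self.W_q = nn.Linear(q_dim, att_dim, bias=False)
        self.w_ha = nn.Linear(att_dim, 1, bias=False)

    def forward(self, V, q_subj):          # V: (B, G, d), q_subj: (B, q_dim)
        # H_a = tanh(W_v V + W_q q^subj), broadcasting q over the G grid cells.
        H_a = torch.tanh(self.W_v(V) + self.W_q(q_subj).unsqueeze(1))
        a_v = torch.softmax(self.w_ha(H_a).squeeze(-1), dim=1)   # (B, G)
        # Weighted sum over grid locations -> pooled subject feature (B, d).
        return (a_v.unsqueeze(-1) * V).sum(dim=1)
```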
Matching Function
We measure the similarity $S\left(o_{i} | q^{subj}\right)=F\left(\widetilde{v}_{i}^{subj}, q^{subj}\right)$ between the subject representation $\widetilde{v}_{i}^{subj}$ and the phrase embedding $q^{subj}$ using a matching function $F$, as shown in Fig. 3. The same matching function is used to compute the location score $S\left(o_{i} | q^{loc}\right)$ and the relationship score $S\left(o_{i} | q^{rel}\right)$.
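Fig. 3 instantiates $F$ with small MLPs; the sketch below is one plausible version that projects both inputs into a joint space, L2-normalizes them, and takes an inner product. The layer sizes and the exact normalization are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingFunction(nn.Module):
    """F(v, q): similarity between a visual feature and a phrase embedding."""
    def __init__(self, v_dim, q_dim, joint_dim=512):
        super().__init__()
        self.v_mlp = nn.Sequential(nn.Linear(v_dim, joint_dim), nn.ReLU(),
                                   nn.Linear(joint_dim, joint_dim))
        self.q_mlp = nn.Sequential(nn.Linear(q_dim, joint_dim), nn.ReLU(),
                                   nn.Linear(joint_dim, joint_dim))

    def forward(self, v, q):
        # L2-normalize both projections, then score with an inner product.
        v = F.normalize(self.v_mlp(v), dim=-1)
        q = F.normalize(self.q_mlp(q), dim=-1)
        return (v * q).sum(dim=-1)
```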
Location Module
Absolute location representation:
$$
l_{i}=\left[\frac{x_{tl}}{W}, \frac{y_{tl}}{H}, \frac{x_{br}}{W}, \frac{y_{br}}{H}, \frac{w \cdot h}{W \cdot H}\right]
$$
Relative location representation (offsets to surrounding objects of the same category):
$$
\delta l_{ij}=\left[\frac{\left[\Delta x_{tl}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{tl}\right]_{ij}}{h_{i}}, \frac{\left[\Delta x_{br}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{br}\right]_{ij}}{h_{i}}, \frac{w_{j} h_{j}}{w_{i} h_{i}}\right]
$$
Together:
$$
\widetilde{l}_{i}^{loc}=W_{l}\left[l_{i} ; \delta l_{i}\right]+b_{l}
$$
Location matching score:
$$
S\left(o_{i} | q^{loc}\right)=F\left(\widetilde{l}_{i}^{loc}, q^{loc}\right)
$$
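A sketch of both 5-d location encodings above; the corner-offset convention for $\Delta x$, $\Delta y$ is my reading of the formula.

```python
import torch

def location_feature(box, image_w, image_h):
    """Absolute location feature l_i for a box (x_tl, y_tl, x_br, y_br)."""
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return torch.tensor([x_tl / image_w, y_tl / image_h,
                         x_br / image_w, y_br / image_h,
                         (w * h) / (image_w * image_h)])

def relative_offset(box_i, box_j):
    """delta l_ij: box_j's corners relative to box_i, scaled by box_i's size."""
    xi, yi, xbi, ybi = box_i
    xj, yj, xbj, ybj = box_j
    wi, hi = xbi - xi, ybi - yi
    wj, hj = xbj - xj, ybj - yj
    return torch.tensor([(xj - xi) / wi, (yj - yi) / hi,
                         (xbj - xbi) / wi, (ybj - ybi) / hi,
                         (wj * hj) / (wi * hi)])
```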
Relationship Module
While the subject module deals with "in-box" details about the target object, some other expressions may involve its relationship with other "out-of-box" objects, e.g., "cat on chaise lounge". The relationship module is used to address these cases.
As in Fig. 5, given a candidate object $o_i$, we first look for its surrounding (up-to-five) objects $o_{ij}$ regardless of their categories.
We use the average-pooled C4 feature as the appearance feature $v_{ij}$ of each supporting object.
Relative location difference:
$$
\delta m_{ij}=\left[\frac{\left[\Delta x_{tl}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{tl}\right]_{ij}}{h_{i}}, \frac{\left[\Delta x_{br}\right]_{ij}}{w_{i}}, \frac{\left[\Delta y_{br}\right]_{ij}}{h_{i}}, \frac{w_{j} h_{j}}{w_{i} h_{i}}\right]
$$
Combined with the appearance feature:
$$
\widetilde{v}_{ij}^{rel}=W_{r}\left[v_{ij} ; \delta m_{ij}\right]+b_{r}
$$
The highest matching score over the surrounding objects is taken as the relationship score:
$$
S\left(o_{i} | q^{rel}\right)=\max_{j \neq i} F\left(\widetilde{v}_{ij}^{rel}, q^{rel}\right)
$$
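A sketch of the relationship scoring, reusing the `MatchingFunction` sketch above; the neighbor count K and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class RelationshipModule(nn.Module):
    """Scores o_i against q^rel via its best-matching surrounding object."""
    def __init__(self, app_dim, q_dim, out_dim=512):
        super().__init__()
        self.W_r = nn.Linear(app_dim + 5, out_dim)      # [v_ij; delta m_ij]
        self.match = MatchingFunction(out_dim, q_dim)   # F from the sketch above

    def forward(self, v_ctx, delta_m, q_rel):
        # v_ctx: (B, K, app_dim) pooled C4 features of up-to-K=5 neighbors
        # delta_m: (B, K, 5) relative location differences
        v_rel = self.W_r(torch.cat([v_ctx, delta_m], dim=-1))   # (B, K, out_dim)
        scores = self.match(v_rel, q_rel.unsqueeze(1))          # (B, K)
        return scores.max(dim=1).values                         # max over neighbors
```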
Loss Function
The overall weighted matching score for candidate object $o_i$ and expression $r$ is:
$$
S\left(o_{i} | r\right)=w_{subj} S\left(o_{i} | q^{subj}\right)+w_{loc} S\left(o_{i} | q^{loc}\right)+w_{rel} S\left(o_{i} | q^{rel}\right)
$$
We randomly sample two negative pairs $(o_i, r_j)$ and $(o_k, r_i)$, where $r_j$ is an expression describing some other object and $o_k$ is some other object in the same image:
$$
\begin{aligned}
L_{rank}=\sum_{i} &\left[\lambda_{1} \max \left(0, \Delta+S\left(o_{i} | r_{j}\right)-S\left(o_{i} | r_{i}\right)\right)\right. \\
&\left.+\lambda_{2} \max \left(0, \Delta+S\left(o_{k} | r_{i}\right)-S\left(o_{i} | r_{i}\right)\right)\right]
\end{aligned}
$$
Together:
$$
L=L_{subj}^{attr}+L_{rank}
$$
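A sketch of the weighted score combination and the two-term hinge loss; the margin $\Delta$ and $\lambda$ values are illustrative, not the paper's settings.

```python
import torch

def overall_score(w, s_subj, s_loc, s_rel):
    """S(o|r) = w_subj*S_subj + w_loc*S_loc + w_rel*S_rel.
    w: (B, 3) module weights; each s_*: (B,) per-module matching scores."""
    return w[:, 0] * s_subj + w[:, 1] * s_loc + w[:, 2] * s_rel

def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1, lam1=1.0, lam2=1.0):
    """Two-sided hinge loss over sampled negatives.
    s_pos:      S(o_i | r_i)  positive pair
    s_neg_expr: S(o_i | r_j)  same object, other expression
    s_neg_obj:  S(o_k | r_i)  other object, same expression"""
    loss = lam1 * torch.clamp(margin + s_neg_expr - s_pos, min=0) \
         + lam2 * torch.clamp(margin + s_neg_obj - s_pos, min=0)
    return loss.sum()
```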
Result
Referring Expression Comprehension
Segmentation from Referring Expression
Conclusion
Our modular attention network addresses variance in referring expressions by attending to both relevant words and visual regions in a modular framework, and by dynamically computing an overall matching score. We demonstrate our model's effectiveness on bounding-box-level and pixel-level comprehension, significantly outperforming state-of-the-art methods.
Thoughts
Link for code/model/dataset
Demo site: http://vision2.cs.unc.edu/refer/comprehension
Source code for training/evaluation: https://github.com/lichengunc/MAttNet
Datasets (RefCOCO, RefCOCO+, RefCOCOg): https://github.com/lichengunc/refer
References
[1] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. In ICML, 2017.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In NAACL, 2016.
[3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.
[4] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[8] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
[10] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
[11] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[12] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.
[13] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] J. Liu, L. Wang, and M.-H. Yang. Referring expression generation and comprehension via attributes. In ICCV, 2017.
[18] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.
[19] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[20] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
[22] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
[26] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[32] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.