HackMD version: https://hackmd.io/@85banmo0Q-SUjK3q1mdbQg/Bk3EEtVUB
Original paper: CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Journal/Conference: CVPR 2019 (Poster)
Authors: Runtao Liu, Chenxi Liu, Yutong Bai, Alan Yuille
Objective
Current referring expression datasets, used for referring object detection and referring image segmentation, suffer from bias, and most state-of-the-art models cannot easily be evaluated on their intermediate reasoning process. The authors therefore
build CLEVR-Ref+, a synthetic, diagnostic dataset that minimizes bias and supplies the ground-truth visual reasoning process.
propose IEP-Ref, a module network approach, and find that
the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step by step.
even though all training data contains at least one referred object, IEP-Ref correctly predicts no foreground when presented with false-premise referring expressions.
Related works
Referring Expression
focus on detection: [34]
focus on segmentation: [12, 20, 21, 26]
Dataset Bias and Diagnostic Datasets
referring expression datasets: [24, 18, 35]
report on the problem: [3] reported that performance when discarding the referring expression and relying solely on the image is significantly higher than random.
synthetic datasets: [15] proposed CLEVR for VQA; besides introducing extensions of it, this work evaluates state-of-the-art models and directly facilitates the diagnosis of visual reasoning.
CLEVR-Ref+ Dataset
uses the same scenes as CLEVR to generate 10 referring expressions for every image; each referring expression may refer to one or more objects in the scene.
generation
change the questions to referring expressions
change the answers to the referred objects, with a bounding box or segmentation mask as the output.
add ordinal and visible modules
“ordinal” (e.g. “The second woman from left”)
“visible” (e.g. “The barely seen backpack”)
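Implementation details of these new modules aren't spelled out in this summary; the snippet below is only a conceptual sketch of what the "ordinal" operation has to compute (pick the k-th currently attended object along a spatial direction). The function name and arrays are made up for illustration, not taken from the paper's code.

```python
import numpy as np

# Hypothetical illustration of "ordinal" semantics, not the paper's code:
# among the currently attended objects, keep only the k-th one from the left.
def ordinal_select(attended, x_coords, k):
    """attended: (N,) 0/1 array over scene objects; x_coords: (N,) positions;
    k: 1-based ordinal ("the second ... from left" -> k=2)."""
    idx = np.where(attended == 1)[0]        # currently attended objects
    order = idx[np.argsort(x_coords[idx])]  # sort them left-to-right
    out = np.zeros_like(attended)
    if k <= len(order):                     # false premise -> empty selection
        out[order[k - 1]] = 1
    return out

# "the second attended object from the left" among objects 0, 2, 3
print(ordinal_select(np.array([1, 0, 1, 1]), np.array([0.9, 0.1, 0.2, 0.5]), 2))
```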
procedure
1. Randomly choose a referring expression family.
2. Randomly choose a text template from this family.
3. Follow the functional program and select random values when encountering template parameters.
4. Reject when certain criteria fail, i.e., the sampled referring expression is inappropriate for the given scene; return when the entire functional program follows through (see the sketch below).
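A toy, self-contained version of this sampling loop. The scene format, templates, and filtering logic here are simplified stand-ins, not the actual clevr-refplus-dataset-gen code:

```python
import random

SCENE = [  # a scene is a list of objects with attributes
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]
FAMILIES = [{"templates": ["The {color} {shape}s", "Any {color} {shape}"]}]
COLORS = ["red", "blue", "green"]
SHAPES = ["cube", "sphere", "cylinder"]

def generate(scene, families, max_tries=100):
    for _ in range(max_tries):
        family = random.choice(families)               # 1. pick a family
        template = random.choice(family["templates"])  # 2. pick a template
        # 3. sample random values for the template parameters
        params = {"color": random.choice(COLORS), "shape": random.choice(SHAPES)}
        # the "functional program" here is just an attribute filter over the scene
        referred = [i for i, o in enumerate(scene)
                    if o["color"] == params["color"] and o["shape"] == params["shape"]]
        # 4. reject expressions inappropriate for this scene
        #    (CLEVR-Ref+ requires at least one referred object)
        if not referred:
            continue
        return template.format(**params), referred     # expression + ground truth
    return None

print(generate(SCENE, FAMILIES))
```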
Methodology
Model
IEP-Ref
reveals the intermediate reasoning process
highly related to Inferring and Executing Programs for Visual Reasoning (IEP)
Result
Overall Evaluation
Detection models are evaluated by accuracy (i.e., whether the prediction selects the correct bounding box among the given candidates).
Segmentation models are evaluated by Intersection over Union (IoU).
The overall results show that MAttNet and IEP-Ref perform much better, which suggests the importance of modeling compositionality within the referring expression.
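For reference, the segmentation metric is plain mask IoU; a minimal implementation follows (the convention of scoring 1.0 when both masks are empty, relevant to false-premise expressions, is my assumption, not stated in the paper):

```python
import numpy as np

# IoU between a predicted and a ground-truth binary segmentation mask:
# |intersection| / |union| of the two foreground regions.
def mask_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty (e.g. a no-foreground case)
        return 1.0
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(mask_iou(pred, gt))   # 0.5
```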
Basic Referring Ability for Different Types of Module
models do well at referring by direct description of object attributes
It seems that ordinality is the hardest concept to learn.
Step-By-Step Inspection of Visual Reasoning
the step-by-step segmentation masks provide the first direct and quantitative proof that neural modules behave as intended.
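To make the inspection concrete, here is a minimal PyTorch sketch of the trick, relying on the IEP-Ref property that every program step maps a feature map to a feature map. The module definitions are simplified placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

C = 128  # feature channels (illustrative)

segment_head = nn.Sequential(            # trained only on the final step's output
    nn.Conv2d(C, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),                 # 1-channel mask logits
)

program = nn.ModuleList([                # stand-ins for Filter/Relate/... modules
    nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
    for _ in range(4)
])

feat = torch.randn(1, C, 28, 28)         # stem features of the image
step_masks = []
for module in program:                   # execute the program step by step
    feat = module(feat)
    # attach the same head to this intermediate module's output
    step_masks.append(torch.sigmoid(segment_head(feat)))

print([m.shape for m in step_masks])     # one mask per reasoning step
```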
Conclusion
build the CLEVR-Ref+ dataset, which complements existing referring expression datasets.
The proposed method is useful for analyzing the inference process.
Thoughts
Synthetic dataset: if synthetic data isn't nearly identical to real-world data, models and conclusions built on it may not transfer, compromising the quality of decisions based on that data.
Link for code/model/dataset
Code for CLEVR-Ref+ Dataset Generation: https://github.com/ccvl/clevr-refplus-dataset-gen
Code for IEP-Ref Model: https://github.com/ccvl/iep-ref
Dataset: CLEVR-Ref+-v1.0 Dataset (16GB), CLEVR-Ref+-CoGenT-v1.0 Dataset (19GB)
References
[3] V. Cirik, L. Morency, and T. Berg-Kirkpatrick. Visual referring expression recognition: What do systems actually learn? In NAACL-HLT (2), pages 781–787. Association for Computational Linguistics, 2018.
[12] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 108–124. Springer, 2016.
[15] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR. IEEE Computer Society, 2017.
[18] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798. ACL, 2014.
[20] R. Li, K. Li, Y.-C. Kuo, M. Shu, X. Qi, X. Shen, and J. Jia. Referring image segmentation via recurrent refinement networks. In CVPR, pages 5745–5753. IEEE Computer Society, 2018.
[21] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. L. Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1280–1289. IEEE Computer Society, 2017.
[24] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20. IEEE Computer Society, 2016.
[26] E. Margffoy-Tuay, J. C. Pérez, E. Botero, and P. Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV (11), volume 11215 of Lecture Notes in Computer Science, pages 656–672. Springer, 2018.
[34] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR. IEEE Computer Society, 2018.
[35] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 69–85. Springer, 2016.