HackMD version: https://hackmd.io/@85banmo0Q-SUjK3q1mdbQg/Bk3EEtVUB
Original paper: CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions
Journal/Conference: CVPR 2019 (Poster)
Authors: Runtao Liu, Chenxi Liu, Yutong Bai, Alan Yuille
Objective
Current referring expression datasets, used for referring object detection and referring image segmentation, suffer from bias, and most state-of-the-art models cannot easily be evaluated on their intermediate reasoning process. The authors therefore
build CLEVR-Ref+, a synthetic, diagnostic dataset that minimizes bias and supplies the ground-truth visual reasoning process.
propose IEP-Ref, a module network approach, and find that
the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step by step.
even though all training data contains at least one referred object, IEP-Ref correctly predicts no foreground when presented with false-premise referring expressions.
Related works
Referring Expression
focus on detection: [34]
focus on segmentation: [12, 20, 21, 26]
Dataset Bias and Diagnostic Datasets
referring expression datasets: [24, 18, 35]
report on the problem: [3] reported that performance when discarding the referring expression and relying solely on the image is significantly higher than random.
synthetic datasets: [15] proposed CLEVR for VQA; besides introducing extensions of it, this work evaluates state-of-the-art models and directly facilitates the diagnosis of visual reasoning.
CLEVR-Ref+ Dataset
uses the same scenes as CLEVR to generate 10 referring expressions for every image; each referring expression may refer to one or more objects in the scene.
generation
change the questions to referring expressions
change the answers to the referred objects, with a bounding box or segmentation mask as the output.
add ordinal and visible modules
“ordinal” (e.g. “The second woman from left”)
“visible” (e.g. “The barely seen backpack”)
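Implementation details of these new modules aren't spelled out in this summary; the snippet below is only a conceptual sketch of what the "ordinal" operation has to compute (pick the k-th currently attended object along a spatial direction). The function name and arrays are made up for illustration, not taken from the paper's code.

```python
import numpy as np

# Hypothetical illustration of "ordinal" semantics, not the paper's code:
# among the currently attended objects, keep only the k-th one from the left.
def ordinal_select(attended, x_coords, k):
    """attended: (N,) 0/1 array over scene objects; x_coords: (N,) positions;
    k: 1-based ordinal ("the second ... from left" -> k=2)."""
    idx = np.where(attended == 1)[0]        # currently attended objects
    order = idx[np.argsort(x_coords[idx])]  # sort them left-to-right
    out = np.zeros_like(attended)
    if k <= len(order):                     # false premise -> empty selection
        out[order[k - 1]] = 1
    return out

# "the second attended object from the left" among objects 0, 2, 3
print(ordinal_select(np.array([1, 0, 1, 1]), np.array([0.9, 0.1, 0.2, 0.5]), 2))
```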
procedure
1. Randomly choose a referring expression family.
2. Randomly choose a text template from this family.
3. Follow the functional program and select random values when encountering template parameters.
4. Reject when certain criteria fail, i.e., the sampled referring expression is inappropriate for the given scene; return when the entire functional program follows through (see the sketch below).
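A toy, self-contained version of this sampling loop. The scene format, templates, and filtering logic here are simplified stand-ins, not the actual clevr-refplus-dataset-gen code:

```python
import random

SCENE = [  # a scene is a list of objects with attributes
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]
FAMILIES = [{"templates": ["The {color} {shape}s", "Any {color} {shape}"]}]
COLORS = ["red", "blue", "green"]
SHAPES = ["cube", "sphere", "cylinder"]

def generate(scene, families, max_tries=100):
    for _ in range(max_tries):
        family = random.choice(families)               # 1. pick a family
        template = random.choice(family["templates"])  # 2. pick a template
        # 3. sample random values for the template parameters
        params = {"color": random.choice(COLORS), "shape": random.choice(SHAPES)}
        # the "functional program" here is just an attribute filter over the scene
        referred = [i for i, o in enumerate(scene)
                    if o["color"] == params["color"] and o["shape"] == params["shape"]]
        # 4. reject expressions inappropriate for this scene
        #    (CLEVR-Ref+ requires at least one referred object)
        if not referred:
            continue
        return template.format(**params), referred     # expression + ground truth
    return None

print(generate(SCENE, FAMILIES))
```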
Methodology
Model
IEP-Ref
reveals the intermediate reasoning process
highly related to Inferring and Executing Programs for Visual Reasoning (IEP)
Result
Overall Evaluation
Detection models are evaluated by accuracy (i.e., whether the prediction selects the correct bounding box among the given candidates).
Segmentation models are evaluated by Intersection over Union (IoU).
The overall results show that MAttNet and IEP-Ref perform much better, which suggests the importance of modeling compositionality within the referring expression.
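For reference, the segmentation metric is plain mask IoU; a minimal implementation follows (the convention of scoring 1.0 when both masks are empty, relevant to false-premise expressions, is my assumption, not stated in the paper):

```python
import numpy as np

# IoU between a predicted and a ground-truth binary segmentation mask:
# |intersection| / |union| of the two foreground regions.
def mask_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty (e.g. a no-foreground case)
        return 1.0
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(mask_iou(pred, gt))   # 0.5
```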
Basic Referring Ability for Different Types of Module
models do well at referring by direct description of object attributes
It seems that ordinality is the hardest concept to learn.
Step-By-Step Inspection of Visual Reasoning
the step-by-step segmentation masks provide the first direct and quantitative proof that neural modules behave as intended.
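To make the inspection concrete, here is a minimal PyTorch sketch of the trick, relying on the IEP-Ref property that every program step maps a feature map to a feature map. The module definitions are simplified placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

C = 128  # feature channels (illustrative)

segment_head = nn.Sequential(            # trained only on the final step's output
    nn.Conv2d(C, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),                 # 1-channel mask logits
)

program = nn.ModuleList([                # stand-ins for Filter/Relate/... modules
    nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
    for _ in range(4)
])

feat = torch.randn(1, C, 28, 28)         # stem features of the image
step_masks = []
for module in program:                   # execute the program step by step
    feat = module(feat)
    # attach the same head to this intermediate module's output
    step_masks.append(torch.sigmoid(segment_head(feat)))

print([m.shape for m in step_masks])     # one mask per reasoning step
```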
Conclusion
build the CLEVR-Ref+ dataset, which complements existing referring expression datasets.
The proposed method is useful for analyzing the inference process.
Thoughts
Synthetic dataset: if synthetic data isn't nearly identical to real-world data, models and conclusions built on it may not transfer, compromising the quality of decisions based on that data.
Link for code/model/dataset
Code for CLEVR-Ref+ Dataset Generation: https://github.com/ccvl/clevr-refplus-dataset-gen
Code for IEP-Ref Model: https://github.com/ccvl/iep-ref
Dataset: CLEVR-Ref+-v1.0 Dataset (16GB), CLEVR-Ref+-CoGenT-v1.0 Dataset (19GB)
References
[3] V. Cirik, L. Morency, and T. Berg-Kirkpatrick. Visual referring expression recognition: What do systems actually learn? In NAACL-HLT (2), pages 781–787. Association for Computational Linguistics, 2018.
[12] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 108–124. Springer, 2016.
[15] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR. IEEE Computer Society, 2017.
[18] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798. ACL, 2014.
[20] R. Li, K. Li, Y.-C. Kuo, M. Shu, X. Qi, X. Shen, and J. Jia. Referring image segmentation via recurrent refinement networks. In CVPR, pages 5745–5753. IEEE Computer Society, 2018.
[21] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. L. Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1280–1289. IEEE Computer Society, 2017.
[24] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20. IEEE Computer Society, 2016.
[26] E. Margffoy-Tuay, J. C. Pérez, E. Botero, and P. Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV (11), volume 11215 of Lecture Notes in Computer Science, pages 656–672. Springer, 2018.
[34] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR. IEEE Computer Society, 2018.
[35] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 69–85. Springer, 2016.