peter0749 / object-referring


CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions #2

Open ghost opened 5 years ago


HackMD version: https://hackmd.io/@85banmo0Q-SUjK3q1mdbQg/Bk3EEtVUB

Original paper: CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions (+ Supplementary Material)

Journal/Conference: CVPR 2019 (Poster)

Authors: Runtao Liu, Chenxi Liu, Yutong Bai, Alan Yuille

Objective

Current referring expression datasets, used for referred-object detection and referring image segmentation, suffer from bias, and most state-of-the-art models cannot easily be evaluated on their intermediate reasoning process. The authors therefore

  1. build CLEVR-Ref+, a synthetic, diagnostic dataset designed to minimize bias and to provide ground-truth visual reasoning processes.
  2. propose IEP-Ref, a module network approach, and find
    • the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step.
    • even though every training example refers to at least one object, IEP-Ref can correctly predict no foreground when presented with false-premise referring expressions.
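The step-wise readout idea above can be sketched as a toy illustration (the per-object activation scheme, module names, and `segment` thresholding here are my assumptions, not the authors' implementation):

```python
import numpy as np

# Toy sketch: modules operate on a per-object activation vector; the same
# "Segment" readout can be attached after any module to expose a mask.

scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]

def filter_color(act, color):
    """Keep activations only for objects of the given color."""
    keep = np.array([o["color"] == color for o in scene], dtype=float)
    return act * keep

def filter_shape(act, shape):
    """Keep activations only for objects of the given shape."""
    keep = np.array([o["shape"] == shape for o in scene], dtype=float)
    return act * keep

def segment(act):
    """Readout head: threshold activations into a boolean mask."""
    return act > 0.5

act = np.ones(len(scene))                 # start: all objects active
act = filter_color(act, "red")
print(segment(act))                       # after step 1: red objects only
act = filter_shape(act, "sphere")
print(segment(act))                       # after step 2: the red sphere
# A false-premise expression ("the green ...") yields an all-False mask:
print(segment(filter_color(np.ones(len(scene)), "green")))
```

Because `segment` is just a readout over whatever activations a module produced, it can be applied after every step of the program, which is the mechanism that makes the reasoning chain inspectable.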

Related works

CLEVR-Ref+ Dataset

Uses the same scenes as CLEVR to generate 10 referring expressions for every image; each referring expression may refer to one or more objects in the scene.
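A toy illustration of how template-based generation can pair each expression with the full set of objects it refers to (the template, field names, and scene encoding are assumptions for illustration, not the actual pipeline):

```python
import random

scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]

def generate(scene, rng):
    """Instantiate one toy template and record every object it refers to."""
    color = rng.choice(sorted({o["color"] for o in scene}))
    expr = f"the {color} object(s)"
    # Ground truth: a single expression may refer to multiple objects.
    referred = [i for i, o in enumerate(scene) if o["color"] == color]
    return expr, referred

rng = random.Random(0)
for _ in range(3):
    print(generate(scene, rng))
```

Sampling attribute values that actually occur in the scene guarantees every generated expression refers to at least one object, which matches the training-data property noted in the Objective section.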

Generation

Methodology

Model

IEP-Ref

Reveals intermediate reasoning

Highly related to Inferring and Executing Programs for Visual Reasoning (IEP)

Result

Overall Evaluation
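Referring segmentation results of this kind are commonly scored with mask IoU; a minimal sketch (my own helper, not the paper's evaluation code):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two boolean segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks (e.g. a correctly rejected false-premise
    # expression) count as a perfect match.
    return float(inter) / union if union else 1.0

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt   = np.array([[1, 0], [0, 0]], dtype=bool)
print(mask_iou(pred, gt))  # 0.5
```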

Step-By-Step Inspection of Visual Reasoning

Conclusion

  1. build the CLEVR-Ref+ dataset, which complements existing referring expression datasets.
  2. evaluate state-of-the-art referring expression models on it.
  3. propose IEP-Ref, a module network approach that outperforms competing methods by a large margin.
  4. show that the neural modules work as expected.

Thoughts

  1. Evaluates existing state-of-the-art methods and the newly proposed IEP-Ref on CLEVR-Ref+.
  2. The proposed method is useful for analyzing the inference process.
  3. Synthetic dataset caveat: if synthetic data is not close to real-world data, conclusions and decisions based on it may not transfer.

Link for code/model/dataset

Code for CLEVR-Ref+ Dataset Generation: https://github.com/ccvl/clevr-refplus-dataset-gen

Code for IEP-Ref Model: https://github.com/ccvl/iep-ref

Dataset: CLEVR-Ref+-v1.0 Dataset (16GB) CLEVR-Ref+-CoGenT-v1.0 Dataset (19GB)


References

[3] V. Cirik, L. Morency, and T. Berg-Kirkpatrick. Visual referring expression recognition: What do systems actually learn? In NAACL-HLT (2), pages 781–787. Association for Computational Linguistics, 2018.

[12] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 108–124. Springer, 2016.

[15] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning.

[18] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798. ACL, 2014.

[20] R. Li, K. Li, Y.-C. Kuo, M. Shu, X. Qi, X. Shen, and J. Jia. Referring image segmentation via recurrent refinement networks. In CVPR, pages 5745–5753. IEEE Computer Society, 2018.

[21] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. L. Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1280–1289. IEEE Computer Society, 2017.

[24] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20. IEEE Computer Society, 2016.

[26] E. Margffoy-Tuay, J. C. Pérez, E. Botero, and P. Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV (11), volume 11215 of Lecture Notes in Computer Science, pages 656–672. Springer, 2018.

[34] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR. IEEE Computer Society, 2018.

[35] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV (2), volume 9906 of Lecture Notes in Computer Science, pages 69–85. Springer, 2016.