Graph-Structured Referring Expression Reasoning in the Wild

0. Paper

Journal/Conference: CVPR 2020
Title: Graph-Structured Referring Expression Reasoning in the Wild
Authors: Sibei Yang, Guanbin Li, Yizhou Yu
URL: https://openaccess.thecvf.com/content_CVPR_2020/html/Yang_Graph-Structured_Referring_Expression_Reasoning_in_the_Wild_CVPR_2020_paper.html

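1. What is it?

Proposes the scene graph guided modular network (SGMN), which performs reasoning over graph representations of the image and the text. Also proposes a new dataset, the Ref-Reasoning dataset, to address problems in existing datasets such as the imbalanced difficulty of samples.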

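2. What is impressive compared with prior work?

3. What is the key to the technique or method?

4. How was it shown to be effective?

5. Is there any discussion?

6. Which paper should be read next?

Notes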

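Abst
・Grounding requires structuring the expression and the image (so that they can be aligned) and understanding complex language
・Proposes the Scene graph guided modular network (SGMN), which reasons over a semantic graph and a scene graph guided by the linguistic expression
・Proposes a dataset for structured referring expression reasoning: the Ref-Reasoning dataset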

1 Intro
・The object being referred to: referent
・Referring expression: forms a paired structure with the referent
・Existing work: learns a matching score between the referring expression and the visual content (ignores the structure of the language and computes a holistic match against the objects in the image)
Ruotian Luo and Gregory Shakhnarovich. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7102–7111, 2017
Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4252–4261, 2018
Sibei Yang, Guanbin Li, and Yizhou Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4145–4154, 2019
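The holistic matching described above can be illustrated with a minimal sketch. It assumes the expression and every candidate object have already been embedded into the same vector space, and it uses cosine similarity as a stand-in for the learned matching score; the function names and the toy vectors are hypothetical and do not come from any of the cited papers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def holistic_grounding(expression_vec, object_vecs):
    """Score every object against the whole expression and pick the best.

    The expression is a single vector, so its internal structure
    (which phrase modifies which) is not used at all.
    """
    scores = [cosine(expression_vec, obj) for obj in object_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (assumed values, for illustration only).
expr = [1.0, 0.0, 1.0]
objects = [[0.0, 1.0, 0.0], [0.9, 0.1, 1.1], [1.0, 1.0, 0.0]]
best_index, all_scores = holistic_grounding(expr, objects)
print(best_index, all_scores)
```

Because the whole expression collapses into one vector, two expressions with the same words but different structure would receive the same scores, which is the limitation these notes point out.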

・Explores language structure using self-attention (ignoring syntactic information)
R. Hong, D. Liu, X. Mo, X. He, and H. Zhang. Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019
・Use of a DAG structure:
Sibei Yang, Guanbin Li, and Yizhou Yu. Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

→ Proposes a reasoning model that parses the linguistic information and uses the language structure as a cue: Scene Graph guided modular network (SGMN)
・Parses the input image into a directed graph (image)
・Parses the expression into a language scene graph (text) (ref. 16, 21)
Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, and Fanglin Wang. Referring expression grounding by marginalizing scene graph likelihood, 2019.
・Reasoning with a graph attention mechanism
Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8376–8384, 2019.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016

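Datasets are also important
Problems with existing datasets:
・The difficulty of samples is imbalanced
・Nothing is provided for the intermediate reasoning process (so interpretable models cannot be encouraged)
→ The following work proposes a dataset of simple 3D shapes to address these limitations, but it does not generalize to real-world scenes
Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L Yuille. Clevr-ref+: Diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4185–4194, 2019
→ About the dataset in this work:
・Builds a large-scale dataset called Ref-Reasoning
・Generates semantically rich expressions over the image scene graphs (thanks to the variety of templates, among other things)
・Annotations for the intermediate steps are obtained automatically?
・Dataset balance is also addressed by using uniform sampling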
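One way to picture balancing by uniform sampling is to group samples by some difficulty proxy and draw the same number from each group. The sketch below is only an illustration under assumptions: the grouping key (number of nodes in the expression), the field names, and the `balanced_sample` helper are hypothetical and are not the procedure used to build Ref-Reasoning.

```python
import random
from collections import defaultdict


def balanced_sample(samples, key, per_group, seed=0):
    """Draw up to `per_group` samples uniformly at random from each group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)

    picked = []
    for group_key in sorted(groups):
        members = groups[group_key]
        k = min(per_group, len(members))  # small groups contribute all they have
        picked.extend(rng.sample(members, k))
    return picked


# Toy data: eight "easy" expressions (1 node) and two "hard" ones (3 nodes).
samples = [{"expr": f"e{i}", "num_nodes": 1} for i in range(8)]
samples += [{"expr": f"h{i}", "num_nodes": 3} for i in range(2)]

balanced = balanced_sample(samples, key=lambda s: s["num_nodes"], per_group=2)
print([s["expr"] for s in balanced])
```

After sampling, easy and hard expressions appear in equal numbers even though the raw pool is dominated by easy ones.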

Contributions of this work
・Proposes a model that reasons using a semantic graph and a scene graph
・Builds Ref-Reasoning, which contains many semantically rich expressions

2 Related work
2.1 Grounding Referring expression
・Analyzes the semantic components as subject-relation-object triplets
Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1115–1124, 2017
・Learns subject-location-relation relationships
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.

2.2 Dataset Bias and Solutions
Discussion of dataset bias in grounding referring expressions
・Because of dataset bias, models end up learning only shallow correlations:
Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. Visual referring expression recognition: What do systems actually learn? arXiv preprint arXiv:1805.11818, 2018
・Proposes the CLEVR-Ref+ dataset to address dataset bias. However, its scenes consist of simple 3D shapes and it does not generalize:
Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L Yuille. Clevr-ref+: Diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4185–4194, 2019
→ This work builds a new dataset using the annotations of the Visual Genome dataset and of the GQA dataset

3 Approach: SGMN (scene graph guided modular network)

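3.1 Scene Graph Representations
Two kinds of graphs are used: the scene graph of the image and the language scene graph. They are structured so that their nodes and edges correspond to each other.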

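3.1.1 Image Semantic Graph (O)
Objects in the image become nodes, and the relations between them are represented as edges
→ As for features, each object's representation is obtained with a CNN, and the bounding box and similar information also serve as features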

3.1.2 Language Scene graph (S)
Nouns and noun phrases become nodes, and the relations between them (prepositions and verbs) are represented as edges
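The two graphs in 3.1.1 and 3.1.2 can be pictured with a small data-structure sketch. Everything here is an assumption made for illustration: the class and field names, the `(x, y, w, h)` box layout, the subject-to-object edge direction, and the toy values are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ImageNode:
    """One detected object in the image semantic graph."""
    label: str
    box: Tuple[float, float, float, float]  # assumed layout: x, y, w, h
    feature: List[float] = field(default_factory=list)  # e.g. a CNN feature


@dataclass
class LanguageNode:
    """One noun or noun phrase in the language scene graph."""
    phrase: str


@dataclass
class Edge:
    """Directed edge; assumed to point from subject to object."""
    src: int
    dst: int
    relation: str


@dataclass
class Graph:
    nodes: list
    edges: List[Edge] = field(default_factory=list)

    def out_degree(self, index: int) -> int:
        """Number of edges leaving the node at `index`."""
        return sum(1 for e in self.edges if e.src == index)


# Image side: three objects and the relations between them (toy values).
image_graph = Graph(
    nodes=[
        ImageNode("cup", (40.0, 60.0, 20.0, 25.0)),
        ImageNode("table", (10.0, 80.0, 200.0, 90.0)),
        ImageNode("chair", (220.0, 70.0, 60.0, 110.0)),
    ],
    edges=[Edge(0, 1, "on"), Edge(2, 1, "next to")],
)

# Language side: "the cup on the table".
language_graph = Graph(
    nodes=[LanguageNode("the cup"), LanguageNode("the table")],
    edges=[Edge(0, 1, "on")],
)

print(language_graph.out_degree(0), language_graph.out_degree(1))
```

Using the same `Graph` container for both sides makes the node-to-node and edge-to-edge correspondence mentioned in 3.1 easy to see.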

3.2 Structured reasoning
Designs the reasoning order and the reasoning rules for the nodes and edges of the graph
・Each node: AttendNode module
・Each edge: reasoning with AttendNode, AttendRelation, Transfer, and so on
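A reasoning order that handles zero out-degree nodes first can be sketched as a reverse topological traversal: a node becomes ready once every node it points to has been processed. This is a minimal sketch under the assumption that edges point from subject to object; the `reasoning_order` function is hypothetical and only shows the ordering, not the neural modules themselves.

```python
from collections import deque


def reasoning_order(num_nodes, edges):
    """Return node indices so that every node comes after all nodes it points to.

    `edges` is a list of (src, dst) pairs. Nodes with zero out-degree are
    processed first; a node is released once all of its outgoing targets
    have been processed.
    """
    remaining_out = [0] * num_nodes
    incoming = [[] for _ in range(num_nodes)]  # incoming[dst] -> list of src
    for src, dst in edges:
        remaining_out[src] += 1
        incoming[dst].append(src)

    ready = deque(i for i in range(num_nodes) if remaining_out[i] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for src in incoming[node]:
            remaining_out[src] -= 1
            if remaining_out[src] == 0:
                ready.append(src)

    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle, no valid reasoning order")
    return order


# "the cup on the table next to the window"
# nodes: 0 = cup, 1 = table, 2 = window; edges: cup -on-> table -next to-> window
order = reasoning_order(3, [(0, 1), (1, 2)])
print(order)  # [2, 1, 0]
```

In this toy expression the window is resolved first, then the table relative to the window, and finally the cup, so the node that names the referent is reached last.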

3.2.1 Reasoning process
Explains the reasoning order. Attention maps over the nodes of the image semantic graph are learned based on the language scene graph. The nodes targeted are those with no outgoing edges (zero out-degree nodes).

3.2.2 Neural Modules
... not well understood yet

3.3 Loss Function
