Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

0. Paper

Journal/Conference: CVPR 2020
Title: Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Authors: Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
URL: https://openaccess.thecvf.com/content_CVPR_2020/html/Chen_Say_As_You_Wish_Fine-Grained_Control_of_Image_Caption_Generation_CVPR_2020_paper.html

1. What is it?

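For the task of generating captions from images, the image is represented as an ASG (Abstract Scene Graph) and the caption is generated from it. This makes caption generation flexible: details such as object attributes can be changed freely.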

2. What is impressive compared with prior work?

By replacing the image with a more abstract graph, caption generation became controllable.

3. What is the key to the technique or method?

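Attribute nodes make it possible to flexibly express, for example, how detailed a description should be.
Proposes a way to take graph information into account from both the content and the flow perspectives; in particular, the attention computation that considers graph adjacency (Eqs. 8 and 9) is novel.
(Figure: overall view of the model)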

4. How was its effectiveness validated?

5. Is there any discussion?

Changing the graph makes flexible caption generation possible and increases diversity.

6. Which papers should be read next?

Notes

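Abstract
Generating descriptions that follow the user's intent is difficult.
To express user intent at a fine-grained level, the Abstract Scene Graph (ASG) is proposed.
ASG: a directed graph made of abstract nodes (object, attribute, relationship), from which captions are generated.
Using the ASG makes controllable captioning possible.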

1 Introduction
Humans adjust how coarsely or finely they describe the content of an image.
Passive caption generation has little diversity and produces mediocre descriptions:
Qingzhong Wang and Antoni B Chan. Describing like humans: on diversity in image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4195–4203, 2019
This work uses the ASG to generate captions.

Caption generation that focuses on controlling the expression style:
Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. Stylenet: Generating attractive visual captions with styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3137–3146, 2017.
Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, and Hanqing Lu. Mscap: Multi-style image captioning with unpaired stylized text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4204–4213, 2019
Alexander Mathews, Lexing Xie, and Xuming He. Semstyle: Learning to generate stylised image captions using unaligned text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8591–8600, 2018

Caption generation that controls the described content:
Image regions: Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016
Objects: Yue Zheng, Yali Li, and Shengjin Wang. Intention oriented image captions with guiding objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019
Part-of-speech tags: Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G. Schwing, and David Forsyth. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019

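Problem: control that relies only on a single label or a set of image regions has limits.
This work: proposes the ASG (a graph described by object, attribute, and relationship nodes) for controllable image caption generation, and generates captions from the ASG (ASG2Caption).
・ASG nodes carry no semantic labels, so an encoder is proposed to recognize the role of each node.
・The description order is determined by how the nodes are connected.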

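Contributions
・Proposes a caption generation method that is controllable at a fine-grained level by using graphs.
・Captions are generated with a role-aware graph encoder and a language decoder for graphs.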

2 Related Works
2.1 Image Captioning
Attentive image captioning: methods that generate words dynamically from the relevant parts of the image
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375–383, 2017.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pages 2048–2057, 2015

Optimization with reinforcement learning to reduce the mismatch with evaluation metrics:
Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, pages 873–881, 2017
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024, 2017

Adoption of semantic concepts:
Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton Van Den Hengel. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203–212, 201
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5630–5639, 2017

Use of external information:
Harsh Agrawal, Karan Desai, Xinlei Chen, Rishabh Jain, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, 2019
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018

Adoption of scene graphs:
Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10685–10694, 2019.
Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision, pages 684–699, 2018

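2.2 Controllable Image Caption Generation
Same as the description in the Introduction.
This work generates captions under control over questions such as: how many associative attributes should be used? Should other objects be included? In what order should things be described?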

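3 Abstract Scene Graph (ASG)
Nodes: object node o, attribute node a, relationship node r
l: the number of attributes attached to an object
The graph G captures the relationships between objects.
ASGs can also be generated automatically for caption generation (the method is in the Appendix).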
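
To make the node and edge layout concrete, here is a minimal sketch of an ASG as a plain data structure. The class name `AbstractSceneGraph`, its fields, the box format `(x1, y1, x2, y2)`, and the object-to-attribute edge direction are all assumptions made for illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class AbstractSceneGraph:
    """Hypothetical container for an ASG; not the authors' data structure."""
    roles: list = field(default_factory=list)   # "object" | "attribute" | "relationship"
    boxes: list = field(default_factory=list)   # grounding box per node, or None
    edges: list = field(default_factory=list)   # directed (source, target) pairs

    def _add_node(self, role, box=None):
        self.roles.append(role)
        self.boxes.append(box)
        return len(self.roles) - 1

    def add_object(self, box, num_attributes=0):
        """Add an object node plus `num_attributes` abstract attribute nodes."""
        obj = self._add_node("object", box)
        for _ in range(num_attributes):
            attr = self._add_node("attribute", box)
            self.edges.append((obj, attr))  # assumed direction: object -> attribute
        return obj

    def add_relationship(self, subject, obj):
        """Link two object nodes through an abstract relationship node."""
        rel = self._add_node("relationship")
        self.edges.append((subject, rel))
        self.edges.append((rel, obj))
        return rel


# Example: one object with two attributes, related to a second object.
asg = AbstractSceneGraph()
person = asg.add_object(box=(10, 20, 110, 220), num_attributes=2)
bike = asg.add_object(box=(60, 120, 260, 300))
asg.add_relationship(person, bike)
```

Note that the nodes only carry a role and a region, never a word: what the attribute or relationship actually says is left to the captioning model.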

4 The ASG2Caption model
Generates a sentence that is consistent with the given ASG G.

4.1 Role-aware Graph Encoder
Converts the ASG G into embeddings. It consists of two components:
role-aware node embedding: to understand the role of each node
MR-GCN: to take contextual encoding into account
(MR-GCN: Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018)

・Role-aware Node Embedding: embeds the node information
The role-aware node embedding x is computed from the visual features that correspond to the node (for a relationship node, the features of the union bounding box of the two objects).
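
A small sketch of the two ingredients mentioned here: the union bounding box used for relationship nodes, and a role-dependent modulation of the visual feature. The function names, the `(x1, y1, x2, y2)` box format, and the element-wise product with a per-role vector are assumptions for illustration; in a real model the role vectors would be learned parameters.

```python
def union_box(box_a, box_b):
    """Smallest box (x1, y1, x2, y2) that covers both input boxes."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def role_aware_embedding(visual_feature, role, role_vectors):
    """Modulate a visual feature by the vector of its role (assumed form).

    The same region feature yields different embeddings depending on
    whether the node acts as an object, an attribute, or a relationship.
    """
    return [v * w for v, w in zip(visual_feature, role_vectors[role])]


# Toy role vectors standing in for learned parameters.
toy_role_vectors = {
    "object": [1.0, 1.0],
    "attribute": [0.5, 2.0],
    "relationship": [2.0, 0.5],
}

relationship_region = union_box((0, 0, 2, 2), (1, 1, 3, 4))
attribute_embedding = role_aware_embedding([1.0, 2.0], "attribute", toy_role_vectors)
```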

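・Multi-relational Graph Convolutional Network: updates node information using the surrounding nodes
MR-GCN is adopted to capture the contextual information of the graph.
Each node is learned using the information of its neighboring nodes.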
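
The neighbor-based update can be sketched as one relational graph convolution layer: each node keeps a self-transformed copy of its feature and adds the mean of the transformed features of its neighbors, with a separate weight matrix per edge type. This is a generic sketch in plain Python, assuming mean normalization and a ReLU; the relation names and weight values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mr_gcn_layer(node_features, typed_edges, w_self, w_relation):
    """One multi-relational GCN layer.

    node_features: list of feature vectors, one per node
    typed_edges:   list of (source, target, relation_name)
    w_self:        weight matrix applied to the node itself
    w_relation:    dict mapping relation_name -> weight matrix
    """
    # Group incoming neighbors by (target node, relation type).
    incoming = {}
    for source, target, relation in typed_edges:
        incoming.setdefault((target, relation), []).append(source)

    updated = []
    for i, feature in enumerate(node_features):
        hidden = matvec(w_self, feature)
        for relation, weight in w_relation.items():
            neighbors = incoming.get((i, relation), [])
            for j in neighbors:
                message = matvec(weight, node_features[j])
                # Average the messages coming through this relation type.
                hidden = [h + m / len(neighbors) for h, m in zip(hidden, message)]
        updated.append([max(0.0, h) for h in hidden])  # ReLU
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 2.0]]
edges = [(0, 1, "subject_to_relationship")]  # hypothetical relation name
encoded = mr_gcn_layer(features, edges, identity, {"subject_to_relationship": identity})
```

Stacking several such layers lets information travel further than the immediate neighbors.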

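4.2 Language Decoder for Graphs
Converts the encoded G into a caption.
An attention mechanism that takes the semantics and the structure of the graph into account.
A graph updating mechanism that keeps track of whether content has already been described.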

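・Overview of the Decoder
As in Bottom-up and Top-down, the decoder is built from a two-layer LSTM (attention LSTM + language LSTM).
The graph-based attention yields the context vector z, and the language LSTM produces the (hidden) output used to update the nodes and the next word.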

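・Graph-based Attention Mechanism
Two attentions take the semantics and the graph structure into account: graph content attention + graph flow attention.
graph content attention: ordinary attention that ignores the connections between nodes
graph flow attention: uses a flow-based graph that differs from the original ASG; a start symbol is assigned, and object and attribute nodes influence each other bidirectionally
→ The final attention is computed from the weights of these two attentions.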
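
A simplified sketch of how the two attentions could be combined. It assumes dot-product scoring for the content attention, a single-step move along the flow edges for the flow attention, and a scalar mixing weight `beta`; the paper's own formulation (Eqs. 8 and 9) is richer, so this only illustrates the idea of mixing a structure-blind and a structure-aware distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def content_attention(query, node_features):
    """Attention that ignores graph connectivity (dot-product scoring assumed)."""
    return softmax([sum(q * x for q, x in zip(query, node)) for node in node_features])


def flow_attention(previous_attention, flow_edges):
    """Move the previous attention one step along the directed flow edges."""
    successors = {}
    for source, target in flow_edges:
        successors.setdefault(source, []).append(target)

    moved = [0.0] * len(previous_attention)
    for source, weight in enumerate(previous_attention):
        targets = successors.get(source, [])
        if not targets:
            moved[source] += weight  # no outgoing edge: the attention stays put
        for target in targets:
            moved[target] += weight / len(targets)
    return moved


def graph_attention(query, node_features, previous_attention, flow_edges, beta):
    """Mix both attentions; `beta` is a hypothetical gate value in [0, 1]."""
    content = content_attention(query, node_features)
    flow = flow_attention(previous_attention, flow_edges)
    return [beta * c + (1.0 - beta) * f for c, f in zip(content, flow)]


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
chain = [(0, 1), (1, 2)]
alpha = graph_attention([0.0, 0.0], nodes, [1.0, 0.0, 0.0], chain, beta=0.5)
```

With an uninformative query the content attention is uniform, so the mixed distribution peaks on node 1, the successor of the previously attended node.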

・Graph Updating Mechanism
To update the graph representation, a visual sentinel gate is introduced that adaptively modulates the attention:
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 375–383, 2017
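
One way such a gated update could look is an erase-then-add step: a node is changed in proportion to how strongly it was attended, and the whole update is scaled by a gate so that steps producing non-visual words leave the graph untouched. The erase/add form, the function name, and all parameter names are assumptions of this sketch, not a transcription of the paper.

```python
def update_nodes(node_features, attention, gate, erase, add):
    """Gated erase-then-add update of node features (assumed form).

    attention: weight per node from the current decoding step
    gate:      scalar in [0, 1]; 0 disables the update entirely
    erase:     per-dimension erase strengths in [0, 1]
    add:       per-dimension values written back into attended nodes
    """
    updated = []
    for feature, weight in zip(node_features, attention):
        strength = gate * weight
        updated.append([
            x * (1.0 - strength * e) + strength * a
            for x, e, a in zip(feature, erase, add)
        ])
    return updated


before = [[1.0, 1.0], [1.0, 1.0]]
after = update_nodes(before, attention=[1.0, 0.0], gate=1.0,
                     erase=[1.0, 0.0], add=[0.0, 0.0])
```

Only the attended node loses its first dimension; the unattended node is unchanged, which is how already-described content can be marked as used.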

4.3 Training: cross-entropy is used as the loss function.
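
As a reminder of what this loss computes for a caption, the sketch below sums the negative log-probability that the model assigns to each ground-truth word. The function name and the toy two-word vocabulary are illustrative assumptions.

```python
import math


def caption_cross_entropy(step_distributions, target_word_ids):
    """Sum of -log p(target word) over all decoding steps."""
    return -sum(
        math.log(distribution[word_id])
        for distribution, word_id in zip(step_distributions, target_word_ids)
    )


# Two decoding steps over a toy vocabulary of two words.
toy_distributions = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
loss = caption_cross_entropy(toy_distributions, toy_targets)
```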

Figure 5: changing the structure of the graph also changes the generated text.
