LXMERT: Learning Cross-Modality Encoder Representations from Transformers

0. Paper

Journal/Conference: EMNLP 2019
Title: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Authors: Hao Tan, Mohit Bansal
URL: https://arxiv.org/abs/1908.07490

1. What is it?

Proposes a Transformer-based model that aligns language and images. It consists of three encoders: a language encoder, an object encoder, and a cross-modality encoder that learns from the two through co-attention.

The model is pre-trained on five tasks and, when evaluated on datasets such as VQA, achieves high accuracy.

2. What is impressive compared with prior work?

It proposes a Transformer-based model that aligns images with text.

3. What is the key to the technique or method?

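It extends and applies techniques that had so far been used within a single field, either images or text (BERT, for example), to propose a Transformer-based model that properly aligns images with text.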

4. How was it validated?

Using three datasets, the proposed model is compared with existing methods and with ablated variants to examine its effectiveness and which pre-training tasks contributed.

5. Is there any discussion?

6. Which paper should be read next?

Notes

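・Proposes LXMERT, a framework for learning the relationship between vision and text
・A Transformer model composed of three encoders: object, language, and cross-modality
・Image-sentence pairs are pre-trained with tasks such as masked language modeling, masked object prediction, and image question answering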

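1 Introduction
Much single-modality research has been done toward understanding text and vision.
・Visual: residual networks, etc.
・Text: BERT, etc.
Most work is single-modality, and research that combines vision and language is still limited.
→ Focusing on learning the interaction between vision and language, the model consists of an object relationship encoder, a language encoder, and a cross-modality encoder.
Pre-training uses five tasks: (1) masked cross-modality language modeling, (2) masked object prediction via RoI-feature regression, (3) masked object prediction via detected-label classification, (4) cross-modality matching, and (5) image question answering.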

The model is evaluated on VQA and GQA.
GQA: Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for compositional question answering over real-world images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
It is also evaluated by fine-tuning on NLVR2.
NLVR2: Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

2 Model architecture
Built from self-attention and cross-attention layers.

・Embedding
word-level sentence embedding: embeddings are created with a WordPiece tokenizer (as in BERT)
object-level image embeddings: a modified version of Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086
・Encoders (image, text)
The same self-attention mechanism is applied to each input → bi-directional attention then takes both sets of features into account
・Output
A [CLS] token is introduced, and the position corresponding to this token is used as the cross-modality output. There are three outputs: text, image, and cross-modality
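The bi-directional attention between the language and image sides can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections, residual connections, layer normalization, and the self-attention and feed-forward sub-layers that follow, and the names `attend` and `cross_modality_step` are hypothetical. It only shows the core idea that each modality uses its own vectors as queries and the other modality's vectors as context.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention (hypothetical helper).

    Each query vector is replaced by a weighted average of the context
    vectors; the context serves as both keys and values because this
    sketch has no learned projections.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def cross_modality_step(lang, vis):
    """One bi-directional cross-attention step (hypothetical helper).

    Language attends to vision and vision attends to language; both
    directions read the same, not yet updated, inputs.
    """
    new_lang = attend(lang, vis)
    new_vis = attend(vis, lang)
    return new_lang, new_vis


# Toy inputs: 3 word vectors and 2 object vectors, all of dimension 4.
lang = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
vis = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
new_lang, new_vis = cross_modality_step(lang, vis)
print(len(new_lang), len(new_vis))
```

Each side keeps its own sequence length (three words, two objects), but every output vector is now a mixture of vectors from the other modality, which is how the two feature sets come to inform each other.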

3 Pre-training
・Language: a masked-token prediction task, as in BERT
・Vision: objects are masked, and the model learns to predict the masked object's features (RoI-Feature Regression) and its label (Detected-Label Classification). The labels are the outputs of Faster R-CNN: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99
・Cross-Modality: a task that predicts whether an image and a sentence match (cross-modality matching), and Image Question Answering
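As a rough illustration of how training examples for these tasks might be prepared, the sketch below masks words, masks objects, and builds a matching example. The 0.15 masking rate, the zeroing of masked RoI features, and the 0.5 sentence-replacement rate follow the LXMERT paper; everything else is an assumption made for illustration. In particular, the function names are hypothetical, features are plain Python lists, and a masked word is always replaced by `[MASK]`, which is simpler than BERT's replacement scheme.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # masking rate reported in the LXMERT paper


def mask_words(tokens, rng, prob=MASK_PROB):
    """Replace each token with [MASK] with probability `prob`.

    Returns the masked tokens and the prediction targets
    (None where the token was left unmasked).
    """
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK_TOKEN)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def mask_objects(features, labels, rng, prob=MASK_PROB):
    """Zero out each RoI feature vector with probability `prob`.

    For a masked object the target is (original feature, detected label):
    the feature is the regression target and the label is the
    classification target. Unmasked objects get a target of None.
    """
    masked, targets = [], []
    for feat, label in zip(features, labels):
        if rng.random() < prob:
            masked.append([0.0] * len(feat))
            targets.append((list(feat), label))
        else:
            masked.append(list(feat))
            targets.append(None)
    return masked, targets


def make_matching_example(sentence, other_sentences, rng, replace_prob=0.5):
    """Build a cross-modality matching example for one image.

    With probability `replace_prob` the sentence is swapped for one drawn
    from `other_sentences` (label 0); otherwise it is kept (label 1).
    """
    if rng.random() < replace_prob:
        return rng.choice(other_sentences), 0
    return sentence, 1


rng = random.Random(0)  # fixed seed so the demo is repeatable
words, word_targets = mask_words(["a", "dog", "chases", "a", "ball"], rng)
objects, object_targets = mask_objects(
    [[0.2, 0.9], [0.7, 0.1]], ["dog", "ball"], rng
)
sentence, is_match = make_matching_example(
    "a dog chases a ball", ["a cat sleeps on a sofa"], rng
)
print(words, objects, sentence, is_match)
```

Returning `None` for unmasked positions keeps the targets aligned with the inputs, so a loss would only be computed where something was actually hidden.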

Pre-training uses five datasets: MS COCO, Visual Genome, VQA, GQA, and VG-QA.

4 Experiments
Evaluation experiments are run on three datasets: VQA v2.0, GQA, and NLVR2. The model scores highly on all of them.

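5 Analysis
・The effectiveness of each pre-training task is analyzed through an ablation study
・On VQA, the proposed pre-training approach is more effective than data augmentation

7 Conclusion
The model is built on Transformer encoders and the authors' new cross-modality encoder. It shows state-of-the-art results on two image QA datasets (VQA and GQA) and a 22% (absolute) improvement on NLVR2, a challenging visual-reasoning dataset.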
