Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

0. Paper

Journal/Conference: CVPR 2020
Title: Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
Authors: Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach
URL: https://openaccess.thecvf.com/content_CVPR_2020/html/Hu_Iterative_Answer_Prediction_With_Pointer-Augmented_Multimodal_Transformers_for_TextVQA_CVPR_2020_paper.html

1. What is it?

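For the task of answering VQA questions by using the text in the image, the model generates answers with modules that extract features from the image and the question, together with a dynamic pointer network.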

2. What is better than prior work?

It embeds text and image features in the same space, which lets it capture the relations between the different modalities.

3. What is the key to the technique or method?

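It also uses the text in the image together with the spatial information around it, so it can interpret text that is hard to read from the image.
Basically, it is a combination of a Transformer and the Copy Mesh model.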

4. How was its effectiveness verified?

5. Is there any discussion?

Examples where the model reads the text in the image and answers the question appropriately.

6. Which papers should be read next?

・Dynamic pointer network: DeepCopy: Grounded response generation with hierarchical pointer networks
・OCR system used (Rosetta OCR system): Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.

Notes

When fake news models are treated as a multimodal problem, the fusion of the individual modalities is still far from sufficient.
DeepCopy: Grounded response generation with hierarchical pointer networks

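Abstract
A study of the TextVQA task, which requires reading and understanding the text in images.
Basic modules so far: prediction based on custom pairwise fusion mechanisms between a pair of two modalities
→ To embed the different modalities in a common semantic space, attention is applied to model the context between modalities: this fuses the different modalities naturally
The answer is formed through multi-step prediction using a dynamic pointer network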

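1 Introduction
TextVQA task: questions that explicitly require understanding and reasoning about the text in the image
Three things have to be interpreted: the input question, the visual objects in the image, and the text in the image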

Methods based on an OCR approach have been proposed
・LoRRA: dynamically adds the OCR vocabulary to the answer classifier
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
・Inserts OCR tokens into the output space of the VQA model
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In Proceedings of the International Conference on Document Analysis and Recognition, 2019.

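Problems with previous models
・They rely on pairwise multimodal fusion mechanisms between two modalities, which limits the types of interaction
・They treat answer prediction as a single-step classification problem: an either-or choice such as copying from the image or selecting an answer from a set
・Generating complex answers is difficult
・Text in the image is missed (because of fonts, spatial separation, and so on)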

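Proposed model
Transformer-based Multimodal Multi-Copy Mesh (M4C) model + answer generation with a dynamic pointer
・Fuses the three modalities and projects the embeddings from each modality into the same space
・Obtains relational representations of each entity using self-attention
・Generates the answer iteratively over multiple steps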

2 Related works
VQA based on reading and understanding image text
TextVQA: a conceptually similar model: OCR tokens are added to both the input and the output space of the VQA model
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In Proceedings of the International Conference on Document Analysis and Recognition, 2019.
Simply adding OCR inputs to the VQA model:
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. ICDAR 2019 competition on scene text visual question answering. arXiv preprint arXiv:1907.00490, 2019.
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

Multimodal learning in vision-and-language tasks
Attention over one modality conditioned on another modality
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.
→ Recent work performs the fusion with transformers
In this work, the entities of each modality are projected into a joint embedding space, and all entities are handled by a transformer-based model (joint embedding + self-attention)

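Dynamic copying with pointers
In QA and similar tasks, the output is produced by copying from the input (here, from the image)
Prior work on TextVQA tasks copies OCR tokens by adding their indices to the classifier outputs
→ Limitation: only a single token is copied
A permutation-invariant pointer network is used to remove the dependence on token order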

3 M4C
Multimodal Multi-Copy Mesh (M4C): based on a pointer-augmented multimodal transformer architecture
Composed of three modalities → the features from the three modalities are embedded in a common space
・Question word features
・Visual object features
・OCR token features
→ Fed to a multi-layer transformer: the answer is predicted by iterative decoding with a dynamic pointer network

3.1 A common embedding space for all modalities
・Embedding of question words
BERT is used to turn the question words into vectors

・Embedding of detected objects
A set of M objects is obtained with Faster R-CNN → visual features of the M objects and features of their boxes
Layer normalization and a linear transformation are applied to produce the output
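The object embedding described above can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the toy dimensions, the random weights, and the helper names (`linear`, `layer_norm`, `embed_object`, `random_matrix`) are hypothetical, the layer norm has no learned gain or bias, and the order (project each feature, normalize it, then sum the two branches) is an assumed reading of the paper rather than something stated in these notes.

```python
import math
import random

def linear(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def embed_object(W_fr, W_box, x_fr, x_box):
    """Assumed form: LN(W_fr x_fr) + LN(W_box x_box)."""
    appearance = layer_norm(linear(W_fr, x_fr))
    location = layer_norm(linear(W_box, x_box))
    return [a + b for a, b in zip(appearance, location)]

def random_matrix(rows, cols, rng):
    """Random stand-in for learned projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 8                                            # toy embedding size (hypothetical)
x_fr = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy appearance feature of one object
x_box = [0.1, 0.2, 0.6, 0.9]                     # normalized box coordinates of that object
x_obj = embed_object(random_matrix(d, 16, rng), random_matrix(d, 4, rng), x_fr, x_box)
print(len(x_obj))
```

Normalizing each branch separately keeps the appearance and location terms on a comparable scale before they are summed, which is presumably why the normalization is applied per feature type.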

・Embedding of OCR tokens with rich representations.
OCR representations are built from four kinds of features, for N OCR tokens
  ・300-dimensional FastText embedding
  ・Features extracted with Faster R-CNN + RoI-pooling
  ・604-dimensional Pyramidal Histogram of Characters (PHOC) (robust to OCR errors)
  Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Valveny. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2552–2566, 2014.
  ・4-dimensional OCR bounding boxes
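To make the "robust to OCR errors" remark about PHOC concrete, here is a simplified sketch of the unigram part of a pyramidal histogram of characters. It is written under assumptions: it follows the construction in Almazán et al. as far as I understand it (36 symbols, pyramid levels 2 to 5, a character counted in a region when at least half of its span falls inside), it omits the bigram part, so it yields 504 of the 604 dimensions, and the function names are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits   # 36 symbols
LEVELS = (2, 3, 4, 5)                                # pyramid levels (assumed)

def phoc_unigrams(word):
    """Binary PHOC vector, unigram part only: 36 * (2 + 3 + 4 + 5) = 504 dims."""
    word = word.lower()
    n = len(word)
    vec = []
    for level in LEVELS:
        for region in range(level):
            r_lo, r_hi = region / level, (region + 1) / level
            bits = [0] * len(ALPHABET)
            for k, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                c_lo, c_hi = k / n, (k + 1) / n       # span occupied by character k
                overlap = min(c_hi, r_hi) - max(c_lo, r_lo)
                if overlap / (c_hi - c_lo) >= 0.5:    # mostly inside this region
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return vec

def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

clean = phoc_unigrams("stop")
noisy = phoc_unigrams("st0p")   # one-character OCR error
other = phoc_unigrams("exit")   # a different word
print(len(clean), hamming(clean, noisy), hamming(clean, other))
```

Because each character only sets bits in the regions it falls into, a single misread character changes a handful of bits, while an unrelated word differs in many more. A word embedding lookup, by contrast, may have no entry at all for the misread string.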

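3.2 Multimodal fusion and iterative answer prediction with pointer-augmented transformers
L transformer layers are applied to the list of embeddings from the three modalities
Dynamic pointer decoding is applied to predict the answer
Decoding step t: scores are predicted both for the OCR tokens and for the words in the fixed vocabulary → the top-scoring candidate is output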
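A single decoding step of this kind can be sketched as follows. This is a toy illustration under assumptions, not the paper's code: the OCR scores are taken as a bilinear dot product between projected OCR outputs and the projected decoder output, the fixed-vocabulary scores as a linear layer, and every name and weight here (`decode_step`, `W_dec`, `W_ocr`, the 2-dimensional vectors) is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]

def decode_step(z_dec, z_ocr, vocab_W, vocab_b, W_dec, b_dec, W_ocr, b_ocr):
    """Score fixed-vocabulary words and OCR tokens, then return the best candidate.

    Returns ("vocab", i) or ("ocr", n). The OCR score of token n depends only on
    its own representation, not on its position in the list.
    """
    vocab_scores = linear(vocab_W, vocab_b, z_dec)
    query = linear(W_dec, b_dec, z_dec)
    ocr_scores = [dot(linear(W_ocr, b_ocr, z), query) for z in z_ocr]
    scores = vocab_scores + ocr_scores               # one joint candidate list
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < len(vocab_scores):
        return "vocab", best
    return "ocr", best - len(vocab_scores)

# Toy setup: identity projections, two vocabulary words, two OCR tokens.
I2 = [[1.0, 0.0], [0.0, 1.0]]
zero2 = [0.0, 0.0]
vocab_W = [[0.5, 0.0], [-1.0, 0.0]]
z_ocr = [[0.0, 1.0], [3.0, 0.0]]
choice = decode_step([1.0, 0.0], z_ocr, vocab_W, zero2, I2, zero2, I2, zero2)
print(choice)
```

In the full model the chosen item would be fed back as the input of the next decoding step until an end token is produced. The sketch covers only the scoring and selection of one step.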

... and ... are output as in machine translation
Masking is applied to capture causality in answer decoding: the mask covers the decoding-step positions, so that each step can use the outputs of previous decoding steps but not later ones
Rather than running the decoding step many times, it seems more accurate to interpret this as stacking many transformer layers
Dynamic pointer network: DeepCopy: Grounded response generation with hierarchical pointer networks
4
4.1 TextVQA dataset:
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
OCR system used (Rosetta OCR system):
Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.
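The causal masking mentioned in the decoding notes for section 3.2 can be sketched as a boolean attention mask. This is a sketch under assumptions: input entities (question words, objects, OCR tokens) attend to each other but never to decoding steps, and each decoding step attends to all inputs plus itself and earlier steps. The function name `build_attention_mask` and the toy sizes are hypothetical.

```python
def build_attention_mask(num_inputs, num_steps):
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..num_inputs-1 are input entities; the rest are decoding steps.
    """
    total = num_inputs + num_steps
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_inputs:
                mask[i][j] = True        # every position sees all input entities
            elif i >= num_inputs and j <= i:
                mask[i][j] = True        # a decoding step sees itself and earlier steps
    return mask

mask = build_attention_mask(num_inputs=3, num_steps=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With such a mask, all decoding steps can be processed in one pass through the stacked transformer layers during training, while at test time the answer is still produced one step at a time.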
