Proposes a novel attention mechanism, memory-based retrieval attention, for the visual dialog scenario
stores all past (key, value) attention pairs in an associative attention memory
uses a recency preference over memory entries
attends to and collects the most relevant combination of memories from the attention memory using the newest question embedding, and fuses it with the current attention via dynamic parameter prediction (minimal sketch below)
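A minimal sketch (numpy, not the authors' code) of the associative attention memory idea described above: each past turn stores a (key, value) pair, and the newest question embedding retrieves a soft combination of stored attention maps with a small recency preference. Class and parameter names (AttentionMemory, recency_weight) are illustrative assumptions, not from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AttentionMemory:
    def __init__(self, recency_weight=0.1):
        self.keys = []      # one key embedding per past dialogue turn
        self.values = []    # the attention map produced at that turn
        self.recency_weight = recency_weight

    def write(self, key, attention_map):
        self.keys.append(key)
        self.values.append(attention_map)

    def read(self, query):
        # relevance of each stored key to the current question embedding
        scores = np.array([k @ query for k in self.keys])
        # recency term: later turns receive a small additive preference
        recency = self.recency_weight * np.arange(len(self.keys))
        weights = softmax(scores + recency)
        # retrieved attention = weighted sum of stored attention maps
        return sum(w * v for w, v in zip(weights, self.values))
```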
Details
Introduction
The VQA task computes a tentative attention that visually grounds linguistic expressions for a single question
Visual Dialog needs to answer a sequence of questions, so the inter-dependency of questions within a dialogue presents additional challenges
e.g., Questions 2, 3, and 5 need knowledge from prior questions to resolve their references
Contributions
[Architecture] Uses tentative attention, a context-independent attention calculated from the current question and the dialogue history, and retrieved attention, an inter-dependent attention that retrieves the most relevant previous attentions from the associative attention memory; the two are dynamically combined via dynamic parameter prediction to produce the final attention. The sequential structure of dialogue is also addressed via a recency parameter (see the combination sketch after this list)
[Baseline] A synthetic visual dialog dataset (MNIST Dialog) and the proposed model's performance on the visual reference resolution task
[Benchmark] State-of-the-art performance on the VisDial dataset
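A hedged sketch of how the tentative and retrieved attentions might be fused: a tiny predictor maps the question embedding to question-dependent mixing weights over the two attention maps. In the paper the combining operation is described as a convolution whose kernel is dynamically predicted from the question; this sketch simplifies that to per-map scalar weights, so the exact parameterization is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_combine(q_embed, att_tentative, att_retrieved, W):
    # W: (2, d) weights of the predictor; the mixing weights depend on the
    # question, so they change per turn instead of being fixed parameters.
    w_tent, w_retr = W @ q_embed
    fused = w_tent * att_tentative + w_retr * att_retrieved
    # renormalize over spatial locations to get a valid attention map
    return softmax(fused.ravel()).reshape(fused.shape)
```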
Experiments
MNIST Dialog
Performance improves by a big margin even with the basic AMEM that uses neither a history embedding nor the sequential preference
the model still has indirect access to history through the attention memory; this signals that using an attention memory matters more than traditional history encoding
Both AMEM and AMEM+SEQ prefer recent memory entries
The retrieved attention carries real meaning for the final attention, acting somewhat like a reasoning step
VisDial dataset, built on MS-COCO
questions are free-form text and the initial history is constructed from the caption; less focused on visual reference resolution and contains fewer ambiguous expressions than MNIST Dialog (ratio of ambiguous questions: 94% in MNIST Dialog vs. 52% in VisDial)
SOTA! and with a much smaller parameter count (roughly 20% of the MN-QH baseline)
Model
the conv5 layer of VGG-16 trained on ImageNet is used to extract the image feature map
word embedding layers share their weights, and an LSTM is used to embed the questions
64-dim word embeddings, 128-dim hidden state for the LSTM
Adam with learning rate 0.001 and weight decay factor 0.0001 (setup sketch below)
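A sketch of the encoder setup from the notes above (PyTorch). The hyperparameters are taken from the paper summary; the module wiring and the placeholder vocabulary size are my assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vocab_size = 10000  # placeholder vocabulary size (assumption)

# Image features: conv5 feature map from an ImageNet-pretrained VGG-16
# (drop the final max-pool so the spatial feature map is preserved).
vgg = models.vgg16(pretrained=True)
image_encoder = nn.Sequential(*list(vgg.features.children())[:-1])

# Shared 64-dim word embedding; a 128-dim LSTM encodes the question.
word_embedding = nn.Embedding(vocab_size, 64)
question_lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

params = list(word_embedding.parameters()) + list(question_lstm.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
```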
Personal Thoughts
how does dynamic parameter prediction work?
no in-depth mathematical reasoning, but a novel architecture with a large margin of SOTA improvement
Applications
apply attention memory instead of traditional history encoding in NMT!
Link: https://arxiv.org/pdf/1709.07992.pdf, Authors: Seo et al., 2017