Proposes a novel attention mechanism, memory-based retrieval attention, for the visual dialog scenario
stores all past (key, value) attention pairs in an associative attention memory
uses a recency preference over memory entries
attends to and collects the most relevant combination of memories from the attention memory using the newest question embedding, and fuses it with the current attention via dynamic parameter prediction (minimal sketch below)
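A minimal sketch (numpy, not the authors' code) of the associative attention memory idea described above: each past turn stores a (key, value) pair, and the newest question embedding retrieves a soft combination of stored attention maps with a small recency preference. Class and parameter names (AttentionMemory, recency_weight) are illustrative assumptions, not from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AttentionMemory:
    def __init__(self, recency_weight=0.1):
        self.keys = []      # one key embedding per past dialogue turn
        self.values = []    # the attention map produced at that turn
        self.recency_weight = recency_weight

    def write(self, key, attention_map):
        self.keys.append(key)
        self.values.append(attention_map)

    def read(self, query):
        # relevance of each stored key to the current question embedding
        scores = np.array([k @ query for k in self.keys])
        # recency term: later turns receive a small additive preference
        recency = self.recency_weight * np.arange(len(self.keys))
        weights = softmax(scores + recency)
        # retrieved attention = weighted sum of stored attention maps
        return sum(w * v for w, v in zip(weights, self.values))
```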
Details
Introduction
The VQA task computes a tentative attention that visually grounds linguistic expressions for a single question
Visual Dialog needs to answer a sequence of questions, so the inter-dependency of questions within a dialogue presents additional challenges
e.g., Questions 2, 3, and 5 need knowledge from prior questions to resolve their references
Contributions
[Architecture] Uses tentative attention, a context-independent attention calculated from the current question and the dialogue history, and retrieved attention, an inter-dependent attention that retrieves the most relevant previous attentions from the associative attention memory; the two are dynamically combined via dynamic parameter prediction to produce the final attention. The sequential structure of dialogue is also addressed via a recency parameter (see the combination sketch after this list)
[Baseline] A synthetic visual dialog dataset (MNIST Dialog) and the proposed model's performance on the visual reference resolution task
[Benchmark] State-of-the-art performance on the VisDial dataset
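A hedged sketch of how the tentative and retrieved attentions might be fused: a tiny predictor maps the question embedding to question-dependent mixing weights over the two attention maps. In the paper the combining operation is described as a convolution whose kernel is dynamically predicted from the question; this sketch simplifies that to per-map scalar weights, so the exact parameterization is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_combine(q_embed, att_tentative, att_retrieved, W):
    # W: (2, d) weights of the predictor; the mixing weights depend on the
    # question, so they change per turn instead of being fixed parameters.
    w_tent, w_retr = W @ q_embed
    fused = w_tent * att_tentative + w_retr * att_retrieved
    # renormalize over spatial locations to get a valid attention map
    return softmax(fused.ravel()).reshape(fused.shape)
```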
Experiments
MNIST Dialog
Performance improves by a big margin even with the basic AMEM that uses neither a history embedding nor the sequential preference
the model still has indirect access to history through the attention memory; this signals that using an attention memory matters more than traditional history encoding
Both AMEM and AMEM+SEQ prefer recent memory entries
The retrieved attention carries real meaning for the final attention, acting somewhat like a reasoning step
VisDial dataset, built on MS-COCO
questions are free-form text and the initial history is constructed from the caption; less focused on visual reference resolution and contains fewer ambiguous expressions than MNIST Dialog (ratio of ambiguous questions: 94% in MNIST Dialog vs. 52% in VisDial)
SOTA! and with a much smaller parameter count (roughly 20% of the MN-QH baseline)
Model
the conv5 layer of VGG-16 trained on ImageNet is used to extract the image feature map
word embedding layers share their weights, and an LSTM is used to embed the questions
64-dim word embeddings, 128-dim hidden state for the LSTM
Adam with learning rate 0.001 and weight decay factor 0.0001 (setup sketch below)
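A sketch of the encoder setup from the notes above (PyTorch). The hyperparameters are taken from the paper summary; the module wiring and the placeholder vocabulary size are my assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vocab_size = 10000  # placeholder vocabulary size (assumption)

# Image features: conv5 feature map from an ImageNet-pretrained VGG-16
# (drop the final max-pool so the spatial feature map is preserved).
vgg = models.vgg16(pretrained=True)
image_encoder = nn.Sequential(*list(vgg.features.children())[:-1])

# Shared 64-dim word embedding; a 128-dim LSTM encodes the question.
word_embedding = nn.Embedding(vocab_size, 64)
question_lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

params = list(word_embedding.parameters()) + list(question_lstm.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
```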
Personal Thoughts
how does dynamic parameter prediction work?
no in-depth mathematical reasoning, but a novel architecture with a large margin of SOTA improvement
Applications
apply attention memory instead of traditional history encoding in NMT!
Link: https://arxiv.org/pdf/1709.07992.pdf, Authors: Seo et al., 2017