Open howardyclo opened 6 years ago
3,487,194 total question-answer pairs for 300,000 images, divided into three major question types.
Some problems:
- I think the implementation part lacks clarity. (See the comment below for more.)
- Augmenting the fixed vocabulary with a dynamic vocabulary by adding extra classes to the classifier may just make the model learn the bias of the frequent classes in the dynamic vocabulary. Maybe an implementation along the lines of "Pointer-Generator Networks" would be more correct.
For the problems in the SANDY part, I asked the author directly. Here are the Q&As:
The static (global dictionary) lookup table starts from M (M=30 in our case) for both answers and questions. So all the regularly occurring words such as 'what', 'smallest', etc. start from 30.
Indices 0-29 are reserved for the dynamic (or local) dictionary in both answer classes and question tokens. Index 0 refers to the word closest to the bottom left of the chart, index 1 refers to the word nearest to word 0, and so on.
As a result, we have converted words, regardless of whether they are seen before or novel, into their relative position in the chart. This means we can encode both the question and the answer words into tokens that are no longer arbitrary but grounded in their position.
No changes need to be made to the rest of the architecture. The embedding is carried out by a regular embedding layer.
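As a rough sketch of how such a local dictionary could be built from OCR output — assuming a list of `(word, (x, y))` tuples and treating `(0, 0)` as the bottom-left corner; the function name and coordinate convention are mine, not from the paper:

```python
import math

def build_local_dict(ocr_words, max_slots=30):
    """Assign local ids by position: id 0 is the word closest to the
    bottom-left corner, and each next id is the still-unassigned word
    nearest to the previously assigned one (as described above)."""
    remaining = list(ocr_words)        # [(word, (x, y)), ...]
    ordered = []
    anchor = (0.0, 0.0)                # bottom-left of the chart
    while remaining and len(ordered) < max_slots:
        nearest = min(remaining, key=lambda w: math.dist(anchor, w[1]))
        remaining.remove(nearest)
        ordered.append(nearest)
        anchor = nearest[1]
    return {i: word for i, (word, _) in enumerate(ordered)}
```

The exact ordering heuristic (nearest-neighbor chaining) follows the description above, but the paper may resolve ties or measure distance differently.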
Perhaps an example will help. Imagine the image attached.
Q1. Which bar has the highest value? A: week
Q2. What is the value of song? A: "1"
For this image, using OCR, our local dictionary might be something like {0: season, 1: camera, 2: song, 3: week, 5: 0, 6: 2, 7: 4, 8: values, 9: 6, 10: 8, 11: 10, 12: title; 13-30 are undefined} and the global dictionary something like, say, {30: which, 31: highest, 32: bar, 33: what, etc.}. The global dictionary is shared by all images, but the local dictionary changes for each image based on word positions.
So tokenizing Q1 doesn't make use of the local dictionary because it's not needed, e.g., Q1: [30, 32, ...], but A1 contains a chart-specific answer, so the ground truth is class 3.
Similarly for Q2: for the words what, is, the, value, of, the global dictionary is used, but for song we use the dynamic dictionary, which converts the word "song" to token 2. So the question may be tokenized as: [33, .., .., .., .., 2]
After this, there is no need to change the rest of the pipeline, e.g., one can use any regular VQA model.
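A minimal sketch of the tokenization in this worked example — the dictionaries are copied from the example above (inverted to word → id), and the `tokenize` helper is my own illustration, not the paper's code:

```python
# Toy dictionaries from the example above, inverted to word -> id.
# Local ids (0-29) are per-image; global ids start at 30.
local_dict = {"season": 0, "camera": 1, "song": 2, "week": 3}
global_dict = {"which": 30, "highest": 31, "bar": 32, "what": 33}

def tokenize(question):
    """Prefer the image-specific local id; fall back to the global id.
    Words outside both toy dictionaries map to None here."""
    return [local_dict.get(w, global_dict.get(w))
            for w in question.lower().split()]

# Q2 begins with global id 33 and ends with local id 2, matching the
# [33, .., .., .., .., 2] tokenization above:
tokens = tokenize("what is the value of song")
# -> [33, None, None, None, None, 2]
```

In the real pipeline the filler words would of course have global ids too; `None` here just marks words missing from the toy dictionaries.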
I think this implementation will make the model learn the bias of the frequent classes of the local dictionary (it is permutation-variant with respect to the relative positions of novel words), i.e., if we change the ids of novel words in the local dictionary, the model may fail to predict the right novel word. The embedding matrix also has this problem (the 0th to 29th embeddings are meaningless because their corresponding words vary from chart to chart). Please feel free to correct me if I am wrong.
The dynamic encoding scheme has several shortcomings that can be improved in future work (e.g., if even a single OCR result is imperfect, the whole encoding can fail for the image), but I don't think it has the problem that you mentioned. The 0th to 29th positions are not meaningless because they are grounded in relative position. I find it easy to think of this as converting questions from direct to indirect reference, e.g.:
Q: What is the value of XXX? A: 10 ==> Q: What is the value of the third text from the left? A: 10
Q: Which bar is the highest? A: XXX ==> Q: Which bar is the highest? A: the second text from the left
As I said, this depends on the success of the OCR and visual processing pipeline to make the connection that the 3rd text from the left refers to the nth bar, but it makes it feasible for the algorithm to learn that. Otherwise, classification-based systems have no chance of answering novel words. I'm going to try to respond to some key points.
> make the model learn the bias of frequent classes of local dictionary

Yes, that is possible. While our dataset is randomly distributed, it may not remove all kinds of hidden correlations. E.g., since charts can have 3-9 bars, the first, second, and third bars are always present, but each successive bar has less and less chance of being present. So for "How many bars are there?", answering 3 blindly will have a higher chance of being right. We have tried to combat this to some degree (Section 3.2). But there can be many more forms of spurious correlations that still exist; the chances of large-scale correlations are minimal, though, as we have randomized most elements in the chart.
> permutation-variant to the relative position of novel words

Yes, and that is precisely the point. We are no longer concerned with what the actual word is; SANDY's task is to learn to parse the indices 0-29 in terms of relative position with respect to each image. The model doesn't know (or even need to know) what the actual word is, only that it is the Nth word in the chart. You can replace that word with any other new word (or permute it); the answer is still N. The dictionary is then used to "decode" what N means, which is entirely detached from the learning process. The model never even gets to see that word.
> if we change the ids of novel words in local dictionary, the model may fail to predict the right novel word
But we don't change the ids of the words arbitrarily; 0 always represents "the first text on the bottom-left". As seen in the results, SANDY shows no drop in performance on completely new words relative to already-seen words.
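To see why replacing or permuting the chart words doesn't affect the classifier, here is a hedged sketch of the decode step (the function name and dictionary shapes are mine, not the paper's):

```python
M = 30  # number of reserved local-dictionary slots (M=30 in the paper)

def decode_answer(class_id, local_dict, global_dict):
    """Turn a predicted answer class back into a word.

    Classes below M are looked up in this image's local dictionary,
    so the id-to-word mapping is fixed by position, not by the word
    itself; the model never sees the word."""
    if class_id < M:
        return local_dict[class_id]    # position-grounded, per image
    return global_dict[class_id]       # shared fixed vocabulary

# For the example chart above, predicting class 3 decodes to "week";
# on a chart where position 3 holds a novel word, the same prediction
# decodes to that word instead.
```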
That being said, there are a lot of possible directions for improvement in dealing with chart-specific labels, whether they are novel words or already-seen words, that still need to be addressed. I hope to see some interesting work in the future.
Hey, do you know why the global dictionary for SANDY has just 77 words (107 - 30)? MOM's classification branch and other baseline models use a much larger set of classes (1000+). And how are these words/classes selected?