
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding #31


Metadata

- Paper: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding (NeurIPS 2018)
- Authors: Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Joshua B. Tenenbaum
- Link: https://arxiv.org/abs/1810.02338


Summary

The neural-symbolic VQA (NS-VQA) model has three components (a minimal sketch of the executor follows the list):

- Scene Parser: de-renders the image into a structural scene representation; the paper uses Mask R-CNN to generate segment proposals and a ResNet to predict each object's attributes and 3D coordinates.
- Question Parser: an attention-based seq2seq model (LSTM encoder and decoder) that translates the question into a symbolic program.
- Program Executor: a collection of deterministic, functional modules that runs the program on the scene representation to produce the answer.
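
Since the executor is plain deterministic code, a toy version is easy to write down. Below is a minimal sketch, assuming a hypothetical object-record format for the scene representation and two illustrative modules; names follow the CLEVR module vocabulary but this is not the paper's actual implementation.

```python
# Hypothetical structural scene representation: one record per object,
# as the scene parser would recover it (3D coordinates omitted for brevity).
scene = [
    {"shape": "cube",   "color": "red",  "material": "metal",  "size": "large"},
    {"shape": "sphere", "color": "blue", "material": "rubber", "size": "small"},
]

def filter_color(objects, color):
    # Keep only the objects with the given color attribute.
    return [obj for obj in objects if obj["color"] == color]

def query_shape(objects):
    # Assumes the program has already narrowed the set to a single object.
    assert len(objects) == 1
    return objects[0]["shape"]

# Executing the program for "What shape is the red object?"
answer = query_shape(filter_color(scene, "red"))
print(answer)  # -> cube
```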


Evaluation: Data-Efficient, Interpretable Reasoning

Dataset

CLEVR. The dataset includes synthetic images of 3D primitives with multiple attributes: shape, color, material, size, and 3D coordinates. Each image comes with a set of questions, each of which is associated with a program (a sequence of symbolic modules) machine-generated from 90 logic templates.
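
To make the program representation concrete, here is a hedged example of what such a machine-generated program might look like; the module names follow CLEVR conventions, but the exact serialization is an assumption.

```python
# Hypothetical linearized program for:
# "What color is the cube to the left of the sphere?"
program = [
    "scene",                  # start from the full object set
    "filter_shape[sphere]",   # keep the spheres
    "unique",                 # assert a single referent
    "relate[left]",           # objects to its left
    "filter_shape[cube]",     # keep the cubes
    "unique",
    "query_color",            # final attribute query
]
```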


Quantitative results

Repeated experiments starting from different sets of pretraining programs show a standard deviation of less than 0.1 percent in the results for 270 programs (and beyond); the variance is larger when the model is pretrained with fewer programs (90 and 180). The reported numbers are the mean of three runs.


Data-efficiency comparison


Qualitative examples

IEP tends to generate long, spurious programs that nonetheless lead to the correct answer. In contrast, NS-VQA achieves 88% program accuracy with only 500 program annotations, and performs almost perfectly on both question answering and program recovery with 9K programs.
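
The notes do not spell out how program accuracy is computed; a reasonable reading, sketched below, is exact sequence match between predicted and annotated programs.

```python
def program_accuracy(predicted_programs, gold_programs):
    # A predicted program counts as correct only if every module
    # (including arguments) matches the ground-truth annotation exactly.
    correct = sum(p == g for p, g in zip(predicted_programs, gold_programs))
    return correct / len(gold_programs)
```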


Evaluation: Generalizing to Unseen Attribute Combinations

Dataset

CLEVR-CoGenT. Derived from CLEVR and separated into two biased splits:

- Split A: all cubes are gray, blue, brown, or yellow, and all cylinders are red, green, purple, or cyan.
- Split B: the two color palettes are swapped. Spheres can take any color in both splits.

Models are trained on split A and tested on both splits to measure generalization to unseen attribute combinations.


Results

See Table 2a.


Evaluation: Generalizing to Questions from Humans

Dataset

CLEVR-Humans: free-form questions written by humans about CLEVR images (collected for the IEP paper), which use more diverse vocabulary and question styles than the machine-generated CLEVR questions.

Results

See Table 2b.

This shows that the structural scene representation and the symbolic program executor help exploit the strong exploration power of REINFORCE (the question parser is pretrained on a small set of annotated programs, then fine-tuned with REINFORCE using only question-answer pairs), and it also demonstrates the model's generalizability across question styles.
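
For reference, the REINFORCE fine-tuning objective can be sketched as follows. This is a minimal version assuming a scalar 0/1 reward from the executor and a caller-supplied baseline (the paper uses a moving-average baseline); it is not the authors' actual code.

```python
import torch

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    # token_log_probs: 1-D tensor of log-probabilities of the sampled
    # program tokens from the question parser.
    # reward: 1.0 if executing the sampled program yields the correct
    # answer, 0.0 otherwise.
    return -(reward - baseline) * token_log_probs.sum()
```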


Evaluation: Extending to New Scene Context

Dataset

Minecraft scenes: synthetic images rendered in Minecraft, with object categories and scene contexts that differ from CLEVR's simple 3D primitives.

Results


Future Research

Beyond supervised learning, recent papers by Irina Higgins et al. [4] have made inspiring attempts to explore how concepts naturally emerge during unsupervised learning (see the related work on structural scene representation in the paper). We see integrating our model with these approaches as a promising future direction.


Reference