howardyclo opened this issue 6 years ago
The neural-symbolic VQA (NS-VQA) model has three components:

- A scene parser (de-renderer) that detects objects in the image and predicts each object's attributes, producing a structural, disentangled scene representation.
- A question parser: an attention-based seq2seq model that translates the natural-language question into a program, i.e. a sequence of symbolic modules.
- A program executor that runs the program deterministically on the structural scene representation to obtain the answer.
CLEVR. The dataset includes synthetic images of 3D primitives with multiple attributes—shape, color, material, size, and 3D coordinates. Each image comes with a set of questions, each of which is associated with a program (a sequence of symbolic modules) generated by machines based on 90 logic templates.
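To make the structural scene representation and program execution concrete, here is a minimal sketch (not the authors' code; the scene, module names, and program encoding are hypothetical simplifications): the scene is a table of objects with discrete attributes, and a program is a sequence of modules applied to the current object set.

```python
# Hypothetical structured scene: one entry per detected object.
scene = [
    {"shape": "cube",     "color": "red",  "material": "metal",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "material": "rubber", "size": "small"},
    {"shape": "cylinder", "color": "red",  "material": "rubber", "size": "small"},
]

def filter_attr(objects, attr, value):
    """Keep only objects whose attribute matches the value."""
    return [o for o in objects if o[attr] == value]

def count(objects):
    """Terminal module: return the number of remaining objects."""
    return len(objects)

# Execute the program for "How many red things are there?"
program = [("filter", "color", "red"), ("count",)]
state = scene
for module in program:
    if module[0] == "filter":
        state = filter_attr(state, module[1], module[2])
    elif module[0] == "count":
        state = count(state)
print(state)  # -> 2
```

Because every module is a deterministic function over a fully symbolic scene, execution itself requires no learning; all learning happens in the two parsers.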
Repeated experiments starting from different sets of programs show a standard deviation below 0.1 percent in the results for 270 or more pretraining programs. The variance is larger when the model is trained with fewer programs (90 and 180). The reported numbers are the mean of three runs.
IEP tends to fake a long, wrong program that nonetheless leads to the correct answer. In contrast, NS-VQA achieves 88% program accuracy with only 500 annotated programs, and performs almost perfectly on both question answering and program recovery with 9K programs.
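A toy illustration of the point about spurious programs (the scene and programs here are hypothetical, not from the paper): answer accuracy can overstate program accuracy, because a wrong program can still return the right answer on a given scene.

```python
scene = [
    {"shape": "cube",   "color": "red"},
    {"shape": "sphere", "color": "red"},
]

def run(program, objects):
    """Execute a program (list of modules) on a structured scene."""
    for op, *args in program:
        if op == "filter":
            attr, value = args
            objects = [o for o in objects if o[attr] == value]
        elif op == "count":
            objects = len(objects)
    return objects

# "How many red things are there?"
gt_program    = [("filter", "color", "red"), ("count",)]  # intended program
wrong_program = [("count",)]                              # spurious program

# Both return 2 on this scene, so answer accuracy alone cannot tell them apart.
print(run(gt_program, scene), run(wrong_program, scene))  # -> 2 2
```

This is why the paper reports program accuracy separately from question-answering accuracy.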
CLEVR-CoGenT. Derived from CLEVR and separated into two biased splits:

- Split A: cubes are gray, blue, brown, or yellow; cylinders are red, green, purple, or cyan.
- Split B: the color palettes are swapped (cubes red, green, purple, or cyan; cylinders gray, blue, brown, or yellow). Spheres may take any color in both splits.
See Table 2a.
See Table 2b.
This shows that our structural scene representation and symbolic program executor help exploit the strong exploration power of REINFORCE, and also demonstrates the model's generalizability across different question styles.
Beyond supervised learning, recent work by Irina Higgins et al. [4] has made inspiring attempts to explore how concepts naturally emerge during unsupervised learning (see the related work on structural scene representations in the paper). We see integrating our model with these approaches as a promising future direction.