Open howardyclo opened 6 years ago
3,487,194 total question-answer pairs for 300,000 images, divided into three major question types.
Some problems:
- I think the implementation part lacks clarity. (See the comment below for more.)
- Augmenting the fixed vocabulary with a dynamic vocabulary by adding extra classes to the classifier may just make the model learn the bias of the frequent classes in the dynamic vocabulary. Maybe an implementation along the lines of "Pointer-Generator Networks" would be more correct.
For the problems in the SANDY part, I asked the author directly. Here are the Q&As:
The static (global dictionary) lookup table starts from M (M=30 in our case) for both answers and questions. So all the regularly occurring words such as 'what', 'smallest', etc. start from 30.
Indices 0-29 are reserved for the dynamic (or local) dictionary in both answer classes and question tokens. Index 0 refers to the word closest to the bottom left of the chart, index 1 refers to the word nearest to word 0, and so on.
As a result, we have converted words, regardless of whether they are seen before or novel, into their relative position in the chart. This means we can encode both the question and the answer words into tokens that are no longer arbitrary but grounded in their position.
No changes need to be made to the rest of the architecture. The embedding is carried out by a regular embedding layer.
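As a rough sketch of how such a local dictionary could be built from OCR output — assuming a list of `(word, (x, y))` tuples and treating `(0, 0)` as the bottom-left corner; the function name and coordinate convention are mine, not from the paper:

```python
import math

def build_local_dict(ocr_words, max_slots=30):
    """Assign local ids by position: id 0 is the word closest to the
    bottom-left corner, and each next id is the still-unassigned word
    nearest to the previously assigned one (as described above)."""
    remaining = list(ocr_words)        # [(word, (x, y)), ...]
    ordered = []
    anchor = (0.0, 0.0)                # bottom-left of the chart
    while remaining and len(ordered) < max_slots:
        nearest = min(remaining, key=lambda w: math.dist(anchor, w[1]))
        remaining.remove(nearest)
        ordered.append(nearest)
        anchor = nearest[1]
    return {i: word for i, (word, _) in enumerate(ordered)}
```

The exact ordering heuristic (nearest-neighbor chaining) follows the description above, but the paper may resolve ties or measure distance differently.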
Perhaps an example will help. Imagine the image attached.
Q1. Which bar has the highest value? A: week
Q2. What is the value of song? A: "1"
For this image, using OCR, our local dictionary might be something like {0: season, 1: camera, 2: song, 3: week, 5: 0, 6: 2, 7: 4, 8: values, 9: 6, 10: 8, 11: 10, 12: title; 13-30 are undefined} and the global dictionary something like, say, {30: which, 31: highest, 32: bar, 33: what, etc.}. The global dictionary is shared by all images, but the local dictionary changes for each image based on word positions.
So tokenizing Q1 doesn't make use of the local dictionary because it's not needed, e.g., Q1: [30, 32, ...], but A1 contains a chart-specific answer, so the ground truth is class 3.
Similarly for Q2: for the words what, is, the, value, of, the global dictionary is used, but for song we use the dynamic dictionary, which converts the word "song" to token 2. So the question may be tokenized as: [33, .., .., .., .., 2]
After this, there is no need to change the rest of the pipeline, e.g., one can use any regular VQA model.
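A minimal sketch of the tokenization in this worked example — the dictionaries are copied from the example above (inverted to word → id), and the `tokenize` helper is my own illustration, not the paper's code:

```python
# Toy dictionaries from the example above, inverted to word -> id.
# Local ids (0-29) are per-image; global ids start at 30.
local_dict = {"season": 0, "camera": 1, "song": 2, "week": 3}
global_dict = {"which": 30, "highest": 31, "bar": 32, "what": 33}

def tokenize(question):
    """Prefer the image-specific local id; fall back to the global id.
    Words outside both toy dictionaries map to None here."""
    return [local_dict.get(w, global_dict.get(w))
            for w in question.lower().split()]

# Q2 begins with global id 33 and ends with local id 2, matching the
# [33, .., .., .., .., 2] tokenization above:
tokens = tokenize("what is the value of song")
# -> [33, None, None, None, None, 2]
```

In the real pipeline the filler words would of course have global ids too; `None` here just marks words missing from the toy dictionaries.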
I think this implementation will make the model learn the bias of the frequent classes of the local dictionary (it is permutation-variant with respect to the relative positions of novel words), i.e., if we change the ids of novel words in the local dictionary, the model may fail to predict the right novel word. The embedding matrix also has this problem (the 0th to 29th embeddings are meaningless because their corresponding words vary from chart to chart). Please feel free to correct me if I am wrong.
The dynamic encoding scheme has several shortcomings that can be improved in future work (e.g., if even a single OCR result is imperfect, the whole encoding can fail for the image), but I don't think it has the problem that you mentioned. The 0th to 29th positions are not meaningless because they are grounded in relative position. I find it easy to think of this as converting questions from direct to indirect reference, e.g.:
Q: What is the value of XXX? A: 10 ==> Q: What is the value of the third text from the left? A: 10
Q: Which bar is the highest? A: XXX ==> Q: Which bar is the highest? A: the second text from the left
As I said, this depends on the success of the OCR and visual processing pipeline to make the connection that the 3rd text from the left refers to the nth bar, but it makes it feasible for the algorithm to learn that. Otherwise, classification-based systems have no chance of answering novel words. I'm going to try to respond to some key points.
> make the model learn the bias of frequent classes of local dictionary

Yes, that is possible. While our dataset is randomly distributed, it may not remove all kinds of hidden correlations. E.g., since charts can have 3-9 bars, the first, second, and third bars are always present, but each successive bar has less and less chance of being present. So for "How many bars are there?", answering 3 blindly will have a higher chance of being right. We have tried to combat this to some degree (Section 3.2). But there can be many more forms of spurious correlations that still exist; the chances of large-scale correlations are minimal, though, as we have randomized most elements in the chart.
> permutation-variant to the relative position of novel words

Yes, and that is precisely the point. We are no longer concerned with what the actual word is; SANDY's task is to learn to parse the indices 0-29 in terms of relative position with respect to each image. The model doesn't know (or even need to know) what the actual word is, only that it is the Nth word in the chart. You can replace that word with any other new word (or permute it); the answer is still N. The dictionary is then used to "decode" what N means, which is entirely detached from the learning process. The model never even gets to see that word.
> if we change the ids of novel words in local dictionary, the model may fail to predict the right novel word
But we don't change the ids of the words arbitrarily; 0 always represents "the first text on the bottom-left". As seen in the results, SANDY shows no drop in performance on completely new words relative to already-seen words.
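To see why replacing or permuting the chart words doesn't affect the classifier, here is a hedged sketch of the decode step (the function name and dictionary shapes are mine, not the paper's):

```python
M = 30  # number of reserved local-dictionary slots (M=30 in the paper)

def decode_answer(class_id, local_dict, global_dict):
    """Turn a predicted answer class back into a word.

    Classes below M are looked up in this image's local dictionary,
    so the id-to-word mapping is fixed by position, not by the word
    itself; the model never sees the word."""
    if class_id < M:
        return local_dict[class_id]    # position-grounded, per image
    return global_dict[class_id]       # shared fixed vocabulary

# For the example chart above, predicting class 3 decodes to "week";
# on a chart where position 3 holds a novel word, the same prediction
# decodes to that word instead.
```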
That being said, there are a lot of possible directions for improvement in dealing with chart-specific labels, whether they are novel words or already-seen words, that still need to be addressed. I hope to see some interesting work in the future.
Hey, do you know why the global dictionary for SANDY has just 77 words (107 - 30)? MOM's classification branch and other baseline models use a much larger set of classes (1000+). And how are these words/classes selected?