iamaaditya / VQA_Demo

Visual Question Answering Demo on pretrained model
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
MIT License

Unexpected results running demo.py #4

Closed dolaameng closed 8 years ago

dolaameng commented 8 years ago

Hi,

Thank you for sharing the demo. I was trying to repeat the experiment but came across unexpected results.

My python libraries:

Keras (1.0.5)
spacy (0.101.0)
cv2 (2.4.8)

When I run python demo.py, this is the result:

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 0: GeForce GTX 980M (CNMeM is disabled, cuDNN 5005)

Loading image features ...
Loading question features ...
Loading VQA Model ...

Predicting result ...
80.22 %  yes
19.78 %  no
000.0 %  woman
000.0 %  train
000.0 %  man

The only change I made to the code is line 63 in demo.py

word_embeddings = spacy.load('en')#, vectors='en_glove_cc_300_1m_vectors')

I used the default vectors because of a bug in the recent spaCy version; that shouldn't change the result too much, should it?

I also noticed that in pre-processing the images for VGG16, there is no mean subtraction like

# subtract the ImageNet channel means (BGR order, Caffe VGG16 convention)
img[:,:,0] -= 103.939
img[:,:,1] -= 116.779
img[:,:,2] -= 123.68

Would that cause a difference?
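For reference, the standard mean-subtraction step would look roughly like this (a minimal sketch on a toy array; channel order assumed BGR, per the Caffe VGG16 convention, and `preprocess_vgg16` is just an illustrative name, not a function in demo.py):

```python
# Sketch of the standard Caffe-style VGG16 mean subtraction on a toy
# array. Shown only for reference; whether the pretrained features
# expect it is exactly the question above.
import numpy as np

VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_vgg16(img_bgr):
    """img_bgr: HxWx3 float array in BGR channel order."""
    img = np.asarray(img_bgr, dtype=np.float32).copy()
    img -= VGG_MEAN_BGR  # broadcasts over the last (channel) axis
    return img

# Toy 2x2 "image" with every pixel set to 128 in all channels
img = np.full((2, 2, 3), 128.0, dtype=np.float32)
out = preprocess_vgg16(img)
```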

Appreciate your help on this! Thank you very much!

iamaaditya commented 8 years ago

Hi @dolaameng Sorry for the late reply.

  1. You are right about the mean subtraction. It should have been done, and it usually improves accuracy (by around 1 to 2 percentage points). However, since this model was trained on features obtained without mean subtraction, you should not add it if you want to use these weights; otherwise your features will be different from the ones the model saw, and you will get poor results.
  2. I am not quite sure what your issue is, but the high yes/no percentages suggest the model is not getting the word vectors: if you run it with a blank question, you are likely to get "yes" and "no" in a similar percentage range.

What is the shape (and, if possible, a sample value) of the output when you pass a word/phrase to word_embeddings?
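A quick way to run that diagnostic is a sanity check like the following (a sketch; `check_embedding` is a hypothetical helper, not part of demo.py). An all-zero vector means the embedding table was never actually loaded, which reproduces the blank-question failure mode:

```python
# Sanity-check a word vector: print its shape and norm, and flag
# near-zero vectors, which behave like a blank question downstream.
import numpy as np

def check_embedding(vec, name="token"):
    """Return True if the vector looks like a real, loaded embedding."""
    vec = np.asarray(vec, dtype=np.float32)
    norm = float(np.linalg.norm(vec))
    print(name, "shape:", vec.shape, "norm:", norm)
    return norm > 1e-6

good = check_embedding(np.random.randn(300), "loaded")   # realistic vector
bad = check_embedding(np.zeros(300), "missing")          # zero vector
```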

dolaameng commented 8 years ago

Hi @iamaaditya thanks for the reply! You were right about the word vectors, and those were good insights!

  1. I went back to using 'en_glove_cc_300_1m_vectors' following the fix, and it works.
  2. I also tried several images with and without the image mean subtraction, and the difference was not large, although I don't really know why using the exact word vectors plays a bigger role than using the exact image vectors here. I'd appreciate any insights on that!

Again thank you for your great work!

iamaaditya commented 8 years ago

@dolaameng

Images are continuous: small changes do not change the image much, and it is still recognisable as the old image. (That is why generative adversarial networks work.) The vector space of images is also large. Moreover, the features are obtained after training a very deep (19-layer) network, so they are robust to perturbations.

Words are not continuous. You cannot take the word "apple", add 0.0001, and expect it to still be "apple". If you do that to the embedding of "apple", it ends up somewhere in a high-dimensional space that is no longer recognisable as the word "apple". This is also because our vocabulary might be 10K (or at most 50K) words, but a real-valued vector of size 300 represents a much, much larger space. Think of it this way: in the whole solar system, there are only a few marbles that you recognise. That is why using embeddings other than the ones the model was trained with makes these systems useless.
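To make this concrete, here is a toy sketch (random tables, not real GloVe or spaCy vectors): the "same word" looked up in two independently built 300-dim embedding tables is nearly orthogonal, i.e. it reads as noise to a model trained on the other table:

```python
# Two independent random "embedding tables" stand in for the trained
# GloVe vectors vs. a mismatched set of default vectors.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 300, 10_000
table_a = rng.standard_normal((vocab, dim))
table_b = rng.standard_normal((vocab, dim))

def cos(u, w):
    return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))

# Same row index = "same word" looked up in mismatched tables
max_sim = max(abs(cos(table_a[i], table_b[i])) for i in range(100))
print("max |cosine| across 100 words:", max_sim)  # close to 0
```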

dolaameng commented 8 years ago

@iamaaditya : Thanks. I really like your explanation of the granularity of the text vector space and how it affects the result. I understand that one main reason for mean subtraction on images is to avoid gradient issues during the learning phase, but I am really interested to know how it affects the testing phase. To verify what you said, I did some simple tests.

  1. I see "mean subtraction on images" (or any translation of the colour channels) as shifting the image vectors (after VGG) in 3 directions (BGR), so I simulated the effect on the text side by grouping the 300 dimensions of the text vectors into 3 groups and applying a separate random translation to each, even though the CNN and LSTM may respond to translations in different nonlinear ways. I observed that changes to the images have less influence on the results than changes to the texts, as you mentioned.
  2. However, if I change the image and text vectors only slightly, e.g. by randomly translating or permuting a small portion of the dimensions, VQA is actually quite robust to these changes. That robustness might come from good features learnt for generalisation; I'm not sure which part of the model is responsible.
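The two perturbations can be sketched like this (a simplified, toy version applied to a single random 300-dim vector rather than the actual features):

```python
# Toy version of the two perturbation tests: a per-group translation
# and a small random permutation, measured by cosine similarity.
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(300)

def cos(u, w):
    return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))

# 1. "Channel-style" translation: 3 groups of 100 dims, each shifted
#    by its own random offset (mimics per-channel mean shifts).
shifted = v.reshape(3, 100) + rng.normal(scale=0.5, size=(3, 1))
shifted = shifted.reshape(300)

# 2. Small perturbation: permute a random 5% (15) of the dimensions.
perturbed = v.copy()
idx = rng.choice(300, size=15, replace=False)
perturbed[idx] = perturbed[rng.permutation(idx)]

cos_shifted, cos_perturbed = cos(v, shifted), cos(v, perturbed)
print(cos_shifted, cos_perturbed)  # both stay high: small changes
```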

I know these tests might be too preliminary to conclude anything, but I had fun exploring your model! I look forward to seeing more interesting stuff from you! Thank you!

iamaaditya commented 8 years ago

@dolaameng I am glad you had fun with it. I have another repo with code showing how you can train the model yourself: https://github.com/iamaaditya/VQA_Keras

Other people have uploaded much better models with more analysis, if you become more interested in the VQA task. Email me if you cannot find the relevant resources.

dolaameng commented 8 years ago

Thanks! I will definitely check it out!