Hi, the default configuration file here is slightly different from the model described in the paper.
From our experiments, these two hyperparameters, `cfg.MODEL.CTRL.USE_WORD_EMBED` and `cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN`, have little impact on the final accuracy.
Either the Bi-LSTM outputs or the raw word embeddings can be used as the encoding for each word; these per-word encodings are then averaged with attention weights to produce a "textual command" for further processing.
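As a rough illustration (not the repo's actual code), here is a minimal PyTorch sketch of that choice; `attn_logits`, `word_embeds`, and `lstm_out` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def textual_command(attn_logits, word_embeds, lstm_out, use_word_embed):
    """Attention-average per-word encodings into a single 'textual command'.

    attn_logits: (batch, seq_len) unnormalized attention scores
    word_embeds: (batch, seq_len, embed_dim) raw word embeddings
    lstm_out:    (batch, seq_len, 2 * hidden_dim) Bi-LSTM outputs
    """
    cv = F.softmax(attn_logits, dim=1)                  # attention over words, cv_{t,s}
    enc = word_embeds if use_word_embed else lstm_out   # the flag picks the per-word encoding
    return torch.bmm(cv.unsqueeze(1), enc).squeeze(1)   # weighted sum over the word axis
```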
Also, although soft sharpening (softmax) keeps the model fully differentiable, hard sharpening (one-hot, non-differentiable) works almost as well.
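For concreteness, a sketch of the two sharpening variants applied to the stack pointer (hypothetical shapes; any temperature or scaling factor is omitted):

```python
import torch
import torch.nn.functional as F

def sharpen_pointer(pointer, use_hard_sharpen):
    """pointer: (batch, stack_depth) soft stack-pointer distribution."""
    if use_hard_sharpen:
        # Hard sharpen: snap to a one-hot vector at the argmax (non-differentiable).
        idx = pointer.argmax(dim=1)
        return F.one_hot(idx, num_classes=pointer.size(1)).to(pointer.dtype)
    # Soft sharpen: renormalize with a softmax (fully differentiable).
    return F.softmax(pointer, dim=1)
```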
I noticed two inconsistencies between your code and the equations in the paper (for VQA training without layout):

1. In the paper (end of Sec. 3.1), the textual command is obtained as a weighted combination of the LSTM outputs, with attention weights `cv_{t,s}`. But in your code, the config for `vqa_scratch` sets `cfg.MODEL.CTRL.USE_WORD_EMBED = True`, which means the weighted combination is taken over the word embeddings instead of the LSTM outputs.
2. For sharpening the stack pointer, the paper (Sec. 3.3, 2nd paragraph) says to use softmax, but the config for `vqa_scratch` sets `cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN = True`.
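Concretely, these are the two settings I am referring to in the `vqa_scratch` config:

```python
cfg.MODEL.CTRL.USE_WORD_EMBED = True         # attend over word embeddings, not Bi-LSTM outputs
cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN = True  # one-hot (argmax) stack-pointer sharpening
```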
I tried `cfg.MODEL.CTRL.USE_WORD_EMBED = True` in my PyTorch implementation, but the accuracy drops significantly. Please explain.