ronghanghu / snmn

Code release for Hu et al., Explainable Neural Computation via Stack Neural Module Networks. In ECCV, 2018.
http://ronghanghu.com/snmn/
BSD 2-Clause "Simplified" License

Inconsistency wrt paper in code #3

Closed vardaan123 closed 5 years ago

vardaan123 commented 5 years ago

I noticed two inconsistencies between your code and the equations in the paper (for VQA training without layout):

  1. In the paper (end of Sec. 3.1), the textual command is obtained as a weighted combination of the LSTM outputs (with attention weights cv_{t,s}). But in your code, the vqa_scratch config sets cfg.MODEL.CTRL.USE_WORD_EMBED = True, which means the weighted combination is taken over the word embeddings instead of the LSTM outputs (see the sketch after this list).

  2. For the sharpening of the stack pointer, the paper (Sec. 3.3, 2nd paragraph) says to use a softmax, but the vqa_scratch config sets cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN = True.
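
For concreteness, here is a minimal PyTorch-style sketch of what point 1 amounts to; the tensor names, shapes, and the `textual_command` helper are my own illustration, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def textual_command(attn_logits, lstm_out, word_embed, use_word_embed=True):
    # attn_logits: [batch, seq_len]     unnormalized word-attention scores at timestep t
    # lstm_out:    [batch, seq_len, d]  Bi-LSTM output for each question word
    # word_embed:  [batch, seq_len, d]  raw embedding for each question word
    cv = F.softmax(attn_logits, dim=1)                  # attention weights over words
    enc = word_embed if use_word_embed else lstm_out    # what USE_WORD_EMBED toggles
    # weighted combination of the chosen per-word encodings -> textual command
    return torch.bmm(cv.unsqueeze(1), enc).squeeze(1)   # [batch, d]
```

With `use_word_embed=True` the attention is applied directly to the embeddings, matching the vqa_scratch config; with `False` it matches the equation at the end of Sec. 3.1.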

I tried setting cfg.MODEL.CTRL.USE_WORD_EMBED = True in my PyTorch implementation, but the accuracy drops significantly. Could you please explain?

ronghanghu commented 5 years ago

Hi, the default configuration files here are slightly different from the model described in the paper.

From our experiments, these two hyperparameters cfg.MODEL.CTRL.USE_WORD_EMBED and cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN have little impact on the final accuracy.

Either the Bi-LSTM outputs or the raw word embeddings can be used as the encoding for each word; these encodings are then averaged by the attention weights to produce a "textual command" for further processing.

Also, although soft sharpening (softmax) makes the model fully differentiable, hard sharpening (one-hot, non-differentiable) works almost as well.
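
For reference, a small sketch of the two sharpening variants; the `sharpen_pointer` helper and the temperature-free soft path are a simplification for illustration, not the repo's code:

```python
import torch
import torch.nn.functional as F

def sharpen_pointer(p, use_hard_sharpen=True):
    # p: [batch, stack_depth] soft stack-pointer distribution
    if use_hard_sharpen:
        # hard sharpening: snap to a one-hot vector at the argmax
        # (non-differentiable; cfg.MODEL.NMN.STACK.USE_HARD_SHARPEN = True)
        idx = p.argmax(dim=1)
        return F.one_hot(idx, num_classes=p.size(1)).to(p.dtype)
    # soft sharpening: re-normalize with a softmax, as described in Sec. 3.3,
    # which keeps the stack operations fully differentiable
    return F.softmax(p, dim=1)
```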