Questions about the implementation

Hi! I'm reimplementing your EMNLP paper from scratch (to integrate it in my own codebase), while also following your code. There is one part of the code which isn't very clear to me, and also isn't mentioned in your paper (Line 82, model.py):

p_emb = self.linear_p(p_emb) + eps * self.linear_mini(p_emb)

Can you please let me know what is the goal of having this, and how not having it would affect the final result? Using your data (image region features, bounding box predictions, etc.), I was able to get up-to 51.38% accuracy on the Flickr30k validation set (using only region features, no label or attributes), but can't go further than that. Also, would you mind including some training/validation curves for easier debugging (if it's not too much to ask, of course).

Thanks for the great paper and repo!

qinzzz / Multimodal-Alignment-Framework

Questions about the implementation #5