jwehrmann / retrieval.pytorch

Adaptive Cross-Modal Embeddings for Image-Sentence Alignment

Difference to ICCV 2019 work? #2

ratthachat closed this issue 4 years ago

ratthachat commented 4 years ago

Hi Jonatas,

Great work! I am wondering: besides being multi-lingual, what's the main difference or improvement over your ICCV 2019 work?

Do they have similar performance?

jwehrmann commented 4 years ago

Hi! Thank you. Q: what's the main difference or improvement over your ICCV 2019 work? A: They are quite different. The ICCV work introduces novelty in the text encoder, in the loss function, and in the image encoder. That loss function shapes the early stage of training so that it allows training with multiple languages, even at the character level.

The AAAI work (namely ADAPT, which is the one coded in this repo) uses a standard text encoder and brings novelty to the way the similarity matrix is computed (cosine similarity for all possible image-caption pairs). We use a vector derived from the sentence to filter the vectors that represent image regions, in a top-down manner. By doing so, the same image can be represented in several distinct ways depending on the textual query provided.
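For intuition, here is a minimal sketch of that top-down filtering, not the actual code in this repo: it assumes a FiLM-style scale-and-shift modulation predicted from the sentence vector, and the names (`AdaptiveFilter`, `gamma`, `beta`) are just illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFilter(nn.Module):
    """Illustrative sentence-conditioned filtering of image regions.

    The sentence vector predicts a per-channel scale and shift that
    modulate every region feature before pooling; the exact ADAPT
    formulation is in the paper and in this repo.
    """

    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Linear(dim, dim)  # scale predicted from the sentence
        self.beta = nn.Linear(dim, dim)   # shift predicted from the sentence

    def forward(self, regions, sentence):
        # regions: (n_regions, dim) features for one image
        # sentence: (dim,) embedding of one caption
        adapted = self.gamma(sentence) * regions + self.beta(sentence)
        pooled = adapted.mean(dim=0)  # one image vector *per caption*
        return F.cosine_similarity(pooled, sentence, dim=0)

# Toy similarity matrix over all image-caption pairs: each image is
# re-filtered for every caption, so its representation depends on the query.
images = torch.randn(4, 36, 1024)   # 4 images, 36 regions each (fake data)
captions = torch.randn(5, 1024)     # 5 caption embeddings (fake data)
adapt = AdaptiveFilter(1024)
sims = torch.stack([torch.stack([adapt(v, t) for t in captions])
                    for v in images])  # (4, 5) cosine similarities
```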

Q: Do they have similar performance? A: The ADAPT work provides much larger gains in predictive performance. I am currently investigating the effect of ADAPT in a multilingual setting for my thesis.

ratthachat commented 4 years ago

Thanks so much! I had the chance to attend your poster at ICCV in Korea, and I'm happy to see your follow-up work!