technicolor-research / dsve-loc

Deep semantic-visual embedding with localization

Performance of the trained model in retrieving the correct caption for an unlabeled image #11

Open fmehralian opened 4 years ago

fmehralian commented 4 years ago

Hello, I'm trying to learn about multi-modal embedding architectures and was confused about how the existing implementations evaluate the model. I'd appreciate it if you could clarify that for me.

As far as I understand, for each image (say IMG, with corresponding caption CAP), you rank a list of captions based on their similarity to the IMG embedding (link in code). I was confused by the fact that CAP is still in the list of candidate captions. That makes sense during training; however, at test time there may be cases where we want to use the pre-trained model to retrieve a caption for an image that has no caption of its own. Shouldn't we also evaluate the model on this task, by building the candidate list from a bank of captions the model was trained on and ranking that list?
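To make the scenario concrete, here is a minimal sketch of what I have in mind. The `image_encoder` and `caption_encoder` stand-ins and the caption bank are placeholders I made up, not the actual code in this repo:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 256  # embedding dimension (arbitrary for this sketch)

# Stand-ins for the model's two branches; in practice these would be the
# trained visual and textual encoders, not random outputs.
def image_encoder(img):            # img: preprocessed image tensor (3, H, W)
    return torch.rand(d)

def caption_encoder(captions):     # captions: list of strings
    return torch.rand(len(captions), d)

# Bank of candidate captions collected beforehand (e.g. from the training split).
caption_bank = ["a dog runs on the beach", "a man rides a bicycle", "two kids play soccer"]
bank_emb = F.normalize(caption_encoder(caption_bank), dim=1)   # (N, d)

# Embed a new, unlabeled image and rank every caption in the bank by cosine similarity.
new_image = torch.rand(3, 224, 224)                            # stand-in for a real image
img_emb = F.normalize(image_encoder(new_image), dim=0)         # (d,)
scores = bank_emb @ img_emb                                    # (N,)
ranked = scores.argsort(descending=True).tolist()
print([caption_bank[i] for i in ranked])                       # captions, best match first
```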

M-Eng commented 4 years ago

I'm not sure I completely understand your question. The evaluation done here is the standard way to benchmark multimodal retrieval. To clarify, the images and captions used for evaluation have never been seen by the model. And to evaluate cross-modal retrieval, we need to know the ground truth linking images to captions; in other words, we need pairs of images and captions from which we can compute the rank of one when the other is used as the query.

Maybe what is not clear is the score matrix: it contains the cosine similarity between all possible pairs of images and captions, but at evaluation time we only look at all the columns together (for image retrieval) or all the rows (for caption retrieval).
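As a rough illustration of that protocol (a sketch with random stand-in embeddings, not the exact evaluation code in this repo), with L2-normalized embeddings the whole evaluation reduces to one matrix product plus rank look-ups:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d = 1000, 256  # N test image/caption pairs, embedding dimension (arbitrary here)

# Stand-ins for embeddings produced by the two branches on the *test* split.
# Row i of img_emb and row i of cap_emb form a ground-truth pair.
img_emb = F.normalize(torch.randn(N, d), dim=1)
cap_emb = F.normalize(torch.randn(N, d), dim=1)

# Score matrix over all possible pairs: entry (i, j) = cosine(image_i, caption_j).
scores = img_emb @ cap_emb.t()                      # (N, N)

def recall_at_k(score_rows, k):
    # For each query (one row), rank the candidates and check whether the
    # ground-truth match (the diagonal element) lands in the top k.
    ranks = score_rows.argsort(dim=1, descending=True)
    gt = torch.arange(score_rows.size(0)).unsqueeze(1)
    return (ranks[:, :k] == gt).any(dim=1).float().mean().item()

# Caption retrieval: each image is a query, so look along its row.
print("caption retrieval R@5:", recall_at_k(scores, 5))
# Image retrieval: each caption is a query, so look along its column (transpose).
print("image retrieval   R@5:", recall_at_k(scores.t(), 5))
```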

As for your last question, that looks like a purely textual evaluation; you could do something like that to evaluate the textual branch of the model on purely textual data.
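A minimal sketch of what such a purely textual evaluation could look like (again assuming some `caption_encoder` textual branch as a placeholder, not code from this repo): embed both the query captions and the caption bank with the textual branch, then rank by cosine similarity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def caption_encoder(captions):      # stand-in for the trained textual branch
    return torch.rand(len(captions), 256)

bank = ["a dog runs on the beach", "a man rides a bicycle", "two kids play soccer"]
queries = ["a puppy running along the shore"]

bank_emb = F.normalize(caption_encoder(bank), dim=1)      # (N, d)
query_emb = F.normalize(caption_encoder(queries), dim=1)  # (M, d)

# Rank every caption in the bank for each textual query.
scores = query_emb @ bank_emb.t()                         # (M, N)
for q, row in zip(queries, scores):
    print(q, "->", bank[row.argmax().item()])
```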