Not sure if we plan to release this, but my suggestion right now is to do the experiments multiple times (we did 5, but you could do more) and then compute the averages. I do not expect huge variation.
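A minimal sketch of that averaging, where `build_order_dataset` and `score_retrieval` are placeholder names for however you construct and score the Order dataset (and the seeding assumes the shuffling is driven by numpy's RNG):

```python
import numpy as np

def run_once(model, seed):
    # Seed whatever RNG drives the caption shuffling (numpy here, as an example).
    np.random.seed(seed)
    dataset = build_order_dataset(seed=seed)   # placeholder: however you build the Order set
    scores = score_retrieval(model, dataset)   # placeholder: however you get the score tuple
    return dataset.evaluate_scores(scores)[0]["Precision@1"]

# e.g. 5 repetitions, then report mean and standard deviation
runs = [run_once(my_model, seed) for seed in range(5)]
print(f"Precision@1: {np.mean(runs):.4f} +/- {np.std(runs):.4f}")
```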
Hey @vinid, thank you for the response. I have another question regarding the evaluate_scores() function for COCO and Flickr.
```python
def evaluate_scores(self, scores):
    if isinstance(scores, tuple):
        scores_i2t = scores[0]
        scores_t2i = scores[1].T  # Make it N_ims x N_text
    else:
        scores_t2i = scores
        scores_i2t = scores

    preds = np.argmax(np.squeeze(scores_i2t, axis=1), axis=-1)
    correct_mask = (preds == 0)
    result_records = [{"Precision@1": np.mean(correct_mask)}]
    return result_records
```
This does not align with VG-Attribution and VG-Relation, where scores_i2t is taken from scores[1] rather than scores[0]. Is this a bug, or is it intended?
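To make the question concrete, here is a toy sketch of the shapes I am assuming for the Order case (the shapes and the tuple order are my guesses for illustration, not taken from the repo):

```python
import numpy as np

# Toy shapes for illustration only. I am assuming scores_i2t is
# (N_images, 1, N_captions) with the correct caption at index 0, so that
# squeeze(axis=1) + argmax + (preds == 0) gives Precision@1.
N_images, N_captions = 4, 5
scores_i2t = np.random.rand(N_images, 1, N_captions)
scores_t2i = np.random.rand(N_captions, N_images)
scores = (scores_i2t, scores_t2i)

# COCO/Flickr Order (snippet above) reads image-to-text scores from scores[0] ...
preds = np.argmax(np.squeeze(scores[0], axis=1), axis=-1)
print("Precision@1:", np.mean(preds == 0))

# ... whereas VG-Relation/VG-Attribution take scores_i2t from scores[1], which is
# why I am unsure whether the indexing above is intentional.
```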
I also noticed that the dataset creation process is random, which makes it hard to compare against the numbers in your paper. Are you planning to release the exact shuffled captions for these two datasets? Or do you have any recommendation for reporting the performance?