rowanz / neural-motifs

Code for Neural Motifs: Scene Graph Parsing with Global Context (CVPR 2018)
https://rowanzellers.com/neuralmotifs
MIT License

evaluation code for recall @k #66

Closed wtliao closed 5 years ago

wtliao commented 5 years ago

Thanks for your excellent work and the nice code. When reading your evaluation code in sg_eval.py, I noticed that it differs from other people's code for visual relationship detection.

#  Line114-119 in sg_eval.py
for k in result_dict[mode + '_recall']:
     match = reduce(np.union1d, pred_to_gt[:k])
     rec_i = float(len(match)) / float(gt_rels.shape[0])
     result_dict[mode + '_recall'][k].append(rec_i)

This code section seems to calculate recall@k for each image separately, and then obtain the final recall@k by averaging over all tested images.

# Line 37-40 in sg_eval.py
def print_stats(self):
        print('======================' + self.mode + '============================')
        for k, v in self.result_dict[self.mode + '_recall'].items():
            print('R@%i: %f' % (k, np.mean(v)))

I summarize the steps as follows:

  1. Compute recall@k for each image; denote these as R = [r1, r2, r3, ..., rN].
  2. Final recall performance = (r1 + r2 + ... + rN) / N = np.mean(R) (see the sketch below).
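For concreteness, here is a minimal sketch of that per-image ("macro") averaging, assuming pred_to_gt_per_image[i] holds the pred_to_gt list for image i (as built in sg_eval.py) and num_gt_per_image[i] is its number of ground-truth relationships; the function and variable names are mine, not from the repo:

# Hypothetical sketch of the per-image averaging described above (my own names)
from functools import reduce
import numpy as np

def macro_recall_at_k(pred_to_gt_per_image, num_gt_per_image, k=50):
    per_image_recalls = []
    for pred_to_gt, num_gt in zip(pred_to_gt_per_image, num_gt_per_image):
        # union of all GT relations hit within the top-k predictions of this image
        match = reduce(np.union1d, pred_to_gt[:k], np.array([]))
        per_image_recalls.append(len(match) / float(num_gt))  # r_i
    # final number reported: mean of the per-image recalls
    return np.mean(per_image_recalls)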

However, the first VRD paper, "Visual Relationship Detection with Language Priors", calculates the recall as follows:

  1. Count the correctly detected relationships among the top-k results for each image, denoted as correct_rel = [c1, ..., cN]. The number of ground-truth relationships per image is denoted as gt_rel = [g1, ..., gN].
  2. recall@k = (c1 + c2 + ... + cN) / (g1 + ... + gN); see the sketch below.

Some other works with published code also calculate recall@k this way, such as MSDN (lines 362-376) and FactorizableNet (lines 123-154).
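By contrast, here is a minimal sketch of the aggregate ("micro") recall used in the VRD paper, under the same assumptions and with my own names, not taken from any of these repos:

# Hypothetical sketch of the VRD-style aggregate recall (my own names)
from functools import reduce
import numpy as np

def micro_recall_at_k(pred_to_gt_per_image, num_gt_per_image, k=50):
    total_correct = 0
    total_gt = 0
    for pred_to_gt, num_gt in zip(pred_to_gt_per_image, num_gt_per_image):
        # union of all GT relations hit within the top-k predictions of this image
        match = reduce(np.union1d, pred_to_gt[:k], np.array([]))
        total_correct += len(match)  # c_i
        total_gt += num_gt           # g_i
    # recall@k = (c1 + ... + cN) / (g1 + ... + gN)
    return total_correct / float(total_gt)

The two numbers can differ noticeably when images have very different numbers of ground-truth relationships, since the micro version weights images by their GT count while the macro version weights every image equally.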

I am not sure whether I understand your code correctly, or whether I am missing something. Could you tell me whether your code works the same way? If it is different, why? Thanks a lot.

rowanz commented 5 years ago

My evaluation code is the same as Xu et al.'s, which is the past work that I compare against in the paper. You might be right that different people use different sets of evaluation metrics, which makes comparing different approaches difficult (we noted the same in our supplemental material).