peteanderson80 / bottom-up-attention

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome
http://panderson.me/up-down-attention/
MIT License
1.44k stars 378 forks

About test results #4

Open jwyang opened 7 years ago

jwyang commented 7 years ago

Hi, I just ran the test code using your trained ResNet-101 model on the test set. I got the following numbers on the object detection task:

Mean AP = 0.0146
Weighted Mean AP = 0.1799
Mean Detection Threshold = 0.328

The mean AP (1.46%) is far from the number (10.2%) you reported in the table at the bottom of the readme, while the weighted mean AP is a bit higher than the number you reported. I am wondering whether there is a typo in your table.

thanks!

peteanderson80 commented 7 years ago

The mean and weighted mean numbers should be much closer than in your results - the only difference between them is the correction for class imbalance. Are you still having an issue with this?
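
For concreteness, here is a toy sketch of how the two summary numbers relate. The values are made up purely for illustration; the weights are just the per-class ground-truth counts (npos), which is what corrects for class imbalance:

```python
import numpy as np

# Made-up per-class APs and ground-truth counts, purely to illustrate how the
# two summary numbers relate; the weights are the per-class npos counts.
aps = np.array([0.30, 0.05, 0.12, 0.00])
npos = np.array([500, 20, 80, 3])

mean_ap = np.mean(aps)                       # every class counts equally
weighted_ap = np.average(aps, weights=npos)  # frequent classes count more

print('Mean AP = {:.4f}'.format(mean_ap))
print('Weighted Mean AP = {:.4f}'.format(weighted_ap))
```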

jwyang commented 7 years ago

Hi, Peter,

I checked the evaluation code again. The mean AP is computed by averaging over all 1600 entries in aps, that is:

print('Mean AP = {:.4f}'.format(np.mean(aps)))

and the weighted mean AP is computed via:

print('Weighted Mean AP = {:.4f}'.format(np.average(aps, weights=weights)))

Since only a fraction of the 1600 categories appear in the test set (231 in my run), aps has many zeros in it, so mean(aps) is bound to be low.

I guess you reported the mean AP after ruling out all categories with npos = 0 and then averaging over the remaining entries? When I do that, I get 10.11%, which is very close to your reported number.
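
To illustrate the difference, here is a rough sketch with placeholder values (random numbers standing in for the real per-class APs):

```python
import numpy as np

# Placeholder data standing in for the real per-class results: only 231 of the
# 1600 classes occur in the split, so most AP entries stay at zero.
rng = np.random.default_rng(0)
npos = np.zeros(1600, dtype=int)
npos[:231] = rng.integers(1, 200, size=231)
aps = np.where(npos > 0, rng.uniform(0.0, 0.3, size=1600), 0.0)

print('Mean AP over all 1600 classes  = {:.4f}'.format(np.mean(aps)))
print('Mean AP over classes, npos > 0 = {:.4f}'.format(np.mean(aps[npos > 0])))
```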

yuzcccc commented 7 years ago

Hi, Peter, I am running the training scripts myself (with fewer GPUs). What was the final training loss at iteration 380K when you trained the model? If possible, could you please plot a training curve or provide the training log file? Thanks a lot!

peteanderson80 commented 7 years ago

Hi @jwyang,

Sorry I haven't responded sooner. We did not exclude zeros in our calculation. It seems there is some difference in the validation set being used, because our 5000-image validation set resulted in no categories with npos = 0 in our evaluation.

Maybe something went wrong with the dataset preprocessing? To help compare I've added the eval.log file from our evaluation to the repo. If it helps I can also add our preprocessed data/cache/vg_1600-400-20_val_gt_roidb.pkl file.
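
If you want to double-check on your side, something like the following should work, assuming the usual py-faster-rcnn roidb layout (a list of dicts, each with a 'gt_classes' array); adjust if the cached format differs:

```python
import pickle
from collections import Counter

# Count ground-truth boxes per class in the cached roidb to see how many
# classes end up with npos = 0. Assumes the usual py-faster-rcnn layout
# (a list of dicts with a 'gt_classes' array); encoding='latin1' is needed
# when reading a Python 2 pickle from Python 3.
with open('data/cache/vg_1600-400-20_val_gt_roidb.pkl', 'rb') as f:
    roidb = pickle.load(f, encoding='latin1')

counts = Counter()
for entry in roidb:
    counts.update(entry['gt_classes'].tolist())

num_classes = 1600  # object classes in the 1600-400-20 split
present = sum(1 for c in range(1, num_classes + 1) if counts[c] > 0)
print('{} / {} classes have at least one ground-truth box'.format(present, num_classes))
```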

jwyang commented 7 years ago

Hi, @peteanderson80 ,

Thanks a lot for your reply and for sharing the log file. Yeah, it is very weird to me. I compared the 5000 validation images and they are the same. I will re-pull your code and regenerate the XML files to see whether I can get the same number as yours. I will let you know when I get the results.

thanks again for your help!

peteanderson80 commented 7 years ago

Hi @yuzcccc,

I don't have the original log file, but I've added an example log file from training with a single GPU for 16K iterations, which should give some indication of the expected training loss. From memory, I think the final training loss was around 4.0 (compared to about 4.8 in the example log file at iteration 16300).
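
If you want to plot a curve yourself, something like this should work on any log that follows the standard Caffe solver output ("Iteration N, loss = ..."); the log path below is just a placeholder for whatever file you have locally:

```python
import re
import matplotlib.pyplot as plt

# Pull (iteration, loss) pairs out of a standard Caffe solver log and plot
# them; 'train.log' is a placeholder path.
pattern = re.compile(r'Iteration (\d+), loss = ([\d.eE+-]+)')

iters, losses = [], []
with open('train.log') as f:
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel('iteration')
plt.ylabel('training loss')
plt.savefig('training_curve.png')
```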

peteanderson80 commented 7 years ago

Thanks @jwyang for investigating. I have shared our pickled datasets so you can see if you get the same:

jayleicn commented 5 years ago

@jwyang So what made your accuracy lower than the reported one? I used the maskrcnn-benchmark code to train/test on the same splits and only got 2.24% mAP (IoU 0.5).