salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

About evaluation scores on VQA v2.0 dataset #59

Open sdc17 opened 2 years ago

sdc17 commented 2 years ago

Hello, thanks for your nice work! I am having trouble reproducing the reported score on the VQA task. I evaluated the checkpoint downloaded from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth and followed the default settings in ./config/vqa.yaml. However, when I evaluated the generated results on the official server, I only got 77.44 on test-dev, which is significantly lower than the 78.25 reported in the paper. Are there any possible reasons for this performance degradation? Thanks!
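
For reference, the result file I upload follows the standard VQA v2 submission format (a list of question_id/answer pairs). A minimal sanity check of the generated file looks roughly like this (the path is just an example from my output directory):

```python
import json

# Path is an example; point it at the result file your evaluation run produced.
with open('output/vqa/result/vqa_result.json') as f:
    results = json.load(f)

# The official server expects a list of {"question_id": int, "answer": str}.
assert isinstance(results, list)
assert all('question_id' in r and 'answer' in r for r in results)
print(f'{len(results)} answers, first entry: {results[0]}')
```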

LiJunnan1992 commented 2 years ago

Hi, you should be able to get the reported result by running python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate. It will automatically take care of model downloading.
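
For a quick single-image sanity check of the same checkpoint, without the full distributed run, something along the lines of the demo notebook should also work (the image path and question below are placeholders, and the transform values are assumptions copied from that demo):

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip_vqa import blip_vqa  # run from the repo root

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Same checkpoint as above; image size 480 matches the VQA config.
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth'
image_size = 480

model = blip_vqa(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

# Preprocessing mirrors the demo notebook (bicubic resize + CLIP normalization).
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

image = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
question = 'where is the dog?'

with torch.no_grad():
    answer = model(image, question, train=False, inference='generate')
print('answer:', answer[0])
```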

lorenmt commented 2 years ago

@sdc17 @LiJunnan1992 Any updates on this issue? I have tried both the VQA checkpoint weights (BLIP w/ ViT-B, as well as BLIP w/ ViT-B and CapFilt-L) and fine-tuning from the pre-trained weights (129M); all three experiments give around 77.4x on test-dev. Is this expected?

sdc17 commented 2 years ago

@lorenmt I still haven't found what the problem is, but the official evaluation server does show that BLIP achieved the reported score.

lorenmt commented 2 years ago

@sdc17 Thanks for the update. At least we have confirmed that we both reproduce consistent numbers.

LiJunnan1992 commented 1 year ago

@lorenmt @sdc17 It has been reported by others that a discrepancy in PyTorch version can lead to different evaluation results. Can I know if your PyTorch version is 1.10? If not, could you try running the evaluation with PyTorch 1.10?

sdc17 commented 1 year ago

Thanks for the reminder! I obtained the results with PyTorch 1.11; I will try 1.10 later.
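
For reference, a quick way to double-check which build the evaluation actually runs under (the torchvision pairing in the comment is an assumption):

```python
import torch

# Confirm the PyTorch build used for evaluation.
print(torch.__version__)   # 1.10.x is the version suggested above
print(torch.version.cuda)  # CUDA build, in case that differs as well

# To pin the version, something like:
#   pip install torch==1.10.2 torchvision==0.11.3
```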