[Open] sdc17 opened this issue 2 years ago
Hi, you should be able to get the reported result by running `python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate`. It will automatically take care of model downloading.
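As a side note, if the automatic download is ever in doubt, here is a minimal pre-check, purely a sketch and not part of the BLIP codebase, that pulls the VQA checkpoint into the local torch hub cache and confirms it deserializes before launching the distributed evaluation (treating the checkpoint as a dict with a 'model' entry is an assumption about its layout):

```python
# Hypothetical pre-check, not part of the BLIP codebase: fetch the VQA checkpoint
# into the torch hub cache and confirm it loads on CPU.
import torch

CKPT_URL = ("https://storage.googleapis.com/sfr-vision-language-research/"
            "BLIP/models/model_base_vqa_capfilt_large.pth")

checkpoint = torch.hub.load_state_dict_from_url(CKPT_URL, map_location="cpu")
# The released BLIP checkpoints typically wrap the weights in a 'model' entry;
# fall back to the raw dict if that assumption does not hold.
weights = checkpoint["model"] if "model" in checkpoint else checkpoint
print(f"Loaded {len(weights)} tensors, first key: {next(iter(weights))}")
```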
Hello, thanks for your nice work! I am now having trouble reproducing the reported score on the VQA task. I evaluated the checkpoint downloaded from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth and followed the default settings in ./config/vqa.yaml. However, when I evaluated the generated results on the official server, I only got 77.44 on test-dev, which is significantly lower than the 78.25 reported in the paper. Are there any possible reasons that could cause this performance degradation? Thanks!
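For anyone comparing runs, it may also help to print the evaluation-related fields that vqa.yaml actually resolves to. The sketch below is illustrative only; the key names (pretrained, image_size, batch_size_test, inference) are assumptions about a typical BLIP VQA config rather than guaranteed matches:

```python
# Illustrative config dump, not from the repo: show the settings most likely to
# change the submitted VQA results so different runs can be compared field by field.
import yaml  # pip install pyyaml

with open("./config/vqa.yaml") as f:  # path as referenced in this issue
    config = yaml.safe_load(f)

for key in ("pretrained", "image_size", "batch_size_test", "inference"):
    print(f"{key} = {config.get(key, '<not set>')}")
```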
@sdc17 @LiJunnan1992 Any updates on this issue? I have tried both the released VQA checkpoints (BLIP w/ ViT-B as well as BLIP w/ ViT-B and CapFilt-L) and fine-tuning from the pre-trained weights (129M); all three experiments give around 77.4x on test-dev. Is this expected?
@lorenmt I still haven't found what the problem is, but it can be seen from the evaluation server that BLIP did achieve the reported score.
@sdc17 Thanks for the update. At least we have confirmed that we reproduced a consistent performance.
@lorenmt @sdc17 It has been reported by others that a discrepancy in PyTorch version can lead to different evaluation results. Can I know if your PyTorch version is 1.10? If not, could you try running the evaluation with PyTorch 1.10?
Thanks for your reminder! I obtained the results with PyTorch 1.11; I will try 1.10 later.
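When comparing 1.10 against 1.11, a small environment printout before each run can rule out silent mismatches. This is purely a diagnostic sketch, not something the repo provides:

```python
# Diagnostic sketch: record the library versions most likely to shift numerics
# between otherwise identical evaluation runs.
import torch
import torchvision

print("torch       :", torch.__version__)
print("torchvision :", torchvision.__version__)
print("CUDA        :", torch.version.cuda)
print("cuDNN       :", torch.backends.cudnn.version())
```

If the test-dev score moves between the two versions, op-level numerical changes (e.g. in image resizing) would be a plausible place to look.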