Open zhouruikun opened 11 months ago
Hi, I encountered the same issue when I downloaded the checkpoint from Beit-3's GitHub repository, but I couldn't achieve the performance mentioned in the paper. Both the retrieval task and captioning task exhibited this problem. In fact, the captioning task performed even 3.0 points worse than reported in the paper. I have serious doubts about the authenticity of Beit-3's experimental results.
Hi @zhouruikun,
We release base and large models for efficient usage. In our paper, we report the performance of a giant model. The base and large models we released, although achieving very good results, still fall short of the results of the giant model reported in the paper due to differences in model size.
@wenhui0924 Thank you for taking time out of your busy schedule to answer our questions.
May I ask whether the giant model you mentioned will be open sourced?
@wenhui0924 Could you release the pre-trained giant model? The strong results in the BEiT-3 paper are based on the giant model, while the weaker base and large models make it hard for later researchers to reach SOTA performance, and therefore hard to build on BEiT-3.
If the giant model and its checkpoint are not released, perhaps we can only treat the performance reported on GitHub as the SOTA.
Describe Model I am using (BEIT3):
The command I used: python -m torch.distributed.launch --nproc_per_node=1 run_beit3_finetuning.py --model beit3_large_patch16_224 --input_size 224 --task flickr30k --batch_size 16 --sentencepiece_model checkpoints/beit3.spm --finetune checkpoints/beit3_large_itc_patch16_224.pth --data_path datas/flickr30k --eval
The result I get: Eval result = {"tr_r10": 99.60000514984131, "tr_r5": 98.90000224113464, "tr_r1": 90.70000648498535, "ir_r10": 96.84000015258789, "ir_r5": 94.11999583244324, "ir_r1": 77.97999978065491, "average_score": 93.02333196004231} Accuracy of the network on the 5000 test images: 93.023%
I see the same problem when I evaluate the fine-tuned retrieval checkpoints on the COCO and Flickr30k datasets.
Did I make a mistake somewhere?
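For anyone comparing these numbers against a table, here is a minimal sketch for reading the logged metrics. It assumes that `average_score` is the plain mean of the six recall values (text retrieval and image retrieval at R@1, R@5, R@10); the numbers in the log are consistent with that, but this is not taken from the repository code. The names `EVAL_LOG`, `RECALL_KEYS`, and `mean_recall` are made up for this example.

```python
import json

# Eval result copied verbatim from the log line posted in this thread.
EVAL_LOG = (
    '{"tr_r10": 99.60000514984131, "tr_r5": 98.90000224113464, '
    '"tr_r1": 90.70000648498535, "ir_r10": 96.84000015258789, '
    '"ir_r5": 94.11999583244324, "ir_r1": 77.97999978065491, '
    '"average_score": 93.02333196004231}'
)

# Assumption: tr_* are text-retrieval recalls, ir_* are image-retrieval recalls.
RECALL_KEYS = ("tr_r1", "tr_r5", "tr_r10", "ir_r1", "ir_r5", "ir_r10")


def mean_recall(result: dict) -> float:
    """Average the six recall values (hypothetical helper, not from the repo)."""
    return sum(result[key] for key in RECALL_KEYS) / len(RECALL_KEYS)


result = json.loads(EVAL_LOG)
recomputed = mean_recall(result)

print(f"recomputed mean recall: {recomputed:.3f}")
print(f"reported average_score: {result['average_score']:.3f}")
for key in RECALL_KEYS:
    print(f"{key}: {result[key]:.2f}")
```

If that assumption holds, the final "Accuracy of the network" line is this six-way average rather than a single R@1 number, so it should be compared per metric (for example `tr_r1` and `ir_r1`) against whatever table is being used as the reference.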