I evaluated the VQA and scene classification tasks on the model fine-tuned with GeoChatInstruct. The results are quite close to the metrics reported in the paper; however, the region captioning results fall noticeably short of the paper's.
The official evaluation results:

My results:
Note that:
I fine-tuned the model through only the first stage, i.e., I fine-tuned LLaVA-v1.5-7b on GeoChatInstruct for one epoch. I did not further fine-tune the model on only the referring and grounding samples, since the paper lacks details about the stage-2 fine-tuning.
I used Hugging Face's `evaluate` package to compute the metrics (a sketch of my metric computation follows this list).
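For reference, here is a minimal sketch of how I compute the region captioning metrics with `evaluate`, assuming ROUGE and METEOR are the metrics in question; the prediction/reference strings below are made-up placeholders, while in my actual run they come from the model outputs and the ground-truth captions:

```python
import evaluate

# Load the captioning metrics (requires the rouge_score and nltk packages).
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Placeholder data: one predicted caption and one reference caption per region.
predictions = ["a large airport with several planes parked near the terminal"]
references = ["an airport with many airplanes parked at the terminal"]

# compute() aggregates over all prediction/reference pairs.
rouge_scores = rouge.compute(predictions=predictions, references=references)
meteor_score = meteor.compute(predictions=predictions, references=references)

print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-L:", rouge_scores["rougeL"])
print("METEOR:", meteor_score["meteor"])
```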
I wonder whether I did something wrong, or whether the metric gap is caused by the missing stage-2 fine-tuning?