vlf-silkie / VLFeedback


Cannot reproduce MM-Vet score #6

Closed. TideDra closed this issue 6 months ago.

TideDra commented 6 months ago

Hi, I tried to reproduce your results. The MME and MMHal-Bench scores I got are roughly consistent with the results reported in the paper, but my MM-Vet score is 48.2, while yours is 49.9. Moreover, the MM-Vet score I got for the raw Qwen-VL-Chat baseline is also 48.2, which means the score does not improve after DPO, while your baseline score is 45.7. I'm using the latest Qwen-VL-Chat checkpoint and your unmodified codebase. I wonder what causes the difference in MM-Vet scores for both the baseline and the DPO model. Thanks!

TobiasLee commented 6 months ago

Hi, thanks for your question. Our evaluation is based on the Qwen-VL-Chat checkpoint, and the results we obtained with the calculator.ipynb provided by MM-Vet are attached:

,rec,ocr,know,gen,spat,math,total,std,runs
QwenVL-Chat,52.3,34.6,43.1,39.7,34.7,18.8,45.7,0.0,[45.7]
Silkie,55.4,37.8,46.3,42.0,42.1,22.7,49.9,0.0,[49.9]

As for the score difference: which version of the GPT evaluator are you using? We use "gpt-4-0613" as the evaluator.
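For reference, the evaluator version is just the model name passed to the OpenAI API. A minimal sketch of such a call (not the exact calculator.ipynb code; the grading prompt is a placeholder, and this assumes the openai>=1.0 Python client):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder grading prompt; the real prompt comes from the MM-Vet calculator.
grading_prompt = "Compare the prediction with the ground truth and return a score in [0, 1]."

completion = client.chat.completions.create(
    model="gpt-4-0613",   # pin the evaluator version here
    messages=[{"role": "user", "content": grading_prompt}],
    temperature=0.0,
)
print(completion.choices[0].message.content)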

TideDra commented 6 months ago

I'm using the Hugging Face space provided by MM-Vet. Here are my results:

,rec,ocr,know,gen,spat,math,total,std,runs
Qwen-VL-Chat_raw_eval_code,51.9,41.7,41.2,38.2,44.4,30.0,48.5,0.0,[48.5]
silkie_merged_raw_eval_code,54.5,36.8,46.1,44.5,38.4,18.8,48.3,0.0,[48.3]

My evaluation code is modified from the inference code provided by the official Qwen-VL codebase:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
import json
from torch.utils.data import Dataset
import os
from tqdm import tqdm
# Note: The default behavior now has injection attack prevention off.

# Wraps the MM-Vet questions into (id, image path, question) samples.
class MMVetDataset(Dataset):
    def __init__(self, data_root) -> None:
        super().__init__()
        self.data_root = data_root
        with open(os.path.join(data_root, "mm-vet.json"), "r") as f:
            data = json.load(f)
        self.data = [(k,v) for k,v in data.items()]
    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return {'id':self.data[index][0],
                'image':os.path.join(self.data_root,'images',self.data[index][1]['imagename']),
                'question':self.data[index][1]['question']}

tokenizer = AutoTokenizer.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("/mnt/gozhang/code/VLFeedback/ckpts/silkie_merged", trust_remote_code=True)

dataset = MMVetDataset("/mnt/gozhang/code/VLFeedback/data_dir/mm-vet")

results = {}

# Run single-turn inference over all MM-Vet samples.
for data in tqdm(dataset):
    image = data['image']
    question = data['question']
    query = tokenizer.from_list_format([
        {'image': image},  # either a local path or a URL
        {'text': question},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)
    results[data['id']] = response

with open('mmvet_results.json','w') as f:
    json.dump(results,f,indent=4)

Can you share your scores evaluated by this Hugging Face space? Maybe there is a difference between the space and the notebook script.

TobiasLee commented 6 months ago

Below is the decoding script I am using:

import json
import os
from tqdm import tqdm

# some code for preparing the Qwen-VL-Chat model and tokenizer
# (model, tokenizer, and ckpt_name are assumed to be defined here)

test_set = json.load(open("mm-vet.json"))
ret = {} 
for sample_id in tqdm(test_set):
    image_file = os.path.join("images", test_set[sample_id]["imagename"])
    query = f'<img>{image_file}</img>\n{test_set[sample_id]["question"]}'
    response, _ = model.chat(tokenizer, query=query, history=None)
    ret[sample_id] = response

# save results
with open(f"results/{ckpt_name}.json", "w") as f:
    json.dump(ret, f, indent=4)

Decoded Results

QwenVL-Chat.json Silkie.json

Results from the Model Space

,rec,ocr,know,gen,spat,math,total,std,runs
Qwen-VL-Chat,52.2,34.1,43.5,39.5,34.0,18.8,45.6,0.0,[45.6]
Silkie,55.7,37.0,46.8,42.4,42.0,18.8,49.5,0.0,[49.5]

The results are consistent with those from the notebook.

TideDra commented 6 months ago

Thanks! I tried your decoding script and got your results, so the difference comes from the prompt format. tokenizer.from_list_format adds a Picture 1:<img>{img_path}</img>\n prefix to the prompt, while your prompt has no Picture 1: prefix. This is strange, because the training data actually uses this prefix, and inference should follow the same format to get the best performance. Anyway, the 49.5 score shows that your work is effective.
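For anyone comparing the two settings, here is a minimal sketch of the difference (the image path and question are placeholders; model and tokenizer are loaded as in the scripts above):

# Format A: tokenizer.from_list_format adds a numbered-image prefix.
query_a = tokenizer.from_list_format([
    {'image': 'images/example.jpg'},          # placeholder path
    {'text': 'What is shown in the image?'},  # placeholder question
])
# query_a is roughly 'Picture 1:<img>images/example.jpg</img>\nWhat is shown in the image?'

# Format B: the manual prompt from the decoding script above, without the Picture 1: prefix.
query_b = '<img>images/example.jpg</img>\nWhat is shown in the image?'

response_a, _ = model.chat(tokenizer, query=query_a, history=None)
response_b, _ = model.chat(tokenizer, query=query_b, history=None)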