llava-rlhf / LLaVA-RLHF

Aligning LMMs with Factually Augmented RLHF
https://llava-rlhf.github.io/
GNU General Public License v3.0

The performance of the released ckpt is much lower than the scores reported in the paper #20

Closed Weiyun1025 closed 10 months ago

Weiyun1025 commented 1 year ago

When I directly evaluate the performance of POPE using your script and the released ckpt, the obtained performance does not match the results in your Table 7.

In your paper, the F1 scores for POPE's random, popular, and adversarial settings are 83.3, 81.8, and 79.5, respectively. However, the scores I measured are 81.8, 79.7, and 78.2.

The script is shown below:

#!/bin/bash
# POPE Evaluation
# export HF_HOME=/shared/sheng/huggingface
# export XDG_CACHE_HOME=/shared/sheng/

# export CUDA_VISIBLE_DEVICES=2 

DATA_DIR="/mnt/petrelfs/share_data/wangweiyun/llava_rlhf/data/LLaVA-RLHF-Data/coco/val2014"
MODEL_DIR="/mnt/petrelfs/share_data/wangweiyun/llava_rlhf/ckpt"

# MODEL_BASE=LLaVA-RLHF-13b-v1.5-336/sft_model
# MODEL_QLORA_BASE=LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model_ckpt250
# MODEL_QLORA_BASE=LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model

MODEL_BASE=LLaVA-RLHF-7b-v1.5-224/sft_model
MODEL_QLORA_BASE=LLaVA-RLHF-7b-v1.5-224/rlhf_lora_adapter_model

MODEL_SUFFIX=$MODEL_QLORA_BASE

# SLURM
PARTITION=INTERN4
GPUS=1
GPUS_PER_NODE=${GPUS_PER_NODE:-1}
CPUS_PER_TASK=${CPUS_PER_TASK:-12}
QUOTA_TYPE="reserved"

for POPE_CAT in popular random adversarial; do
    echo ${MODEL_SUFFIX} ${POPE_CAT}
    srun -p ${PARTITION} \
        --gres=gpu:"${GPUS_PER_NODE}" \
        --ntasks="${GPUS}" \
        --ntasks-per-node="${GPUS_PER_NODE}" \
        --cpus-per-task="${CPUS_PER_TASK}" \
        --quotatype="${QUOTA_TYPE}" \
    python model_vqa.py \
        --short_eval True \
        --model-path ${MODEL_DIR}/${MODEL_BASE}/ \
        --use-qlora True --qlora-path ${MODEL_DIR}/${MODEL_QLORA_BASE} \
        --question-file \
        ./pope/coco_pope_${POPE_CAT}.jsonl \
        --image-folder \
        ${DATA_DIR} \
        --answers-file \
        ./eval/pope/answer-file-${MODEL_SUFFIX}/${POPE_CAT}.jsonl --image_aspect_ratio pad --test-prompt '\nAnswer the question using a single word or phrase.'
    python summarize_eval_pope.py \
        --answers-file ./eval/pope/answer-file-${MODEL_SUFFIX}/${POPE_CAT}.jsonl \
        --label-file ./pope/coco_pope_${POPE_CAT}.jsonl \
    1>./eval/pope/answer-file-${MODEL_SUFFIX}/${POPE_CAT}.out
done
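
For reference, the F1 I quote is the standard binary F1 with "yes" as the positive class. Below is a minimal sketch of that computation (the question_id / text / label field names and the naive yes/no parsing are assumptions about the jsonl formats, not the released summarize_eval_pope.py):

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def pope_f1(answers_file, label_file):
    # Assumed formats: model_vqa.py answers carry "question_id" and "text",
    # and the coco_pope_*.jsonl labels carry "question_id" and "label".
    answers = {a["question_id"]: a["text"].strip().lower() for a in load_jsonl(answers_file)}
    labels = {l["question_id"]: l["label"].strip().lower() for l in load_jsonl(label_file)}
    tp = fp = fn = 0
    for qid, gold in labels.items():
        pred_yes = answers.get(qid, "").startswith("yes")  # naive yes/no parsing
        if pred_yes and gold == "yes":
            tp += 1
        elif pred_yes and gold == "no":
            fp += 1
        elif not pred_yes and gold == "yes":
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
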
Weiyun1025 commented 12 months ago

Besides, I evaluated the released 7B ckpt on MMBench, LLaVA-Bench, and MMHal-Bench, and the performance is also much lower. The overall scores I obtained on MMBench, LLaVA-Bench, and MMHal-Bench are 44.59, 90.8, and 1.96, respectively, while the reported scores are 51.4, 94.1, and 2.1.

Edward-Sun commented 12 months ago

Hi Weiyun, have you tried evaluating the SFT+ model without the RLHF LoRA adapter? Your scores seem quite unexpected.

Weiyun1025 commented 12 months ago

I have evaluated the 7B SFT model. On MMBench, the overall score is 40.90. On MMHal, the average score is 2.45 and the hallucination rate is 0.55. For POPE, the F1 scores on the random, popular, and adversarial settings are 85.70, 82.63, and 79.79, respectively.

Edward-Sun commented 12 months ago

Hi Weiyun, the SFT model's results do not seem correct either. We will re-run an evaluation on our side to investigate the performance issue of the 7b model.

In the meantime, could you please check that you have installed our custom llava patch and are using torch==2.0.1+cu118? We've found that flash-attention in torch > 2.0.1 does not correctly implement the left-padding mask, which can lead to unexpected results when doing batched inference. If you are on a different PyTorch version, you can also try wrapping generation in torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False).
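
As a minimal sketch, the fallback can be wrapped around the generate call like this (model, input_ids, and image_tensor are placeholders from your eval loop, so treat this as an illustration of the workaround rather than a drop-in patch):

import torch

# Disable the flash and memory-efficient kernels, which are the ones affected
# by the left-padding issue; fall back to the math implementation instead.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=False
):
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids=input_ids,
            images=image_tensor.unsqueeze(0).to(dtype=torch.bfloat16).cuda(),
            do_sample=True,
            temperature=0.2,
            max_new_tokens=64,
            use_cache=True,
        )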

Weiyun1025 commented 12 months ago

Thank you for your advice. I have set up the environment following the guidelines, and I am using torch==2.0.1+cu118. I have also installed the patch you provided.

Additionally, I experimented with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False). However, the experimental results I obtained were almost identical to the ones mentioned above.

sIncerass commented 11 months ago

Hi @Weiyun1025, thanks for your interest!

Could you give us the script you are using for MMBench? We just re-tested and it matched the performance reported in the paper. Could you also try --image_aspect_ratio square for MMBench?

Weiyun1025 commented 11 months ago

Thank you for your response! I just used your script and launched it with the srun command in Slurm. The image_aspect_ratio has already been set to square. I also tried the eval code released in LLaVA and got similar POPE performance to what I obtained with the code provided here. Additionally, using the ckpt released by LLaVA, I was able to reproduce their reported performance, so it seems that neither the environment nor the script is the reason for the mismatch.

Here is the shell script:

#!/bin/bash
# MMBench Evaluation
# export HF_HOME=/shared/sheng/huggingface
# export XDG_CACHE_HOME=/shared/sheng/

MMBENCH_CAT='dev'

# export CUDA_VISIBLE_DEVICES=2 

DATA_DIR="/mnt/petrelfs/share_data/wangweiyun/datasets/mmbench"
MODEL_DIR="/mnt/petrelfs/share_data/wangweiyun/llava_rlhf/ckpt"

# MODEL_BASE=LLaVA-RLHF-7b-v1.5-224/sft_model
MODEL_BASE=LLaVA-RLHF-13b-v1.5-336/sft_model
MODEL_QLORA_BASE=$1

MODEL_SUFFIX=$MODEL_QLORA_BASE

# SLURM
PARTITION=INTERN4
GPUS=1
GPUS_PER_NODE=${GPUS_PER_NODE:-1}
CPUS_PER_TASK=${CPUS_PER_TASK:-12}
QUOTA_TYPE="reserved"

mkdir -p ./eval/mmbench/${MODEL_SUFFIX}

srun -p ${PARTITION} \
    --gres=gpu:"${GPUS_PER_NODE}" \
    --ntasks="${GPUS}" \
    --ntasks-per-node="${GPUS_PER_NODE}" \
    --cpus-per-task="${CPUS_PER_TASK}" \
    --quotatype="${QUOTA_TYPE}" \
python model_mmbench.py \
    --short_eval True \
    --model-path ${MODEL_DIR}/${MODEL_BASE}/ \
    --use-qlora True --qlora-path ${MODEL_DIR}/${MODEL_QLORA_BASE} \
    --question-file \
    ${DATA_DIR}/mmbench_${MMBENCH_CAT}_20230712.tsv \
    --image-folder \
    ./eval_image/ \
    --answers-file \
    ./eval/mmbench/${MODEL_SUFFIX}/answer-file-${MMBENCH_CAT}-20230712.xlsx --image_aspect_ratio square --test-prompt '\nAnswer the question using a single word or phrase.'

# submit the answer file to https://opencompass.org.cn/mmbench-submission

and the following is the eval code:

import argparse
import torch
import os
import json
from tqdm import tqdm
import shortuuid

from llava.constants import (
    IMAGE_TOKEN_INDEX,
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
    DEFAULT_IMAGE_PATCH_TOKEN,
)
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import (
    tokenizer_image_token,
    get_model_name_from_path,
    KeywordsStoppingCriteria,
)
from llava.model import *
# from LLaVA.llava.model import *
from PIL import Image
import math
from peft import PeftModel
from mmagibench import MMAGIBenchDataset
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
)

def split_list(lst, n):
    """Split a list into n (roughly) equal-sized chunks"""
    chunk_size = math.ceil(len(lst) / n)  # ceiling division: every chunk has this size except possibly the last
    return [lst[i : i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    chunks = split_list(lst, n)
    return chunks[k]

def eval_model(args):
    # Model
    disable_torch_init()
    model_path = os.path.expanduser(args.model_path)
    model_name = get_model_name_from_path(model_path)
    compute_dtype = torch.float16
    if args.use_qlora:
        tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

        bits = 16
        dtype = torch.bfloat16
        compute_dtype = torch.bfloat16

        model = LlavaLlamaForCausalLM.from_pretrained(
            model_path,
            device_map={"": "cuda:0"},
            torch_dtype=dtype,
            load_in_4bit=(bits == 4),
            load_in_8bit=(bits == 8),
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=(bits == 4),
                load_in_8bit=(bits == 8),
                llm_int8_threshold=6.0,
                llm_int8_skip_modules=["mm_projector", "lm_head"],
                llm_int8_has_fp16_weight=False,
                bnb_4bit_compute_dtype=compute_dtype,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
            ),
        )
        model = PeftModel.from_pretrained(
            model,
            args.qlora_path,
        )

        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens(
                [DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
            )
        model.resize_token_embeddings(len(tokenizer))

        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            vision_tower.load_model()
        vision_tower.to(device="cuda", dtype=compute_dtype)
        image_processor = vision_tower.image_processor
    else:
        tokenizer, model, image_processor, context_len = load_pretrained_model(
            model_path, args.model_base, model_name
        )

    questions = MMAGIBenchDataset(data_file=args.question_file)

    # questions = [
    #     json.loads(q) for q in open(os.path.expanduser(args.question_file), "r")
    # ]
    # questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
    answers_file = os.path.expanduser(args.answers_file)
    os.makedirs(os.path.dirname(answers_file), exist_ok=True)
    # ans_file = open(answers_file, "w")
    results = []
    counter = 0
    force_words = ['A', 'B', 'C', 'D', 'E']

    force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids
    for line in tqdm(questions):
        # idx = line["question_id"]
        # image_file = line["image"]
        # image_file = 'COCO_val2014_' + image_file
        # qs = line["text"]
        image = line["img"]
        question = line["question"]
        answer = line["answer"]
        options = line["options"]
        contexts = line["context"]
        index = line["index"]
        options_dict = line["options_dict"]
        category = line["category"]
        l2_category = line["l2-category"]
        qs = contexts + "\n" + question + "\n" + options if contexts is not None else question + "\n" + options
        # print(qs)
        # exit()
        cur_prompt = qs
        if model.config.mm_use_im_start_end:
            qs = (
                DEFAULT_IM_START_TOKEN
                + DEFAULT_IMAGE_TOKEN
                + DEFAULT_IM_END_TOKEN
                + "\n"
                + qs
            )
        else:
            qs = DEFAULT_IMAGE_TOKEN + "\n" + qs

        if args.test_prompt:
            qs += args.test_prompt

        conv = conv_templates[args.conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = (
            tokenizer_image_token(
                prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
            )
            .unsqueeze(0)
            .cuda()
        )

        # image = Image.open(os.path.join(args.image_folder, image_file))
        if args.image_aspect_ratio == 'pad':
            image = image.convert('RGB')
            def expand2square(pil_img, background_color):
                # print(background_color)
                width, height = pil_img.size
                if width == height:
                    return pil_img
                elif width > height:
                    result = Image.new(pil_img.mode, (width, width), background_color)
                    result.paste(pil_img, (0, (width - height) // 2))
                    return result
                else:
                    result = Image.new(pil_img.mode, (height, height), background_color)
                    result.paste(pil_img, ((height - width) // 2, 0))
                    return result
            image = expand2square(image, tuple(int(x*255) for x in image_processor.image_mean))
        image_tensor = image_processor.preprocess(image, return_tensors="pt")[
            "pixel_values"
        ][0]

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        model.config.use_cache = True
        model.config.cache_shape = (2048,)

        max_new_tokens = 1024
        if args.option_scores:
            max_new_tokens = 1
        if args.short_eval:
            max_new_tokens = 64
        with torch.inference_mode():
            output_ids = model.generate(
                input_ids=input_ids,
                images=image_tensor.unsqueeze(0).to(dtype=compute_dtype).cuda(),
                do_sample=not args.option_scores,
                temperature=args.temperature,
                top_p=args.top_p,
                num_beams=1,
                # num_beams=args.num_beams,
                # no_repeat_ngram_size=3,
                max_new_tokens=max_new_tokens,
                # stopping_criteria=[stopping_criteria],
                use_cache=True,
                return_dict_in_generate=args.option_scores,
                output_scores=args.option_scores,
                # force_words_ids=force_words_ids,
            )
        if args.option_scores:
            import numpy as np
            option_idx_dict = { k: tokenizer(f" {k}", return_tensors="pt")['input_ids'][0][-1].item() for k in options_dict}
            option_scores = [ output_ids.scores[0][0][option_idx_dict[k]].item() for k in options_dict ]
            option_idx = np.argmax(option_scores)
            outputs = list(options_dict.keys())[option_idx]
        else:
            input_token_len = input_ids.shape[1]
            n_diff_input_output = (
                (input_ids != output_ids[:, :input_token_len]).sum().item()
            )
            if n_diff_input_output > 0:
                print(
                    f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids"
                )
            outputs = tokenizer.batch_decode(
                output_ids[:, input_token_len:], skip_special_tokens=True
            )[0]
            outputs = outputs.strip()
            if outputs.endswith(stop_str):
                outputs = outputs[: -len(stop_str)]
            outputs = outputs.strip()
        # print('outputs', outputs)
        # exit()
        ans_id = shortuuid.uuid()
        result = dict()
        result["question"] = question
        result["answer"] = answer
        result.update(options_dict)
        result["prediction"] = outputs
        if category is not None:
            result["category"] = category
        if l2_category is not None:
            result["l2-category"] = l2_category
        result["index"] = index
        results.append(result)
        # ans_file.write(
        #     json.dumps(
        #         {
        #             "question_id": idx,
        #             "prompt": cur_prompt,
        #             "text": outputs,
        #             "answer_id": ans_id,
        #             "model_id": model_name,
        #             "metadata": {},
        #         }
        #     )
        #     + "\n"
        # )
        # ans_file.flush()
    # ans_file.close()
    import pandas as pd
    df = pd.DataFrame(results)
    with pd.ExcelWriter(
        args.answers_file,
        engine="xlsxwriter",
    ) as writer:
        df.to_excel(writer, index=False)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--use-qlora", type=bool, default=False)
    parser.add_argument("--qlora-path", type=str, default="")
    parser.add_argument("--short_eval", type=bool, default=False)
    parser.add_argument("--image_aspect_ratio", type=str, default='pad')
    parser.add_argument("--option_scores", type=bool, default=False)
    parser.add_argument("--test-prompt", type=str, default='\nAnswer the question using a single word or phrase.')
    args = parser.parse_args()

    if os.path.exists(args.answers_file):
        print(f"{args.answers_file} already exists. Please delete it first.")
        exit(1)
    eval_model(args)
Weiyun1025 commented 11 months ago

Here are the results of the 13B model with rlhf_lora_adapter_model_ckpt250 that I got after running the above script. I hope it helps you find the problem.

Weiyun1025 commented 11 months ago

Can you share the prediction results of the 13B model with rlhf_lora_adapter_model_ckpt250 on MMBench and MMHal? Our reproduced results still do not match the reported performance.

yfzhang114 commented 10 months ago

When I directly evaluate the performance of POPE using your script and the released ckpt, the obtained performance does not match the results in your Table 7.

In your paper, the F1 scores for POPE's random, popular, and adversarial settings are 83.3, 81.8, and 79.5, respectively. However, the scores I measured are 81.8, 79.7, and 78.2.


I get similar performance on my side with the same script as above.

sIncerass commented 10 months ago

Hi @Weiyun1025 and @yfzhang114, thanks for your interest. We recently found that this is likely due to a configuration issue when running our evaluation script: you need to turn on --option_scores True so that the model selects the answer option directly. Here is the result file we obtained with the option turned on.
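
For example, based on the MMBench script above (same placeholder paths), the flag would be added to the model_mmbench.py invocation like this:

srun -p ${PARTITION} \
    --gres=gpu:"${GPUS_PER_NODE}" \
    --ntasks="${GPUS}" \
    --ntasks-per-node="${GPUS_PER_NODE}" \
    --cpus-per-task="${CPUS_PER_TASK}" \
    --quotatype="${QUOTA_TYPE}" \
python model_mmbench.py \
    --option_scores True \
    --model-path ${MODEL_DIR}/${MODEL_BASE}/ \
    --use-qlora True --qlora-path ${MODEL_DIR}/${MODEL_QLORA_BASE} \
    --question-file ${DATA_DIR}/mmbench_${MMBENCH_CAT}_20230712.tsv \
    --image-folder ./eval_image/ \
    --answers-file ./eval/mmbench/${MODEL_SUFFIX}/answer-file-${MMBENCH_CAT}-20230712.xlsx \
    --image_aspect_ratio square \
    --test-prompt '\nAnswer the question using a single word or phrase.'

With --option_scores True, the eval code above scores the option letters at the first generated token and picks the argmax instead of decoding free-form text. Note that the flag is parsed with argparse type=bool, so any non-empty string enables it; omit it entirely to keep free-form generation.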