kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Evaluation code for VQAv2 #23

Open · ys-zong opened this issue 11 months ago

ys-zong commented 11 months ago

Hi, thanks again for the nice work! I was trying to reproduce the VQAv2 experiments using your pretrained weights, evaluating with the repo mentioned in the paper. However, I only get an accuracy of ~10%, so I suspect something is wrong with my code, or perhaps a different prompt format affects the performance. Could you push the VQAv2 evaluation code? That would be very helpful. Many thanks!

Here is a snippet of how I generate the answer:

import os

import torch
from tqdm import tqdm

# `utils` refers to the repo's utility module; `id_to_imgname` is a small helper
# that maps a VQAv2 image id to its val2014 filename.

def generate_answers(questions, model, root_path):
    results = []
    for question in tqdm(questions, desc="Generating answers"):
        # Load the image and compute its visual embeddings.
        img_id = question['image_id']
        img_path = os.path.join(root_path, 'val2014', id_to_imgname(img_id))
        image = utils.get_image_from_path(img_path)
        pixel_values = utils.get_pixel_values_for_model(model.model.feature_extractor, image)
        pixel_values = pixel_values.to(device=model.model.logit_scale.device, dtype=model.model.logit_scale.dtype)
        pixel_values = pixel_values[None, ...]
        imginp = model.model.get_visual_embs(pixel_values, mode='captioning')

        # Build the text prompt and prepend the visual embeddings.
        question_text = question['question']
        prompt_text = 'Q: ' + question_text + ' A:'
        input_ids = model.model.tokenizer(prompt_text, add_special_tokens=True, return_tensors="pt").input_ids.to(model.model.logit_scale.device)
        input_text_embedding = model.model.input_embeddings(input_ids)
        input_embs = torch.cat([imginp, input_text_embedding], dim=1)

        # Greedy decoding (temperature=0.0) for up to 15 tokens.
        generated_ids, _, _ = model(
            input_embs, None, None, generate=True, num_words=15, temperature=0.0, top_p=1.0)
        predicted_answer = model.model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        predicted_answer = utils.truncate_caption(predicted_answer).strip()

        question_id = question['question_id']
        results.append({
            "question_id": question_id,
            "answer": predicted_answer
        })
    return results

For the prompt, I tried both prompt_text = 'Q: ' + question_text + ' A:' and prompt_text = 'Q: ' + question_text + '\nA:'; the former performs slightly better.

kohjingyu commented 11 months ago

Hi, thanks for pointing this out! I realized this wasn't mentioned in the paper (we'll add it to the next arXiv version), but we do the same as MAGMA and "truncate the model output to the length of the longest ground truth answer". Could you try doing this and make sure to cast the outputs to lowercase with .lower()?
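
A minimal sketch of that post-processing (assumed, not the exact eval code used for the paper; `max_answer_len` is a placeholder for however the ground-truth length is chosen, and truncation is done on whitespace-separated words here, whereas the actual eval may count tokens):

def postprocess_prediction(predicted_answer, max_answer_len):
    # Lowercase the prediction and keep only the first `max_answer_len` words.
    words = predicted_answer.lower().strip().split()
    return ' '.join(words[:max_answer_len])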

Let me know if that works! I'm traveling right now but will upload the VQA eval code used when I'm back.

ys-zong commented 11 months ago

Thanks for your reply! Yes, I have cast all the outputs to lowercase.

"truncate the model output to the length of the longest ground truth answer"

Does the "longest ground truth answer" mean the longest answer of all questions or the longest answer of each question (each question has multiple GT answers)?

kohjingyu commented 11 months ago

It should be the longest answer for each question.
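
In case it helps, a hedged sketch of how the per-question length could be computed, assuming the standard VQAv2 annotation format where each entry carries an 'answers' list of {'answer': ...} dicts and lengths are counted in words:

def longest_gt_answer_len(annotation):
    # Longest ground-truth answer (in words) for this specific question.
    return max(len(a['answer'].split()) for a in annotation['answers'])

# e.g. truncated = postprocess_prediction(predicted_answer, longest_gt_answer_len(ann))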

ys-zong commented 11 months ago

Great! Now I get an accuracy of 27.5% after truncation. Thanks a lot for the help! I'd still like to check your implementation to track down the remaining minor difference (but absolutely no hurry).