mlpc-ucsd / BLIVA

(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
https://arxiv.org/abs/2308.09936
BSD 3-Clause "New" or "Revised" License

Could you provide the dataset code for VisDial and the test code? #18

Closed. Thinking-more closed this issue 8 months ago.

gordonhu608 commented 8 months ago

Thank you for your interest in our work. For VisDial, we directly downloaded the official dataset, and the test code is used as shown here: https://github.com/mlpc-ucsd/BLIVA/blob/b45425a7c87d01ecc075d86c9f2376689a1c80db/bliva/models/bliva_vicuna7b.py#L519-L521

Thinking-more commented 8 months ago

Thank you for your prompt reply!

Thinking-more commented 8 months ago

The official dataset code doesn't seem to be directly adaptable to this model. Could you provide the dataset code you are using, or give a more concrete example? It's crucial for me to reproduce the experimental results. Is the following example correct: This image has the caption "a bedroom is filled with lots of posters and a busy computer desk". Dialog history: q: is the photo in color? a: yes. q: is it a professional photo? a: no. q: is it well lit? a: no. q: is it daytime? a: i don't see windows. q: does this look like an adults bedroom? a: maybe. q: is the room clean? a: no. q: can you tell what kind of computer it is? a: no not really. q: is it a flat screen? a: yes. q: what's the desk made out of? a: cheap plastic or wood.\nQuestion: is there a computer chair? Short answer:

gordonhu608 commented 8 months ago

Yes. As mentioned in the paper, we followed the same evaluation prompt as InstructBLIP. For VisDial, it's like "Dialog history: {}\n Question: {} Short answer:". Note that Visual Dialog evaluation needs heavy GPU memory due to its long-context nature. Hope this is helpful.
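For readers following along, here is a minimal sketch of how that template could be filled in. The joining of rounds into plain "q: ... a: ..." pairs is an assumption on my part (the exact punctuation and casing is what this thread is trying to pin down), not the repository's confirmed preprocessing:

```python
# Hypothetical helper: assemble the "Dialog history: {}\n Question: {} Short answer:"
# template from earlier dialog rounds. Round formatting is an assumption here.
def build_visdial_prompt(dialog_rounds, question):
    # dialog_rounds: list of (question, answer) tuples from earlier rounds
    history = " ".join(f"q: {q} a: {a}" for q, a in dialog_rounds)
    return f"Dialog history: {history}\nQuestion: {question} Short answer:"

# build_visdial_prompt([("is the photo in color", "yes")], "is there a computer chair")
# -> "Dialog history: q: is the photo in color a: yes\nQuestion: is there a computer chair Short answer:"
```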

gordonhu608 commented 8 months ago

Also, as reported, we used Mean Reciprocal Rank (MRR) as the evaluation metric for VisDial. Hope it's helpful.
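For reference, MRR averages the reciprocal of the 1-indexed rank assigned to the ground-truth answer among the candidate answers. A quick sketch of that computation (not the repository's code):

```python
# Mean Reciprocal Rank: average of 1/rank of the ground-truth answer
# over all evaluated dialog rounds. Ranks are 1-indexed.
def mean_reciprocal_rank(gt_ranks):
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

# mean_reciprocal_rank([1, 2, 4])  ->  (1 + 0.5 + 0.25) / 3 ≈ 0.583
```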

Thinking-more commented 8 months ago

Thank you for your prompt reply, but at the moment I'm experiencing some problems. I tested the model with inputs like: ["dialog history: q: is the photo in color a: yes q: is it a professional photo a: no q: is it well lit a: no q: is it daytime a: i don't see windows q: does this look like an adults bedroom a: maybe q: is the room clean a: no q: can you tell what kind of computer it is a: no not really q: is it a flat screen a: yes q: what's the desk made out of a: cheap plastic or wood\nquestion: is there a computer chair short answer:"]. I evaluated 2064 × 10 = 20640 samples in total and ended up with an MRR of only 9, so some of my settings must not match yours. Could you please point out the differences, e.g. whether to capitalize initials, whether to include punctuation, whether to include the caption, and whether to spell out "question" and "answer" in the dialog history? (I'm using the VisDial dataset from LAVIS, which abbreviates them as 'q' and 'a' and doesn't include punctuation.) My test script is as follows:

```python
import json
import torch

def cal_mrr(path):
    # Load per-sample predictions: each entry has the ground-truth answer,
    # the candidate answers, and the model-assigned rank of each candidate.
    with open(path, 'r') as f:
        pred = json.load(f)

    ranks = []
    for data in pred:
        answer = data['answer']
        candidates = data['candidates']
        idx = candidates.index(answer)   # position of the ground-truth answer
        rank = data['ranks'][idx] + 1    # convert 0-indexed rank to 1-indexed
        ranks.append(rank)

    ranks = torch.tensor(ranks).float()
    mrr = torch.mean(ranks.reciprocal()).item()  # MRR = mean(1 / rank)

    return mrr
```

Looking forward to your reply, thanks!

Thinking-more commented 8 months ago

I would appreciate it if you could provide me with your code for visdial.

gordonhu608 commented 8 months ago

Got it. I suggest running VisDial within our framework. Can you first check your model checkpoint? Are you able to reproduce the results for other datasets?

Thinking-more commented 8 months ago

Sorry, I'm not familiar with the other tasks. But I may have found the problem: max_txt_len is set to 128 by default, and since the input contains the dialog history, many samples must be longer than 128 tokens. How do you handle this in your tests? Also, could you provide a sample of your VisDial test input so that I can match your formatting exactly? Thank you very much for your patience!

gordonhu608 commented 8 months ago

We didn't apply any special treatment for VisDial; we kept everything at the default. Another detail: as mentioned in the inference section of InstructBLIP's paper, many datasets, including VisDial, are evaluated with a rank-based approach, which is implemented in this file: https://github.com/mlpc-ucsd/BLIVA/blob/5c64594d15350cc4472fa5e6c64f98bbe34670b3/evaluate.py#L53-L63
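For intuition, rank-based evaluation scores each candidate answer by the model's likelihood of generating it and ranks candidates by that score. A rough sketch is below; `score_candidate` is a hypothetical stand-in for whatever loss/likelihood call the repository actually uses, not BLIVA's API:

```python
import torch

def rank_candidates(score_candidate, prompt, candidates):
    # score_candidate(prompt, answer) is a hypothetical stand-in returning the
    # model's log-likelihood (or negative loss) of `answer` given `prompt`.
    scores = torch.tensor([score_candidate(prompt, c) for c in candidates])
    # Higher score = better; rank 1 goes to the highest-scoring candidate.
    order = scores.argsort(descending=True)               # indices sorted by score
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(1, len(candidates) + 1)   # rank of each candidate
    return ranks                                          # ranks[i] is the rank of candidates[i]
```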

Thinking-more commented 8 months ago

Thanks!

Thinking-more commented 8 months ago

I misunderstood how argsort works, which led to a bug in my test script. After fixing it, I'm getting normal experimental results now.
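For anyone hitting the same pitfall: a single argsort gives the indices that would sort the scores, not the rank of each score; to get per-element ranks you need to argsort twice. This is an illustration of the general pitfall, not necessarily the exact bug in this thread:

```python
import torch

scores = torch.tensor([0.2, 0.9, 0.5])

order = scores.argsort(descending=True)   # tensor([1, 2, 0]) -> sorting indices, NOT ranks
ranks = order.argsort() + 1               # tensor([3, 1, 2]) -> 1-indexed rank of each score
```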