salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Prompt for BLIP2 OPT VQA #138

Open codezakh opened 1 year ago

codezakh commented 1 year ago

Hi @LiJunnan1992 and @dxli94, thanks for the great work! Do you have examples of prompts you found to work well with the OPT version of BLIP2 for VQA? I tried using FLAN-T5's prompt, but I'm wondering whether you used a different one for OPT.
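For reference, this is roughly the call I am making (a minimal sketch, not from the docs; the model_type, the image path, and the exact prompt template are my own guesses):

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the OPT variant of BLIP-2 (this model_type is an assumption on my side)
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
    )

    raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    question = "What is the dog doing?"
    prompt = f"Question: {question} Answer:"  # assumed prompt template for OPT

    answer = model.generate({"image": image, "prompt": prompt}, max_length=10)
    print(answer)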

YuanLiuuuuuu commented 1 year ago

Can you reproduce the results for OPT zero-shot VQA reported in the paper?

codezakh commented 1 year ago

Hi @YuanLiuuuuuu, this comment describes how they were able to reproduce the BLIP-2 OPT results. I haven't tried it yet: https://github.com/salesforce/LAVIS/issues/188

YuanLiuuuuuu commented 1 year ago

@codezakh Thanks!

dszpr commented 6 months ago

Hi! @codezakh @YuanLiuuuuu I noticed that in the forward function in blip2_opt.py, only the questions from the VQA dataset are used. Both the text_input and the target labels are derived from opt_tokens:

    text = [t + "\n" for t in samples["text_input"]]

    opt_tokens = self.opt_tokenizer(
        text,
        return_tensors="pt",
        padding="longest",
        truncation=True,
        max_length=self.max_txt_len,
    ).to(image.device)

    targets = opt_tokens.input_ids.masked_fill(
        opt_tokens.input_ids == self.opt_tokenizer.pad_token_id, -100
    )
    if self.prompt:
        targets[:, : self.prompt_length] = -100  # do not apply loss to the prompt

    empty_targets = (
        torch.ones(atts_opt.size(), dtype=torch.long).to(image.device).fill_(-100)
    )
    targets = torch.cat([empty_targets, targets], dim=1)

    inputs_embeds = self.opt_model.model.decoder.embed_tokens(opt_tokens.input_ids)
    inputs_embeds = torch.cat([inputs_opt, inputs_embeds], dim=1)
    attention_mask = torch.cat([atts_opt, opt_tokens.attention_mask], dim=1)

    with self.maybe_autocast():
        outputs = self.opt_model(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            return_dict=True,
            labels=targets,
        )

In the VQA task, aren't the target labels supposed to be the answers from the VQA dataset? The answers from the VQA dataset are not used in blip2_opt.py, yet they are used as target labels in blip2_t5.py. This really confuses me. Have you made any changes to blip2_opt.py?
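For clarity, here is what I expected the supervision to look like for a decoder-only model like OPT (my own sketch, not code from this repo, with placeholder question/answer strings): the question acts as a prompt, the answer is appended, and only the answer tokens contribute to the loss.

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b", use_fast=False)

    question = "Question: what is the cat doing? Answer:"  # placeholder sample
    answer = " sleeping\n"                                  # placeholder sample

    q_ids = tokenizer(question, return_tensors="pt").input_ids
    a_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids

    input_ids = torch.cat([q_ids, a_ids], dim=1)

    targets = input_ids.clone()
    targets[:, : q_ids.size(1)] = -100                 # no loss on the question prompt
    targets[targets == tokenizer.pad_token_id] = -100  # no loss on padding

    # input_ids would then be embedded and prepended with the Q-Former outputs
    # (inputs_opt / atts_opt above), with matching -100 targets for those positions.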