salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

How to convert the text_features into text or input_ids correctly #142

Open nuistZPZ opened 4 weeks ago

nuistZPZ commented 4 weeks ago

In your paper "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation" you show visualizations of image-text pairs in which the ALBEF model outputs text for an image, which suggests that ALBEF has this capability. However, ALBEF only has a decoder in model_vqa.py, so I would like to know how you generated that text. I borrowed the text-generation approach from the BLIP paper and used a pretrained BERT model from the Hugging Face transformers library as the text_decoder, as shown in the code below. The generated results are very strange: they are always the same few words, and although the loss has gone down, the quality of the generated text is still very poor.

-----------code--------------

import torch
from transformers import BertLMHeadModel

# text_decoder starts out as a checkpoint name/path and is replaced by the loaded model.
text_decoder = BertLMHeadModel.from_pretrained(text_decoder, config=config_decoder)
num_beams = 3

# question_states: the encoder states the decoder will cross-attend to,
# repeated once per beam so every beam hypothesis sees the same context.
question_states = text_output.last_hidden_state.repeat_interleave(num_beams, dim=0)
question_atts = torch.ones(question_states.size()[:-1], dtype=torch.long).to(question_states.device)
model_kwargs = {"encoder_hidden_states": question_states, "encoder_attention_mask": question_atts}

# Start decoding from token id 0 for every image in the batch
# (in the standard bert-base-uncased vocabulary, id 0 is [PAD]).
bos_ids = torch.full((image.size(0), 1), fill_value=0, dtype=torch.long, device=image.device)

outputs = text_decoder.generate(input_ids=bos_ids,
                                max_length=10,
                                min_length=1,
                                num_beams=num_beams,
                                eos_token_id=self.tokenizer.sep_token_id,
                                pad_token_id=self.tokenizer.pad_token_id,
                                **model_kwargs)

# Decode each beam-search output back into a string.
for output in outputs:
    answer = self.tokenizer.decode(output, skip_special_tokens=True)
    print(answer)
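
For reference, a minimal sketch of the decoder configuration that the snippet above assumes (the bert-base-uncased checkpoint name is only a placeholder, not the exact path I load):

-----------decoder config (sketch)--------------

from transformers import BertConfig, BertTokenizer

config_decoder = BertConfig.from_pretrained('bert-base-uncased')
config_decoder.is_decoder = True            # causal (left-to-right) self-attention
config_decoder.add_cross_attention = True   # lets the decoder attend to encoder_hidden_states

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')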

-----------result--------------

sung shan shan gang gang gang gang gang gang
and and.......
a truck drives on the road past a utility pole and grassy hill
a snowboarder flies through the air while holding their board with one hand

I hope you can tell me the correct way to generate text.
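
For completeness, this is roughly how the decoder loss I mentioned is computed: a standard teacher-forced language-modelling loss over the caption tokens, conditioned on the fused image-text states (again only a sketch; variable names such as captions are illustrative):

-----------decoder loss (sketch)--------------

# Teacher-forced captioning loss: BertLMHeadModel shifts the labels by one
# position internally and predicts the next token at every step.
caption = tokenizer(captions, padding='longest', truncation=True, max_length=30,
                    return_tensors='pt').to(image.device)
labels = caption.input_ids.masked_fill(caption.input_ids == tokenizer.pad_token_id, -100)

encoder_atts = torch.ones(text_output.last_hidden_state.size()[:-1],
                          dtype=torch.long).to(image.device)

decoder_output = text_decoder(caption.input_ids,
                              attention_mask=caption.attention_mask,
                              encoder_hidden_states=text_output.last_hidden_state,
                              encoder_attention_mask=encoder_atts,
                              labels=labels,
                              return_dict=True)
loss = decoder_output.loss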
