Open liusenling opened 7 months ago
No, I did not. Here's the ResNet encoder that extracts image features, in modely.py:
```python
def resnet_encode(model, x):
    x = model.conv1(x)
    x = model.bn1(x)
    x = model.relu(x)
    x = model.maxpool(x)
    x = model.layer1(x)
    x = model.layer2(x)
    x = model.layer3(x)
    x = model.layer4(x)
    # flatten spatial dims into a token sequence: (B, C, H, W) -> (B, H*W, C)
    x = x.view(x.size()[0], x.size()[1], -1)
    x = x.transpose(1, 2)
    return x
```
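For reference, a minimal sketch (my assumption: a standard ResNet-50 backbone with 224x224 inputs and its usual 32x downsampling, 2048 output channels) of the shapes this function produces; `resnet_encode_shapes` is a hypothetical helper for illustration only:

```python
# Sketch (assumption): trace the tensor shapes through resnet_encode for a
# ResNet-50-style backbone. Defaults assume 2048 channels and 32x stride.
def resnet_encode_shapes(batch, height, width, channels=2048, stride=32):
    """Return the shapes after layer4, after view(), and after transpose()."""
    fh, fw = height // stride, width // stride    # spatial size after layer4
    conv_shape = (batch, channels, fh, fw)        # e.g. (B, 2048, 7, 7)
    flat_shape = (batch, channels, fh * fw)       # x.view(B, C, -1)
    seq_shape = (batch, fh * fw, channels)        # x.transpose(1, 2)
    return conv_shape, flat_shape, seq_shape
```

So for a batch of two 224x224 images the encoder yields 49 visual "tokens" of dimension 2048, which is the sequence that later gets projected and concatenated with the text embeddings.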
Here's how the model combines the ResNet-encoded input with the BERT embeddings in model.py:
```python
def _bert_forward_with_image(self, inputs, datas, gate_signal=None):
    images = [data.image for data in datas]
    # text token embeddings from BERT's embedding table
    textual_embeds = self.encoder_t.embeddings.word_embeddings(inputs.input_ids)
    visual_embeds = torch.stack([image.data for image in images]).to(self.device)
    if not use_cache(self.encoder_v, images):
        visual_embeds = resnet_encode(self.encoder_v, visual_embeds)
    # project visual features into the text embedding space
    visual_embeds = self.proj(visual_embeds)
    if gate_signal is not None:
        visual_embeds *= gate_signal
    inputs_embeds = torch.cat((textual_embeds, visual_embeds), dim=1)
    batch_size = visual_embeds.size()[0]
    visual_length = visual_embeds.size()[1]
    # extend the attention mask and token type ids to cover the visual tokens
    attention_mask = inputs.attention_mask
    visual_mask = torch.ones((batch_size, visual_length), dtype=attention_mask.dtype, device=self.device)
    attention_mask = torch.cat((attention_mask, visual_mask), dim=1)
    token_type_ids = inputs.token_type_ids
    visual_type_ids = torch.ones((batch_size, visual_length), dtype=token_type_ids.dtype, device=self.device)
    token_type_ids = torch.cat((token_type_ids, visual_type_ids), dim=1)
    return self.encoder_t(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        return_dict=True,
    )
```
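The mask and type-id bookkeeping above can be illustrated with plain Python lists instead of tensors; `build_multimodal_masks` is a hypothetical helper written only to show the concatenation logic, not code from the repo:

```python
# Sketch (assumption): the per-example mask/type-id extension done in
# _bert_forward_with_image. Visual tokens are always attended to (mask = 1)
# and get token type id 1, while text keeps its original mask and type 0.
def build_multimodal_masks(attention_mask, token_type_ids, visual_length):
    """Append mask entries and type ids covering the visual tokens."""
    visual_mask = [1] * visual_length
    visual_types = [1] * visual_length
    return attention_mask + visual_mask, token_type_ids + visual_types
```

This is why removing the image branch breaks the forward pass: BERT receives `inputs_embeds` of length `text_len + visual_length`, so the mask and type ids must be extended to match.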
Thank you for your answer; that covers the text-image fusion. The original RpBERT paper also has an image-text relation classification task, and the relation obtained from that task is used to assist the MNER task. If there is no image-text relation classification, does that mean propagation cannot be based on the relation?
Yes. Since there would be no image to encode and concatenate into the input embeddings for BERT, the expected input and output shapes would not match, causing some troublesome errors.
No, no, no, there is no error in the code. I was just wondering whether the code for the text-image relation classification task mentioned in the paper is missing, and whether the text-image relation propagation in the title is valid without this task.
Without this task, "text-image relation propagation" would not be an accurate title. Without relation propagation through the GCN or through RpBERT, it would just be text propagation through a GCN.
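Concretely, `gate_signal` in `_bert_forward_with_image` is the hook where a relation score would enter. A minimal sketch of the gating step, using plain Python lists and a hypothetical scalar gate value (not code from the repo):

```python
# Sketch (assumption): scaling the visual token embeddings by a relation
# score, as gate_signal does in _bert_forward_with_image. Without a relation
# classifier, gate_signal is None and visual features pass through unscaled.
def apply_gate(visual_embeds, gate_signal=None):
    if gate_signal is None:
        return visual_embeds  # no relation propagation: identity
    return [[v * gate_signal for v in token] for token in visual_embeds]
```

With the relation task removed, this branch always takes the identity path, which is why the model degrades to plain text-image fusion rather than relation-gated fusion.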
Does that mean it's no longer RpBERT, just a normal multimodal named entity recognition model with a layer of GCN?
yes
Thank you for your answer. I ran your model on my own data and the results have been quite good. Does that mean the results are better without relation propagation?
I do not know.
Is the code for text-image relation classification included in the original paper's code? I could not find it.
no, only the model structure is provided.
It seems that this task was left out; only the text-image relation dataset and the code to read that dataset are provided.
Thanks for your comment. I've read the paper again; you are right.
May I ask whether you deleted the image-text relation classification task from the original RpBERT code?