ivanhe123 / RpBERT-GCN-NER

Apache License 2.0

image text relation classification #3

Open liusenling opened 7 months ago

liusenling commented 7 months ago

May I ask whether you have deleted the task of image text relation classification in the original RpBERT code?

ivanhe123 commented 7 months ago

No, I did not. Here's the ResNet encoder that extracts image features, in modely.py:

def resnet_encode(model, x):
    x = model.conv1(x)
    x = model.bn1(x)
    x = model.relu(x)
    x = model.maxpool(x)

    x = model.layer1(x)
    x = model.layer2(x)
    x = model.layer3(x)
    x = model.layer4(x)

    # Flatten the (B, C, H, W) feature map into a (B, H*W, C) token sequence
    x = x.view(x.size()[0], x.size()[1], -1)
    x = x.transpose(1, 2)
    return x
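As a sanity check on the final view/transpose (a standalone sketch, not code from the repo): it turns ResNet's (B, C, H, W) feature map into a sequence of H*W visual "tokens" of dimension C, matching BERT's (batch, seq, hidden) layout.

```python
import torch

# Hypothetical example: a batch of 2 feature maps shaped like a typical
# ResNet-50 layer4 output (2048 channels, 7x7 spatial grid).
x = torch.randn(2, 2048, 7, 7)
x = x.view(x.size(0), x.size(1), -1)  # (2, 2048, 49): flatten the 7x7 grid
x = x.transpose(1, 2)                 # (2, 49, 2048): 49 visual tokens per image
assert x.shape == (2, 49, 2048)
```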

Here's where the model combines the ResNet-encoded input with the BERT embeddings in model.py:

def _bert_forward_with_image(self, inputs, datas, gate_signal=None):
        images = [data.image for data in datas]
        textual_embeds = self.encoder_t.embeddings.word_embeddings(inputs.input_ids)
        visual_embeds = torch.stack([image.data for image in images]).to(self.device)
        if not use_cache(self.encoder_v, images):
            visual_embeds = resnet_encode(self.encoder_v, visual_embeds)
        visual_embeds = self.proj(visual_embeds)
        if gate_signal is not None:
            visual_embeds *= gate_signal
        inputs_embeds = torch.cat((textual_embeds, visual_embeds), dim=1)

        batch_size = visual_embeds.size()[0]
        visual_length = visual_embeds.size()[1]

        attention_mask = inputs.attention_mask
        visual_mask = torch.ones((batch_size, visual_length), dtype=attention_mask.dtype, device=self.device)
        attention_mask = torch.cat((attention_mask, visual_mask), dim=1)

        token_type_ids = inputs.token_type_ids
        visual_type_ids = torch.ones((batch_size, visual_length), dtype=token_type_ids.dtype, device=self.device)
        token_type_ids = torch.cat((token_type_ids, visual_type_ids), dim=1)

        return self.encoder_t(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            return_dict=True
        )
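For context, a minimal sketch of what the missing relation-classification head could look like; this is an assumption based on the RpBERT paper's design (a text-image relation classifier whose soft relevance score gates the visual features), not code from this repo. The `RelationGate` name and the 768/2 sizes are hypothetical.

```python
import torch
import torch.nn as nn

class RelationGate(nn.Module):
    """Hypothetical relation classifier: scores how relevant an image is to
    the text, producing a soft gate for the visual embeddings."""

    def __init__(self, hidden_size=768, num_relations=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_relations)

    def forward(self, cls_embedding):
        # cls_embedding: (batch, hidden) pooled text-image representation
        logits = self.classifier(cls_embedding)
        probs = torch.softmax(logits, dim=-1)
        # Probability of the "relevant" class, reshaped to (batch, 1, 1) so it
        # broadcasts over the (batch, visual_length, hidden) visual embeddings.
        return probs[:, 1:2].unsqueeze(1)
```

Such a score would then be passed as the `gate_signal` argument of `_bert_forward_with_image`, where it multiplies `visual_embeds` before fusion.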
liusenling commented 7 months ago

Thank you for your answer; that covers the text and image fusion. But the original RpBERT paper includes an image-text relation classification task, and the relation predicted by that task is used to assist the MNER task. If there is no image-text relation classification, does that mean propagation based on the relation is not possible?

ivanhe123 commented 7 months ago

Yes. Without it there would be no image features to encode and concatenate into the input embeddings for BERT, so the expected input and output shapes would not match, causing some troublesome errors.

liusenling commented 7 months ago

No, no, no, there is no error in the code. I just wonder whether the code for the text-image relation classification task mentioned in the paper is missing, and whether the text-image relationship propagation in the title is valid without that task.

ivanhe123 commented 7 months ago

Without this task, "text-image relationship propagation" would not be an accurate title. If there is no text-image relationship propagation through the GCN network or through RpBERT, the title should instead be something like "text propagation through GCN networks".

liusenling commented 7 months ago

Does that mean it's no longer RpBERT, just a normal multimodal named entity recognition model with a layer of GCN?

ivanhe123 commented 7 months ago

yes

liusenling commented 7 months ago

Thank you for your answer. I ran your model on my own data, and the results have been quite good. Does that mean the results are better without relation propagation?

ivanhe123 commented 7 months ago

I do not know.

liusenling commented 7 months ago

Is the code for text-image relation classification included in the original paper's code? I could not find it.

ivanhe123 commented 7 months ago

no, only the model structure is provided.

liusenling commented 7 months ago

It seems that this task was not implemented there; only the text-image relationship dataset and the code to read that dataset are provided.

ivanhe123 commented 7 months ago

Thank you for your comment. I've read the paper again. You are right.