I want to use VLBERT to fine-tune on multi-modal knowledge graph dataset, in which each entity has an image and a paragraph description. Can I use the pre-trained model to get the representation of each entity? And how do I load my dataset into the model? Thanks!
I want to use VLBERT to fine-tune on multi-modal knowledge graph dataset, in which each entity has an image and a paragraph description. Can I use the pre-trained model to get the representation of each entity? And how do I load my dataset into the model? Thanks!