Feature request
VisualBert takes two main inputs: tokenized text and tokenized images. The text tokenization is already handled by BertTokenizer, but the visual tokenization still has no support, and it is not a trivial task. These visual tokens are built from embeddings derived from a set of image regions, each one corresponding to an object found by an object detector. Here's a more detailed description of those embeddings from the paper:
As a tip, remember that different VisualBert checkpoints expect different visual embedding dimensions. You can use the examples from the model docs as a guide (see the sketch below). Also note that, since the embedding depends on an object detector, the detector should be an explicit parameter of the visual tokenizer, because different detectors will perform (and embed) differently.
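For reference, a minimal sketch of how the pieces fit together, loosely following the usage example in the model docs. `get_region_features` and `image` are hypothetical placeholders for whatever object detector is used, and the visual embedding dimension (2048 here) is checkpoint-dependent:

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: already covered by BertTokenizer.
inputs = tokenizer("What is the man eating?", return_tensors="pt")

# Visual side: region features from an object detector (hypothetical helper).
# Shape: (batch_size, num_detected_regions, visual_embedding_dim); the last
# dimension must match what the checkpoint was trained with.
visual_embeds = get_region_features(image)  # e.g. torch.Size([1, 36, 2048])
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

outputs = model(**inputs)
```

A visual tokenizer in the library would essentially wrap the `get_region_features` step, taking the object detector as an explicit constructor argument so that detectors (and their feature dimensions) can be swapped.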
Motivation
Building a visual embedding is conceptually simple, but implementing it is a tedious task, and there is no standard way to handle this directly with Transformers.
Your contribution
This issue arose while building the `DummyVisualBertInputGenerator`, a prerequisite for exporting the model to ONNX in Optimum. That work is still in progress.
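For context, a rough sketch of the kind of dummy visual inputs such a generator needs to produce. The relevant parts are the shapes and reading the embedding size from `VisualBertConfig.visual_embedding_dim`; the function itself is only illustrative and not the actual Optimum implementation:

```python
import torch
from transformers import VisualBertConfig


def dummy_visual_inputs(config: VisualBertConfig, batch_size: int = 2, num_regions: int = 36):
    """Illustrative dummy visual inputs for tracing/ONNX export (not the Optimum code)."""
    # Checkpoint-specific embedding size, e.g. 512 (default), 1024 or 2048.
    embed_dim = config.visual_embedding_dim
    return {
        "visual_embeds": torch.zeros(batch_size, num_regions, embed_dim),
        "visual_token_type_ids": torch.ones(batch_size, num_regions, dtype=torch.long),
        "visual_attention_mask": torch.ones(batch_size, num_regions, dtype=torch.float),
    }
```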