Feature request
VisualBert takes two main inputs: tokenized text and tokenized images. The text tokenization is already handled by BertTokenizer, but the visual tokenization still has no support, and it is not a trivial task. These visual tokens are built from embeddings derived from a set of image regions, each one corresponding to an object found by an object detector. Here's a more detailed description of those embeddings from the paper:
As a tip, remember that different VisualBert checkpoints expect different visual embedding dimensions. You can use the examples from the model docs as a guide (see the sketch below). Also note that, since the embedding depends on an object detector, the detector should be an explicit parameter of the visual tokenizer, because different detectors will perform (and embed) differently.
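For reference, a minimal sketch of how the pieces fit together, loosely following the usage example in the model docs. `get_region_features` and `image` are hypothetical placeholders for whatever object detector is used, and the visual embedding dimension (2048 here) is checkpoint-dependent:

```python
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: already covered by BertTokenizer.
inputs = tokenizer("What is the man eating?", return_tensors="pt")

# Visual side: region features from an object detector (hypothetical helper).
# Shape: (batch_size, num_detected_regions, visual_embedding_dim); the last
# dimension must match what the checkpoint was trained with.
visual_embeds = get_region_features(image)  # e.g. torch.Size([1, 36, 2048])
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

outputs = model(**inputs)
```

A visual tokenizer in the library would essentially wrap the `get_region_features` step, taking the object detector as an explicit constructor argument so that detectors (and their feature dimensions) can be swapped.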
Motivation
Building a visual embedding is conceptually simple, but implementing it is a tedious task, and there is no standard way to handle this directly with Transformers.
Your contribution
This issue arose while building the `DummyVisualBertInputGenerator`, a prerequisite for exporting the model to ONNX in Optimum. That work is still in progress.
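For context, a rough sketch of the kind of dummy visual inputs such a generator needs to produce. The relevant parts are the shapes and reading the embedding size from `VisualBertConfig.visual_embedding_dim`; the function itself is only illustrative and not the actual Optimum implementation:

```python
import torch
from transformers import VisualBertConfig


def dummy_visual_inputs(config: VisualBertConfig, batch_size: int = 2, num_regions: int = 36):
    """Illustrative dummy visual inputs for tracing/ONNX export (not the Optimum code)."""
    # Checkpoint-specific embedding size, e.g. 512 (default), 1024 or 2048.
    embed_dim = config.visual_embedding_dim
    return {
        "visual_embeds": torch.zeros(batch_size, num_regions, embed_dim),
        "visual_token_type_ids": torch.ones(batch_size, num_regions, dtype=torch.long),
        "visual_attention_mask": torch.ones(batch_size, num_regions, dtype=torch.float),
    }
```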