huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

VisualBertTokenizer #21045

mszsorondo closed 1 year ago

mszsorondo commented 1 year ago

Feature request

VisualBert takes two main inputs: tokenized text and tokenized images. BertTokenizer already handles the text tokenization, but visual tokenization has no support yet and is not a trivial task. These visual tokens are built from embeddings derived from a set of regions, each corresponding to the bounding region of an object found in the image by an object detector. Here's a more detailed description of those embeddings from the paper:

Each embedding in F is computed by summing three embeddings:

f_o, a visual feature representation of the bounding region of f, computed by a convolutional neural network.
f_s, a segment embedding indicates it is an image embedding as opposed to a text embedding.
f_p, a position embedding, which is used when alignments between words and bounding regions are provided as part of the input, and set to the sum of the position embeddings corresponding to the aligned words.
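
To make the sum described in the quote concrete, here is a minimal sketch in plain Python. The names `add_vectors`, `visual_embedding` and `position_table` are hypothetical and not part of Transformers. The sketch assumes f_o has already been projected to the same dimension as the other embeddings, and that f_p is a zero vector when no alignment is given (the paper only says f_p is used when alignments are provided).

```python
from typing import List, Sequence

Vector = List[float]


def add_vectors(*vectors: Sequence[float]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all vectors must have the same dimension")
    return [sum(components) for components in zip(*vectors)]


def visual_embedding(
    f_o: Sequence[float],
    f_s: Sequence[float],
    position_table: Sequence[Sequence[float]],
    aligned_word_positions: Sequence[int],
) -> Vector:
    """Sum the three embeddings for one region (hypothetical helper).

    f_o: visual feature of the region, assumed already projected.
    f_s: segment embedding marking the token as visual.
    position_table: position embeddings indexed by word position.
    aligned_word_positions: positions of the words aligned to this region.
    """
    if aligned_word_positions:
        # f_p is the sum of the position embeddings of the aligned words.
        f_p = add_vectors(*(position_table[i] for i in aligned_word_positions))
    else:
        # Assumption: without alignments, f_p contributes nothing.
        f_p = [0.0] * len(f_o)
    return add_vectors(f_o, f_s, f_p)
```

For example, a region aligned to the words at positions 0 and 2 gets the sum of those two position embeddings as its f_p, on top of f_o and f_s.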

As a tip, remember that different VisualBert checkpoints expect different visual embedding dimensions. You can use the examples from the model docs as a guide. Also note that, since the embedding depends on an object detector, the detector should be an explicit parameter of the visual tokenizer, because different detectors will perform differently.
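
One possible shape for such a tokenizer is sketched below, only to illustrate the request. `VisualTokenizerSketch` is a hypothetical class, not an existing Transformers API, and `detector` stands for any callable that returns one feature vector per detected region. The output keys follow the argument names VisualBert's forward pass already uses (`visual_embeds`, `visual_attention_mask`, `visual_token_type_ids`). Padding to a fixed `max_regions` is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

RegionFeatures = List[List[float]]


@dataclass
class VisualTokenizerSketch:
    """Hypothetical visual tokenizer with the detector as an explicit parameter."""

    detector: Callable[[Any], RegionFeatures]  # image -> one feature vector per region
    visual_embedding_dim: int  # must match what the chosen checkpoint expects
    max_regions: int  # regions are truncated or padded to this length

    def __call__(self, image: Any) -> Dict[str, list]:
        regions = list(self.detector(image))[: self.max_regions]
        for features in regions:
            if len(features) != self.visual_embedding_dim:
                raise ValueError(
                    f"detector returned dim {len(features)}, "
                    f"expected {self.visual_embedding_dim}"
                )
        n_real = len(regions)
        n_pad = self.max_regions - n_real
        padding = [[0.0] * self.visual_embedding_dim for _ in range(n_pad)]
        return {
            "visual_embeds": regions + padding,
            # 1 for real regions, 0 for padding.
            "visual_attention_mask": [1] * n_real + [0] * n_pad,
            # All visual tokens share one segment id, as in the model docs examples.
            "visual_token_type_ids": [1] * self.max_regions,
        }
```

The returned lists would still need to be converted to tensors with a batch dimension before being passed to the model, alongside the BertTokenizer output for the text.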

Motivation

Building a visual embedding is conceptually simple, but implementing it is a tedious task, and there is no standard way to handle this directly with Transformers.

Your contribution

This issue arose while building the DummyVisualBertInputGenerator as a prerequisite for exporting the model to ONNX in Optimum. This is still in progress.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.