
VisionEncoderDecoderModel to work with CNN-based models #22366

Closed (jbdel closed this 1 year ago)

jbdel commented 1 year ago

Feature request

Hello,

VisionEncoderDecoderModel works only with vision-transformers-based models.

For example, using a ResNet as the encoder triggers an error in the forward pass: TypeError: forward() got an unexpected keyword argument 'output_attentions'

I'm pretty sure making this pipeline work with CNN-based architectures would not be too much of a change. As a matter of fact, adding **kwargs to the ResNet forward might be enough.

Motivation

Using CNN-based vision models with transformer-based language models in VisionEncoderDecoderModel

Your contribution

/

amyeroberts commented 1 year ago

Hi @jbdel, thanks for raising this issue!

The VisionEncoderDecoder class is specifically designed to work with transformer architectures, and the decoder model expects a transformer encoder output for its encoder_hidden_states. These are activations of shape (batch_size, sequence_length, hidden_size), where each vector [i, j, :] represents the final activation for that input token/image patch. The ResNet model has a different kind of output: feature maps. As such, there are several incompatibilities beyond the encoder not accepting the output_attentions argument.
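
As a quick illustration of the shape difference (a minimal sketch; the checkpoint names are just examples):

import torch
from transformers import ViTModel, ResNetModel

pixel_values = torch.randn(1, 3, 224, 224)

# ViT-style encoder: 3D output (batch_size, sequence_length, hidden_size)
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
print(vit(pixel_values).last_hidden_state.shape)     # torch.Size([1, 197, 768])

# ResNet: 4D feature map (batch_size, num_channels, height, width)
resnet = ResNetModel.from_pretrained("microsoft/resnet-50")
print(resnet(pixel_values).last_hidden_state.shape)  # torch.Size([1, 2048, 7, 7])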

With new architectures coming out at a fast pace nowadays, it's not practical or realistic to make composite modeling classes like VisionEncoderDecoder handle every pair of encoder and decoder models. But the good thing is the code is open source, and everyone can make changes to it :).

If this is still something you are interested in, it could make an interesting question and project to share in the forums.

jbdel commented 1 year ago

Hello,

I beg to differ with your explanation. The output of a ResNet is not a different kind of output; it is also (batch_size, sequence_length, hidden_size).

Call the vector [i, j, :] what you will: a token, an image patch, a slice of a feature map. What matters in a pipeline is input/output compatibility, which is exactly what transformers and ResNet have in common.

As a matter of fact, the developers called the output of the ResNet "last_hidden_state": https://github.com/huggingface/transformers/blob/v4.27.2/src/transformers/models/resnet/modeling_resnet.py#L341

Architectures are surely coming out at a fast pace nowadays. Nonetheless, this feature request is not about the latest fancy vision model published, but about the very first architecture that enabled deep learning for computer vision.

Another thought: if Hugging Face is all about transformers, why implement the ResNet architecture that is already available in torchvision?

Finally, you suppose there will be several incompatibilities; again, I think not. A simple glance at the forward function of VisionEncoderDecoder shows that it cares only about the first output of the encoder: https://github.com/huggingface/transformers/blob/v4.27.2/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L602 which is exactly what ResNet provides.

NielsRogge commented 1 year ago

Hi,

You can use ResNet with the vision encoder-decoder framework, although it might not work out of the box, as shown by your first message (for the moment that requires forking the library and making the required changes). ResNets, like other CNNs, output feature maps of shape (batch_size, num_channels, height, width), so by default they are 4D instead of the 3D last_hidden_state of a model like ViT. See here for an example: the final feature map is of shape (batch_size, 2048, 7, 7) for a 224x224 image.

However, you can of course reshape the final feature map to get a 3D tensor which can be used for cross-attention with the decoder. This can be achieved by doing:

# Flatten the spatial dimensions of the 4D feature map into a sequence dimension:
# (batch_size, num_channels, height, width) -> (batch_size, height*width, num_channels)
batch_size, num_channels, height, width = last_hidden_state.shape
last_hidden_state = last_hidden_state.permute(0, 2, 3, 1)
last_hidden_state = last_hidden_state.reshape(batch_size, height*width, num_channels)
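
Putting that together, here is a rough sketch of a wrapper (hypothetical, not part of the library) that makes a ResNet backbone emit a ViT-style 3D last_hidden_state. Wiring it into VisionEncoderDecoderModel would still require further changes, such as matching the encoder and decoder hidden sizes:

import torch
from torch import nn
from transformers import ResNetModel
from transformers.modeling_outputs import BaseModelOutput

class ResNetEncoderWrapper(nn.Module):
    """Hypothetical wrapper that flattens a ResNet feature map into (batch, seq_len, hidden)."""

    def __init__(self, checkpoint="microsoft/resnet-50"):
        super().__init__()
        self.resnet = ResNetModel.from_pretrained(checkpoint)

    def forward(self, pixel_values, **kwargs):
        # Swallow transformer-only kwargs such as output_attentions.
        feature_map = self.resnet(pixel_values).last_hidden_state  # (B, C, H, W)
        batch_size, num_channels, height, width = feature_map.shape
        hidden_states = feature_map.permute(0, 2, 3, 1).reshape(batch_size, height * width, num_channels)
        return BaseModelOutput(last_hidden_state=hidden_states)

encoder = ResNetEncoderWrapper()
print(encoder(torch.randn(1, 3, 224, 224)).last_hidden_state.shape)  # torch.Size([1, 49, 2048])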

The reason ResNet is present in the library is that it is used as a backbone for several Transformer-based frameworks such as DETR, MaskFormer and Mask2Former, all of which are also available in the library.

jbdel commented 1 year ago

Hello.

Thank you for your answer.

I do understand there is a straightforward way to modify the code so that you can have a ResNet-to-Transformer pipeline using Hugging Face.

I have submitted this as a feature request in the hope that it will be considered for addition to the official library implementation. This would allow that pipeline to be used on the Hugging Face Hub.

Have a good day,

JB

NielsRogge commented 1 year ago

I'll mark this request as a "good first issue" since I don't have the bandwidth for it at the moment.

However, for this to work we would need to maintain a mapping that lists the models which output a 4D feature map, to make sure we permute and reshape the final hidden state as shown above. Additionally, we need to take into account that some of those models don't accept an output_attentions keyword argument.
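
To illustrate the kind of handling that implies, here is a very rough sketch (a hypothetical helper that checks the output shape and the forward signature directly, rather than keeping a hard-coded mapping):

import inspect

def encode_for_decoder(encoder, pixel_values, output_attentions=None):
    # Only forward output_attentions if the encoder's forward actually accepts it.
    kwargs = {}
    if "output_attentions" in inspect.signature(encoder.forward).parameters:
        kwargs["output_attentions"] = output_attentions
    hidden_states = encoder(pixel_values, **kwargs)[0]
    # CNN-style encoders return a 4D feature map; flatten it to (batch, seq_len, hidden).
    if hidden_states.ndim == 4:
        batch_size, num_channels, height, width = hidden_states.shape
        hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch_size, height * width, num_channels)
    return hidden_states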

sgugger commented 1 year ago

I do not think this is an issue that would be easy for a beginner to tackle, so I have removed the "Good first issue" label. Labeling issues that are too hard this way often backfires and makes beginners stop contributing instead of feeling empowered.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.