huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Implement Cross Attention in LLAMA Model #27285

Open eitamar-saraf opened 11 months ago

eitamar-saraf commented 11 months ago

Feature request

The current implementation of the LLAMA model in the Hugging Face Transformers repository supports only self-attention layers, as per the standard design of decoder-only transformer models. I propose adding an option to use some or all attention layers as cross-attention layers instead of self-attention layers.

Cross-attention layers are crucial for tasks where the model needs to attend to inputs other than its own token sequence (e.g., encoder-decoder tasks such as translation, image captioning, etc.). The option to use cross-attention would enhance the LLAMA model's capabilities for a broader range of applications.

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
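To make the proposal concrete, here is a minimal sketch of what a cross-attention variant of the Llama attention block could look like. The class name `LlamaCrossAttention` and the `encoder_hidden_states` argument are hypothetical; nothing like this exists in `modeling_llama.py` today, and the sketch deliberately ignores RoPE, attention masks, and KV caching:

```python
import torch
import torch.nn as nn


class LlamaCrossAttention(nn.Module):
    """Hypothetical cross-attention block for a Llama-style decoder (illustration only)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Queries come from the decoder; keys/values come from the external (encoder) input.
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states, encoder_hidden_states):
        bsz, q_len, _ = hidden_states.shape
        kv_len = encoder_hidden_states.shape[1]

        q = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(encoder_hidden_states).view(bsz, kv_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(encoder_hidden_states).view(bsz, kv_len, self.num_heads, self.head_dim).transpose(1, 2)

        # No causal mask: every decoder position may attend to the full encoder sequence.
        attn = nn.functional.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(bsz, q_len, -1)
        return self.o_proj(attn)
```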

Motivation

My motivation for this proposal stems from the need to apply the LLAMA model to tasks that inherently require cross-modal attention mechanisms. The current restriction to self-attention limits its applicability. While self-attention mechanisms are effective for a range of tasks, the flexibility of cross-attention layers could extend the model's utility, allowing researchers and developers to tackle a wider variety of problems.

Your contribution

I am willing to assist in the implementation of this feature. While I am not an expert in decoder-only architectures, I can help with the right guidance.

I look forward to discussing this further with the maintainers of the repository.

Thank you for considering my proposal.

shankarsharma8089 commented 11 months ago

@amyeroberts @eitamar-saraf Can I work on this? I think preparing separate embeddings for the different modalities and modifying the query, key, and value matrices to attend to those tokens might work.

shankarsharma8089 commented 10 months ago

@amyeroberts Can I work on this?

amyeroberts commented 10 months ago

@shankarsharma8089 One thing to note is that Llama is a decoder-only model, which explains why it is implemented the way it is in our modeling files. In general, we try to avoid changes that complicate our forward passes or the model implementation. I don't know if there's any precedent for adapting existing models like this in the library cc @ArthurZucker, who knows the LMs better than I do!

ydshieh commented 10 months ago

We do have precedent for this, for example GPT-2. Even for encoder-only models like BERT, we allow them to work as decoders and even to accept cross-attention.
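For reference, this is roughly how that precedent is used today (the checkpoint names below are just examples):

```python
from transformers import BertConfig, BertLMHeadModel, EncoderDecoderModel

# BERT configured as a decoder with cross-attention layers enabled.
config = BertConfig.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True
)
decoder = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)

# Or let EncoderDecoderModel wire the cross-attention between the two models.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
```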

However, at this moment, I'm not sure we'd want to do this: it's unclear whether it brings a lot of extra value, and we have other priorities.

ArthurZucker commented 10 months ago

Hey all! 🤗 We do have a precedent for a few models, because we tried to support them in the EncoderDecoderModel, which required this.

As both @amyeroberts and @ydshieh pointed out, this adds an extra burden to the code, and unless the community really needs this feature and we find it impactful (meaning someone has successfully trained this kind of model), we'd rather not change the code of transformers for a specific custom usage.

I recommend simply sharing it on the Hub and adding the link to the Llama resources. It's not simple because RoPE has to be adapted, and there is no real theory behind it.
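As a toy illustration of the RoPE point (plain PyTorch, not transformers code, shapes arbitrary): the rotary rotation is a function of each token's own position, so attention scores end up depending on the relative offset between query and key positions, and it is unclear what that offset should mean when queries come from the decoder sequence and keys come from a separate input sequence.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, dim):
    # Standard rotary embedding, applied per token using that token's own position.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    freqs = positions.float()[:, None] * inv_freq[None, :]
    emb = torch.cat((freqs, freqs), dim=-1)
    return x * emb.cos() + rotate_half(x) * emb.sin()

dim = 64
q = apply_rope(torch.randn(5, dim), torch.arange(5), dim)  # decoder positions 0..4
k = apply_rope(torch.randn(7, dim), torch.arange(7), dim)  # encoder positions 0..6
# Each score depends on (query_pos - key_pos); in cross-attention those two
# position axes belong to unrelated sequences, so the offset has no clear meaning.
scores = q @ k.T
```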

donthomasitos commented 7 months ago

@ArthurZucker Why shouldn't RoPE also work fine for the cross-attention layers?

ArthurZucker commented 7 months ago

To the best of my knowledge, there are no papers / research artifacts available for this, and RoPE is not designed for encoders.