vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: supporting MllamaForCausalLM #9479

Closed · Se-Hun closed this 1 month ago

Se-Hun commented 1 month ago

🚀 The feature, motivation and pitch

MllamaForConditionalGeneration models (such as meta-llama/Llama-3.2-90B-Vision-Instruct, meta-llama/Llama-3.2-11B-Vision, etc.) are composed of MllamaVisionModel and MllamaForCausalLM.

I want to use only MllamaForCausalLM, and for this I can load the model using the code below.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

But in vLLM, this is not supported. Is it possible to add support for MllamaForCausalLM for people like me who want to use only the text part of the MllamaForConditionalGeneration model, without the vision layers?
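
For illustration, this is roughly how I would hope to load it with vLLM (hypothetical: it assumes vLLM registers MllamaForCausalLM and the checkpoint advertises that architecture, which is exactly what is not supported today):

from vllm import LLM, SamplingParams

# Hypothetical: requires MllamaForCausalLM support in vLLM; does not work today.
llm = LLM(model="meta-llama/Llama-3.2-11B-Vision-Instruct")  # text layers only
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)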

Alternatives

No response

Additional context

No response

heheda12345 commented 1 month ago

@DarkLight1337 How can we support using only the language part of other multi-modal models? And what would the user interface be? (Note that the language part of this model is not a standard Llama.)

DarkLight1337 commented 1 month ago

> @DarkLight1337 How can we support using only the language part of other multi-modal models? And what would the user interface be? (Note that the language part of this model is not a standard Llama.)

I think what OP wants is to use the language-only variants of Llama 3.2 without having to load the vision model. This isn't really a thing for other multi-modality models as they are typically built on top of standard language models (e.g. Llama, Gemma), so you might as well just use the original HF repo for those language models if you only want to use the language part.

DarkLight1337 commented 1 month ago

I think we just have to make sure MllamaForCausalLM is added to the model registry, just like a regular language model. The model should also implement load_weights.

To enhance composability, we can move the weight loading logic from MllamaForConditionalGeneration into MllamaVisionModel and MllamaForCausalLM, and use AutoWeightsLoader in MllamaForConditionalGeneration to load the weights of the submodules.
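
A rough sketch of what that could look like (illustrative only; the registry entry format, module paths, and submodule attribute names are assumptions, not the final implementation):

from typing import Iterable, Tuple

import torch
import torch.nn as nn

from vllm.model_executor.models.utils import AutoWeightsLoader


class MllamaForConditionalGeneration(nn.Module):
    # Sketch: vision_model (MllamaVisionModel) and language_model
    # (MllamaForCausalLM) would be built in __init__, each implementing
    # its own load_weights.

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Delegate weight loading to the submodules so that MllamaForCausalLM
        # can also be loaded standalone from the model registry.
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights)


# Registry entry (sketch), alongside the existing multimodal one:
#   "MllamaForCausalLM": ("mllama", "MllamaForCausalLM"),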

DarkLight1337 commented 1 month ago

By the way, there should be no need to set HF-specific attributes in vLLM, such as config_class, base_model_prefix, and _no_split_modules.

heheda12345 commented 1 month ago

But how should users use this MllamaForCausalLM class? When calling LLM(model='meta-llama/Llama-3.2-11B-Vision'), we should return the MllamaForConditionalGeneration class.

DarkLight1337 commented 1 month ago

That is a good point. Perhaps for now, users can fork the model repository with the architectures field set to MllamaForCausalLM. Later we can work on a PR to let users override the architecture name in vLLM, which we already support for rope_scaling and rope_theta.
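
For reference, a minimal sketch of the kind of per-model override we already support (the exact engine argument name here is my assumption of the current interface); an architecture override could be exposed in the same way:

from vllm import LLM

# rope_theta / rope_scaling can already be overridden when constructing the engine;
# a future "architecture" override (hypothetical) would follow the same pattern.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    rope_theta=1000000.0,  # overrides the value from the HF config.json
)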

DarkLight1337 commented 1 month ago

Actually, I just found that the text-only variants of Llama-3.2 use the regular LlamaForCausalLM architecture (e.g. https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/config.json). @Se-Hun is this not sufficient for you?

Se-Hun commented 1 month ago

@DarkLight1337 MllamaForConditionalGeneration has additional text layers. For instance, the meta-llama/Llama-3.2-11B-Vision-Instruct model (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) includes an additional 9.7B parameters of text layers (unlike meta-llama/Llama-3.1-8B-Instruct). Therefore, I believe the MllamaForCausalLM derived from MllamaForConditionalGeneration is different from LlamaForCausalLM. Is it possible to make MllamaForCausalLM available for these cases?

DarkLight1337 commented 1 month ago

> That is a good point. Perhaps for now, users can fork the model repository with the architectures field set to MllamaForCausalLM. Later we can work on a PR to let users override the architecture name in vLLM, which we already support for rope_scaling and rope_theta.

Does this workaround sound good to you? If so, we can implement this.

Se-Hun commented 1 month ago

Umm, sorry. I didn't understand your workaround. Can I get some examples?

DarkLight1337 commented 1 month ago

> Umm, sorry. I didn't understand your workaround. Can I get some examples?

I mean that we can create a PR to register MllamaForCausalLM in vLLM. Afterwards, you can edit your local copy of the Llama-3.2 HF repository (or fork it) and change the architectures field of config.json to MllamaForCausalLM.
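
For example, something along these lines (a sketch of the workaround; the local path is just a placeholder):

import json
from pathlib import Path

# Local copy (or fork) of meta-llama/Llama-3.2-11B-Vision-Instruct.
cfg_path = Path("./Llama-3.2-11B-Vision-Instruct/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["MllamaForCausalLM"]  # was ["MllamaForConditionalGeneration"]
cfg_path.write_text(json.dumps(cfg, indent=2))

# Then, once MllamaForCausalLM is registered in vLLM:
#   from vllm import LLM
#   llm = LLM(model="./Llama-3.2-11B-Vision-Instruct")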

Se-Hun commented 1 month ago

OK, it seems like that would work. After your implementation, I will edit the config and weights of the Llama-3.2 Vision models (e.g. meta-llama/Llama-3.2-11B-Vision-Instruct).

DarkLight1337 commented 1 month ago

@heheda12345 can you help with this? Thanks!

heheda12345 commented 1 month ago

It is still unclear to me what the user interface would look like. In addition to the normal text input, we also need the cross-attention states (which are the image embeddings) and several masks. Therefore, even after registering MllamaForCausalLM, I'm not sure how to let users use it. The current MllamaForCausalLM.forward interface in vLLM looks like:

    def forward(
        self,
        input_ids: torch.LongTensor,
        positions: Optional[torch.LongTensor],
        cross_attention_states: Optional[torch.LongTensor],
        cross_attention_mask: Optional[torch.LongTensor],
        kv_range_for_decode: Optional[List[Tuple[int, int]]],
        full_text_row_masked_out_mask: Optional[Tuple[torch.Tensor,
                                                      torch.Tensor]],
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
        skip_cross_attention: bool,
    ):

DarkLight1337 commented 1 month ago

> It is still unclear to me what the user interface would look like. In addition to the normal text input, we also need the cross-attention states (which are the image embeddings) and several masks. Therefore, even after registering MllamaForCausalLM, I'm not sure how to let users use it. The current MllamaForCausalLM.forward interface in vLLM looks like:
>
>     def forward(
>         self,
>         input_ids: torch.LongTensor,
>         positions: Optional[torch.LongTensor],
>         cross_attention_states: Optional[torch.LongTensor],
>         cross_attention_mask: Optional[torch.LongTensor],
>         kv_range_for_decode: Optional[List[Tuple[int, int]]],
>         full_text_row_masked_out_mask: Optional[Tuple[torch.Tensor,
>                                                       torch.Tensor]],
>         kv_caches: List[torch.Tensor],
>         attn_metadata: AttentionMetadata,
>         skip_cross_attention: bool,
>     ):

I think for direct use of MllamaForCausalLM, we can assume that no multi-modal data is being passed in. We can update MllamaForCausalLM.forward to have the same interface as MllamaForConditionalGeneration.forward (except that multi-modal inputs are not allowed), while MllamaForConditionalGeneration.forward calls MllamaForCausalLM.model.forward directly with the cross-attention information.
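
A rough sketch of that proposal (illustrative only; the inner model's argument names are taken from the signature quoted above and may differ from the final code):

from typing import List

import torch
import torch.nn as nn

from vllm.attention import AttentionMetadata


class MllamaForCausalLM(nn.Module):
    # Sketch: self.model is the Mllama text model built in __init__ (omitted here).

    def forward(
        self,
        input_ids: torch.LongTensor,
        positions: torch.LongTensor,
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:
        # Text-only entry point: no image inputs, so the cross-attention
        # layers are skipped entirely.
        return self.model(
            input_ids=input_ids,
            positions=positions,
            cross_attention_states=None,
            cross_attention_mask=None,
            kv_range_for_decode=None,
            full_text_row_masked_out_mask=None,
            kv_caches=kv_caches,
            attn_metadata=attn_metadata,
            skip_cross_attention=True,
        )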

heheda12345 commented 1 month ago

If no multi-modal input is allowed, all the cross-attention layers are skipped and it becomes a standard Llama (I'm not sure whether it uses the same weights as Llama). In that case, why not use the standard Llama text model?

DarkLight1337 commented 1 month ago

> @DarkLight1337 MllamaForConditionalGeneration has additional text layers. For instance, the meta-llama/Llama-3.2-11B-Vision-Instruct model (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) includes an additional 9.7B parameters of text layers (unlike meta-llama/Llama-3.1-8B-Instruct). Therefore, I believe the MllamaForCausalLM derived from MllamaForConditionalGeneration is different from LlamaForCausalLM. Is it possible to make MllamaForCausalLM available for these cases?

OP mentioned that they wanted to use the model because of its larger size. (see above quote)

heheda12345 commented 1 month ago

The 40 layers of Llama 3.2 Vision are 32 text layers (the same as Llama 3.1 8B) and 8 cross-attention layers (skipped if there is no multi-modal input). What are the "additional text layers" in your context? And what are the additional 9.7B parameters?

Se-Hun commented 1 month ago

@heheda12345 Oh, that is a good point. I didn't know the 8 cross-attention layers are not used. So the additional 9.7B parameters I mentioned would not be used anyway. Thank you. I appreciate your excellent point and the effort you've put into our discussion. (@DarkLight1337)