Closed: Se-Hun closed this issue 1 month ago
@DarkLight1337 How can we use only the language part of other multi-modality models? And what would the user interface be? (Note that the language part of this model is not a standard Llama.)
I think what OP wants is to use the language-only variants of Llama 3.2 without having to load the vision model. This isn't really a thing for other multi-modality models as they are typically built on top of standard language models (e.g. Llama, Gemma), so you might as well just use the original HF repo for those language models if you only want to use the language part.
I think we just have to make sure `MllamaForCausalLM` is added to the model registry, just like for a regular language model. Also, the model should implement `load_weights`.

To enhance composability, we can move the weight-loading logic from `MllamaForConditionalGeneration` into `MllamaVisionModel` and `MllamaForCausalLM`, and use `AutoWeightsLoader` in `MllamaForConditionalGeneration` to load the weights of the submodules.

By the way, there should be no need to set HF-specific attributes in vLLM such as `config_class`, `base_model_prefix` and `_no_split_modules`.
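For reference, a minimal sketch of that pattern, assuming the `AutoWeightsLoader` helper from `vllm.model_executor.models.utils` (exact signatures may differ across vLLM versions):

```python
from typing import Iterable, Tuple

import torch
import torch.nn as nn

from vllm.model_executor.models.utils import AutoWeightsLoader


class MllamaForConditionalGeneration(nn.Module):
    # Sketch only: the vision_model / language_model submodules are created in
    # __init__ (omitted here) and implement their own load_weights.

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Let AutoWeightsLoader dispatch each checkpoint tensor to the
        # submodule whose prefix matches, instead of handling every key here.
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights)
```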
But how should users use this `MllamaForCausalLM` class? When calling `LLM(model='meta-llama/Llama-3.2-11B-Vision')`, we should return the `MllamaForConditionalGeneration` class.
That is a good point. Perhaps for now, users can fork the model repository with the `architectures` field set to `MllamaForCausalLM`. Later we can work on a PR to let users override the architecture name in vLLM, which we already support for `rope_scaling` and `rope_theta`.
Actually, I just found that the text-only variants of Llama-3.2 use the regular `LlamaForCausalLM` architecture (e.g. https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/config.json). @Se-Hun is this not sufficient for you?
@DarkLight1337 `MllamaForConditionalGeneration` has additional text layers. For instance, the meta-llama/Llama-3.2-11B-Vision-Instruct model (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) includes additional text layers totalling 9.7B parameters (unlike meta-llama/Llama-3.1-8B-Instruct).

Therefore, I believe the `MllamaForCausalLM` derived from `MllamaForConditionalGeneration` is different from `LlamaForCausalLM`. Is it possible to make `MllamaForCausalLM` available for these cases?
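For anyone who wants to check where the extra parameters actually live, here is a rough sketch with HF Transformers (the `language_model` and `vision_model` attribute names assume the current HF Mllama implementation):

```python
from transformers import MllamaForConditionalGeneration

# Note: this loads the full ~11B checkpoint into memory.
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", torch_dtype="auto"
)

n_text = sum(p.numel() for p in model.language_model.parameters())
n_vision = sum(p.numel() for p in model.vision_model.parameters())
print(f"text parameters:   {n_text / 1e9:.2f}B")
print(f"vision parameters: {n_vision / 1e9:.2f}B")
```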
Does the workaround I suggested above (forking the model repository and setting the `architectures` field to `MllamaForCausalLM`) sound good to you? If so, we can implement this.
Umm, sorry, I didn't understand your workaround. Can I get some examples?
I mean that we can create a PR to register `MllamaForCausalLM` in vLLM. Afterwards, you can edit your local copy of the Llama-3.2 HF repository (or fork from it) and change the `architectures` field of `config.json` to `MllamaForCausalLM`.
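For example, once that class is registered, the edit and subsequent usage could look something like this (the local path is just a placeholder):

```python
import json
from pathlib import Path

from vllm import LLM

# Local copy (or fork) of the vision checkpoint; the path is just an example.
repo = Path("./Llama-3.2-11B-Vision-Instruct-text-only")

# Point the architectures field at the (to-be-registered) text-only class.
config_path = repo / "config.json"
config = json.loads(config_path.read_text())
config["architectures"] = ["MllamaForCausalLM"]
config_path.write_text(json.dumps(config, indent=2))

# The edited checkpoint could then be loaded like a plain text model.
llm = LLM(model=str(repo))
```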
OK, it seems like that would work.

After your implementation, I will edit the config and weights of the Llama-3.2 vision models (e.g. meta-llama/Llama-3.2-11B-Vision-Instruct).
@heheda12345 can you help with this? Thanks!
It is still unclear to me what the user interface would look like. In addition to the normal text input, we also need the cross-attention states (which are the image embeddings) and several masks. Therefore, even after registering `MllamaForCausalLM`, I'm not sure how to let users use them. The current `MllamaForCausalLM.forward` interface in vLLM looks like this:
```python
def forward(
    self,
    input_ids: torch.LongTensor,
    positions: Optional[torch.LongTensor],
    cross_attention_states: Optional[torch.LongTensor],
    cross_attention_mask: Optional[torch.LongTensor],
    kv_range_for_decode: Optional[List[Tuple[int, int]]],
    full_text_row_masked_out_mask: Optional[Tuple[torch.Tensor, torch.Tensor]],
    kv_caches: List[torch.Tensor],
    attn_metadata: AttentionMetadata,
    skip_cross_attention: bool,
):
    ...
```
I think for direct use of `MllamaForCausalLM`, we can assume that no multi-modal data is being passed in. We can update `MllamaForCausalLM.forward` to have the same interface as `MllamaForConditionalGeneration.forward` (except that multi-modal inputs are not allowed), while `MllamaForConditionalGeneration.forward` can call `MllamaForCausalLM.model.forward` directly with the cross-attention information.
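A very rough sketch of that split (simplified signatures; helper names like `_compute_cross_attention_inputs` are placeholders, not the actual vLLM code):

```python
from typing import List

import torch
import torch.nn as nn

from vllm.attention import AttentionMetadata


class MllamaForCausalLM(nn.Module):
    # self.model (the Mllama text backbone) is created in __init__ (omitted).

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:
        # Text-only entry point: no image inputs, so the cross-attention
        # layers are skipped entirely.
        return self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            cross_attention_states=None,
            skip_cross_attention=True,
        )


class MllamaForConditionalGeneration(nn.Module):
    # self.vision_model and self.language_model are created in __init__ (omitted).

    def forward(self, input_ids, positions, kv_caches, attn_metadata, **kwargs):
        # Multi-modal entry point: compute the image embeddings and masks,
        # then call the shared text backbone directly with them.
        cross_inputs = self._compute_cross_attention_inputs(**kwargs)
        return self.language_model.model(
            input_ids, positions, kv_caches, attn_metadata, **cross_inputs
        )
```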
If no multi-modal input is allowed, all the cross-attention layers are skipped and it becomes a standard Llama (I'm not sure whether it uses the same weights as Llama). In that case, why not use the standard Llama text model?
> @DarkLight1337 `MllamaForConditionalGeneration` has additional text layers. For instance, the meta-llama/Llama-3.2-11B-Vision-Instruct model (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) includes additional text layers totalling 9.7B parameters (unlike meta-llama/Llama-3.1-8B-Instruct). Therefore, I believe the `MllamaForCausalLM` derived from `MllamaForConditionalGeneration` is different from `LlamaForCausalLM`. Is it possible to make `MllamaForCausalLM` available for these cases?

OP mentioned that they wanted to use the model because of its larger size (see the quote above).
The 40 layers of Llama 3.2 Vision are 32 text layers (the same as Llama-3.1-8B) and 8 cross-attention layers (which are skipped if there is no multi-modal input). What are the "additional text layers" in your context, and what are the additional 9.7B parameters?
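A quick way to check this from the checkpoint's config (field names assume the current HF Mllama config layout):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Total number of decoder layers in the text model, and the indices of the
# layers that are cross-attention layers (expected: 40 layers in total,
# 8 of which are cross-attention).
print(cfg.text_config.num_hidden_layers)
print(cfg.text_config.cross_attention_layers)
```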
@heheda12345 Oh, that is a good point. I didn't know that the 8 cross-attention layers are not used. So the additional 9.7B text parameters I mentioned are not actually there. Thank you, I appreciate your excellent point and the effort you've put into our discussion. (@DarkLight1337)
🚀 The feature, motivation and pitch

`MllamaForConditionalGeneration` models (such as meta-llama/Llama-3.2-90B-Vision-Instruct, meta-llama/Llama-3.2-11B-Vision, etc.) are composed of `MllamaVisionModel` and `MllamaForCausalLM`. I want to use only `MllamaForCausalLM`, and I can load it with code along the lines of the sketch below. However, this is not supported in vLLM. Is it possible to support `MllamaForCausalLM` for people like me who want to use only the text-layer part of the `MllamaForConditionalGeneration` model, without using the vision layers?
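For example (rough sketch with HF Transformers; in practice this still downloads and instantiates the vision weights before discarding them):

```python
from transformers import MllamaForConditionalGeneration

# Load the full multi-modal model, then keep only the language submodule,
# which is an MllamaForCausalLM instance.
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", torch_dtype="auto"
)
language_model = model.language_model
```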
Alternatives
No response
Additional context
No response
Before submitting a new issue...