vllm-project / vllm

[Feature]: Pipeline Parallelism support for LLaMA3.2 90B Vision Model #9015

Open sekh77 opened 1 month ago

sekh77 commented 1 month ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

I'm trying to load the LLaMA 3.2 90B Vision model across two nodes, each with 2 A100 80GB GPUs. I'm using tensor parallel size = 1 and pipeline parallel size = 4, and I get the `NotImplementedError` shown below.

I'm using the latest published version of vLLM (0.6.2). Any help resolving this would be greatly appreciated. Thank you.

```text
raise NotImplementedError(
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].
```
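As a quick sanity check on why this model misses that list, one can print the architecture name that vLLM matches against it. A small sketch, not from the report, assuming access to the gated checkpoint on the Hugging Face Hub:

```python
# Print the architecture name(s) declared by the checkpoint's config.json;
# vLLM compares this name against the supported-PP list in the error above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")
print(cfg.architectures)  # the name printed here is not among those listed above
```

The Vision checkpoints register under a conditional-generation architecture rather than `LlamaForCausalLM`, which is why the check rejects them even though the text backbone is Llama.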


sekh77 commented 1 month ago

Command that I'm using to load the model:

```bash
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
```
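For the TP = 1 / PP = 4 layout described in the report, the invocation would presumably look like the sketch below; `--pipeline-parallel-size` and `--distributed-executor-backend` are existing vLLM flags, and multi-node placement typically goes through the Ray backend:

```bash
# Sketch only: the layout from the report (2 nodes x 2 GPUs, TP=1, PP=4).
# This is the configuration that the NotImplementedError above rejects
# for this model in v0.6.2.
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct \
    --enforce-eager --max-num-seqs 16 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 4 \
    --distributed-executor-backend ray
```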

DarkLight1337 commented 1 month ago

Yeah, PP is not supported for encoder-decoder models yet. See https://github.com/vllm-project/vllm/pull/7168#issuecomment-2391498161
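Until PP lands for encoder-decoder models, spanning both nodes with tensor parallelism only (over Ray) is the usual stopgap. A minimal sketch, assuming two nodes with 2 GPUs each; the port and addresses are placeholders:

```bash
# Head node: start the Ray cluster (port is a placeholder).
ray start --head --port=6379

# Second node: join the cluster (<head-node-ip> is a placeholder).
ray start --address=<head-node-ip>:6379

# Head node: shard the model across all 4 GPUs with tensor parallelism only.
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct \
    --enforce-eager --max-num-seqs 16 \
    --tensor-parallel-size 4 \
    --distributed-executor-backend ray
```

Note that cross-node tensor parallelism puts all-reduce traffic on the interconnect, so throughput will depend heavily on the link between the two nodes.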