vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Pipeline Parallelism support for LLaMA3.2 90B Vision Model #9015

Open sekh77 opened 5 hours ago

sekh77 commented 5 hours ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

I'm trying to load the LLaMA 3.2 90B Vision model across two nodes, each with 2 A100 80GB GPUs, using tensor-parallel size = 1 and pipeline-parallel size = 4. I get the `NotImplementedError` shown below.

I'm using the latest published version of vLLM (0.6.2). Any help resolving this would be greatly appreciated. Thank you.

```text
raise NotImplementedError(
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].
```
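For context, vLLM resolves the model by the architecture names declared in its Hugging Face config and compares them against an allow-list of PP-capable architectures. Below is a minimal sketch (not vLLM code) of that same comparison; it assumes access to the gated repo, a transformers release that recognizes the Llama 3.2 Vision config, and the allow-list is copied from the error above.

```python
# Minimal sketch (not vLLM code): reproduce the comparison behind the error by
# reading the architecture names declared in the model's Hugging Face config
# and checking them against the allow-list quoted in the traceback above.
from transformers import AutoConfig

# Allow-list copied verbatim from the vLLM 0.6.2 error message.
PP_SUPPORTED = {
    "AquilaForCausalLM", "AquilaModel", "DeepseekV2ForCausalLM", "GPT2LMHeadModel",
    "InternLM2ForCausalLM", "InternLMForCausalLM", "InternVLChatModel",
    "JAISLMHeadModel", "LlamaForCausalLM", "LLaMAForCausalLM", "MistralForCausalLM",
    "MixtralForCausalLM", "NemotronForCausalLM", "Phi3ForCausalLM",
    "Qwen2ForCausalLM", "Qwen2MoeForCausalLM", "QWenLMHeadModel",
    "Qwen2VLForConditionalGeneration",
}

# Requires access to the gated repo and a transformers release that knows this config.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")

# Llama 3.2 Vision declares "MllamaForConditionalGeneration", which is not in the
# list, hence the NotImplementedError whenever pipeline_parallel_size > 1.
for arch in config.architectures or []:
    status = "supported" if arch in PP_SUPPORTED else "NOT supported"
    print(f"{arch}: pipeline parallelism {status} in vLLM 0.6.2")
```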


sekh77 commented 5 hours ago

Command that I'm using to load the model:

```
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
```
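For reference, here is a sketch of where the TP=1 / PP=4 layout described above would be configured in vLLM's offline `LLM` API. It is an illustration only, under stated assumptions: a Ray cluster already spanning both nodes, and an architecture for which pipeline parallelism is supported; on 0.6.2 this exact model still raises the `NotImplementedError` quoted earlier.

```python
# Sketch only: where the TP=1 / PP=4 knobs live in vLLM's offline LLM API.
# Assumptions: a Ray cluster already spans both nodes, and an architecture for
# which vLLM supports pipeline parallelism -- on 0.6.2 this exact model still
# raises the NotImplementedError quoted earlier in the issue.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    tensor_parallel_size=1,               # keep each layer's weights on one GPU
    pipeline_parallel_size=4,             # split the layer stack across 4 GPUs / 2 nodes
    distributed_executor_backend="ray",   # multi-node execution runs through Ray
    enforce_eager=True,
    max_num_seqs=16,
)
```

On the CLI the same knobs are exposed as `--pipeline-parallel-size` and `--distributed-executor-backend ray`, alongside the flags already shown above.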

DarkLight1337 commented 1 hour ago

Yeah, pipeline parallelism is not supported for most vision models yet. See #7168.