Open sekh77 opened 1 month ago
Command that I'm using to load the model:
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
Yeah, PP is not supported for encoder-decoder models yet. See https://github.com/vllm-project/vllm/pull/7168#issuecomment-2391498161
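In the meantime, one possible workaround (a sketch only; the head-node IP and port below are placeholders, and I haven't verified this on your exact setup) is to span all four GPUs with tensor parallelism instead, joining the two nodes into a single Ray cluster:

```bash
# On the head node (placeholder IP 10.0.0.1):
ray start --head --port=6379

# On the second node, join the same Ray cluster:
ray start --address=10.0.0.1:6379

# Then launch vLLM from the head node, spreading tensor parallelism across all 4 GPUs:
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct \
    --enforce-eager --max-num-seqs 16 \
    --tensor-parallel-size 4 \
    --distributed-executor-backend ray
```

Cross-node tensor parallelism is communication-heavy, so throughput will depend heavily on the interconnect between the two nodes.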
Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```

Model Input Dumps
No response
🐛 Describe the bug
I'm trying to load the Llama 3.2 90B Vision model across two nodes, each with 2 A100 80GB GPUs, using tensor-parallel size 1 and pipeline-parallel size 4. I get the NotImplementedError below.
I'm using the latest published version of vLLM (0.6.2). Any help resolving this would be greatly appreciated. Thank you.
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].
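For reference, a launch matching the description above (tensor-parallel size 1, pipeline-parallel size 4) would look roughly like this; the flags are inferred from the description rather than copied from the original command, so treat them as an assumption:

```bash
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct \
    --enforce-eager --max-num-seqs 16 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 4
```

This fails at startup because the Llama 3.2 Vision architecture is not in the pipeline-parallel-supported list printed in the error.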