thies1006 opened 2 weeks ago
Currently, every single dimension needs to be divisible by the tensor parallel size. In your case, I would suggest using a tensor parallel size of 8 plus a pipeline parallel size of 2. You need to use the latest vLLM version to use pipeline parallelism.
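For reference, a minimal sketch of what such a launch might look like with the OpenAI-compatible server. The model id is a placeholder, and the Ray backend flag is an assumption based on how multi-node pipeline parallelism is typically run; check the flags against your vLLM version:

```bash
# Sketch: 2 nodes x 8 GPUs, tensor parallel 8 within a node,
# pipeline parallel 2 across nodes.
# "meta-llama/Meta-Llama-3-70B" is a placeholder model id.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray
```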
Thank you for the hint; updating vLLM to master solved the original error. But now I get a NotImplementedError:
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaModel', 'AquilaForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'Phi3ForCausalLM', 'GPT2LMHeadModel'].
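Judging from the error message, the check is gated on the architecture name declared in the model's Hugging Face config, so you can verify up front whether a given checkpoint qualifies. A minimal sketch, with a placeholder model id:

```python
# Sketch: inspect which architecture a checkpoint declares;
# vLLM's pipeline-parallel support appears to be keyed on this field.
from transformers import AutoConfig

# "your-org/your-model" is a placeholder; substitute the checkpoint you serve.
config = AutoConfig.from_pretrained("your-org/your-model")
print(config.architectures)  # must match one of the architectures listed above
```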
Your current environment
vllm==0.4.3 numpy==1.26.4 nvidia-nccl-cu12==2.20.5 torch==2.3.0 transformers==4.41.2 triton==2.3.0
🐛 Describe the bug
I don't know whether this is a bug or the model simply doesn't support this setup. I'm trying to run across two machines with 16 L4 GPUs in total, and I get this error: