vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

c4ai-command-r-plus on 16 GPUs #6207

Open thies1006 opened 2 weeks ago

thies1006 commented 2 weeks ago

Your current environment

vllm==0.4.3 numpy==1.26.4 nvidia-nccl-cu12==2.20.5 torch==2.3.0 transformers==4.41.2 triton==2.3.0

🐛 Describe the bug

I don't know whether this is a bug or whether the model simply doesn't support this setup. I'm trying to run the model across two machines with 16 L4 GPUs in total, and I get this error:

[rank0]: ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=10005, ip=10.10.10.171, actor_id=92f69d368cc5f16efeca171f01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f9ada535240>)
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]:     raise e
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 243, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/commandr.py", line 379, in load_weights
[rank0]:     weight_loader(param, loaded_weight)
[rank0]:   File "/secondary/thies/.virtualenvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/commandr.py", line 84, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(shard_dim, start_idx,
[rank0]: RuntimeError: start (8) + length (1) exceeds dimension size (8).
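For reference, a minimal sketch of the kind of invocation that hits this (hypothetical repro: the model ID and the Ray-backed two-node setup are assumptions; `tensor_parallel_size` is the standard vLLM argument):

```python
from vllm import LLM

# Hypothetical repro: shard the model 16 ways across two 8-GPU nodes.
# Tensor parallelism requires every sharded dimension to be divisible by
# the TP size; per the error above, a size-8 dimension is being split 16 ways.
llm = LLM(
    model="CohereForAI/c4ai-command-r-plus",
    tensor_parallel_size=16,
)
```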
youkaichao commented 2 weeks ago

Currently, every sharded dimension needs to be divisible by the tensor parallel size. In your case, I would suggest a tensor parallel size of 8 plus a pipeline parallel size of 2. You need to use the latest vLLM version to use pipeline parallelism.
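A minimal sketch of that layout, assuming the offline `LLM` entry point (the `pipeline_parallel_size` and `distributed_executor_backend` engine arguments exist in recent vLLM, though depending on the version pipeline parallelism may only be available through the OpenAI-compatible server):

```python
from vllm import LLM

# Suggested 16-GPU layout: 8-way tensor parallel within each node,
# 2-way pipeline parallel across the two nodes. Pipeline parallelism
# needs a recent vLLM build and (at the time) the Ray backend.
llm = LLM(
    model="CohereForAI/c4ai-command-r-plus",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",
)
```

With this split, the size-8 dimension from the traceback only has to divide by 8 rather than 16, which it does.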

thies1006 commented 2 weeks ago

Thank you for the hint; this solved the original error (I updated vLLM to master). But now I get a NotImplementedError:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaModel', 'AquilaForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'Phi3ForCausalLM', 'GPT2LMHeadModel'].
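Command R+ is not covered: vLLM dispatches on the `architectures` field of the model's Hugging Face config, which for this model is `CohereForCausalLM` and is absent from the list above. A quick way to confirm the architecture name (assuming the standard Hub model ID):

```python
from transformers import AutoConfig

# Print the architecture name vLLM resolves for this model; it comes
# from the `architectures` field of the Hugging Face config.
cfg = AutoConfig.from_pretrained("CohereForAI/c4ai-command-r-plus")
print(cfg.architectures)  # expected: ['CohereForCausalLM']
```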