vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Can vllm use a method similar to device_map in transformers? #6601

Open orderer0001 opened 4 months ago

orderer0001 commented 4 months ago

Your current environment

How would you like to use vllm

I have three 4090 GPUs with 24 GB each (72 GB total), and the model I need to deploy requires at least 52 GB. The issue is that tensor-parallel deployment requires the model's number of attention heads (32) to be divisible by the number of GPUs, which is clearly not possible with 3 GPUs. Can vLLM use a method similar to device_map in transformers to specify how each layer is placed across devices, to solve this problem?

youkaichao commented 4 months ago

you can use --pipeline-parallel-size 3, see https://docs.vllm.ai/en/latest/serving/distributed_serving.html
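
For reference, a minimal launch sketch along the lines of that doc page; the model path is taken from later in this thread, and the exact flags and backend behaviour may vary between vLLM versions:

```bash
# Sketch: serve the model across 3 GPUs with pipeline parallelism through the
# OpenAI-compatible API server (model path taken from later in this thread).
python -m vllm.entrypoints.openai.api_server \
    --model /data/big_model/gemma-2-27b-it \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1
# Depending on the version, --distributed-executor-backend ray may also be
# needed, since the multiprocessing backend rejects pipeline parallelism in
# the error shown later in this thread.
```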

orderer0001 commented 4 months ago

> you can use --pipeline-parallel-size 3, see https://docs.vllm.ai/en/latest/serving/distributed_serving.html

Thank you for your guidance. Should I set the parameter pipeline-parallel-size to 3? Should tensor_parallel_size also be set to 3?

youkaichao commented 4 months ago

To be specific, it is --pipeline-parallel-size 3 --tensor-parallel-size 1; the latter can be omitted since it is the default.

orderer0001 commented 4 months ago

After setting pipeline-parallel-size, an error is reported:

```
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/data/logs/drone-exec/envir6767/model_lib.py", line 263, in load_with_engine
    self.engine = LLMEngine.from_engine_args(EngineArgs.from_cli_args(self.args))
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 385, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 670, in create_engine_config
    parallel_config = ParallelConfig(
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 698, in __init__
    self._verify_args()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 704, in _verify_args
    raise NotImplementedError("Pipeline parallelism is not supported "
NotImplementedError: Pipeline parallelism is not supported yet with multiprocessing.
```

Related code (written into my own loading module):

```python
if 'gemma' in self.name.lower():
    print("The model is gemma")
    self.args.pipeline_parallel_size = 3
    self.args.tensor_parallel_size = 1
    self.engine = LLMEngine.from_engine_args(EngineArgs.from_cli_args(self.args))
```

orderer0001 commented 4 months ago

Even if I follow the instructions exactly as in the document, it still won’t work.

```python
from vllm import LLM
llm = LLM('/data/big_model/gemma-2-27b-it', pipeline_parallel_size=3)
```

```
INFO 07-20 13:05:45 config.py:695] Defaulting to use mp for distributed inference
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 150, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 385, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 670, in create_engine_config
    parallel_config = ParallelConfig(
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 698, in __init__
    self._verify_args()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 704, in _verify_args
    raise NotImplementedError("Pipeline parallelism is not supported "
NotImplementedError: Pipeline parallelism is not supported yet with multiprocessing.
```

youkaichao commented 4 months ago

it is a new feature, try to follow https://docs.vllm.ai/en/latest/getting_started/installation.html to install the latest main, or wait for the next release.
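
For context, the source install described in that guide is roughly the following; build prerequisites (CUDA toolchain, compiler) are covered in the linked page and may change between releases:

```bash
# Sketch: install vLLM from the latest main branch, per the linked
# installation guide (a full build from source; prerequisites apply).
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```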

orderer0001 commented 4 months ago

> it is a new feature, try to follow https://docs.vllm.ai/en/latest/getting_started/installation.html to install the latest main, or wait for the next release.

Is it possible to set the parameter distributed_executor_backend? After I set it:

```python
from vllm import LLM
llm = LLM('/data/big_model/gemma-2-27b-it', distributed_executor_backend="ray", pipeline_parallel_size=3)
```

I got another error:

```
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 150, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 385, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 670, in create_engine_config
    parallel_config = ParallelConfig(
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 698, in __init__
    self._verify_args()
  File "/root/anaconda3/envs/guihun_doc_aigc/lib/python3.10/site-packages/vllm/config.py", line 704, in _verify_args
    raise NotImplementedError("Pipeline parallelism is not supported "
NotImplementedError: Pipeline parallelism is not supported yet with multiprocessing.
```

youkaichao commented 4 months ago

please give a minimal reproducible example with full log.

youkaichao commented 4 months ago

Oh, one thing to note: pipeline_parallel_size is not supported in LLM. You need to use it through the OpenAI API server. Please carefully read the doc https://docs.vllm.ai/en/latest/serving/distributed_serving.html .
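
In other words, with pipeline parallelism the model is started through the API server (as in the launch sketch earlier in this thread) and then queried over HTTP rather than through the in-process LLM class. A minimal request sketch, assuming the default server address and that the model name matches the path the server was launched with:

```bash
# Sketch: query a running vLLM OpenAI-compatible server (default
# http://localhost:8000) for a completion.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/data/big_model/gemma-2-27b-it",
          "prompt": "Hello, my name is",
          "max_tokens": 32
        }'
```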

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!