vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Data parallel inference #1237

Closed kevinhu closed 2 months ago

kevinhu commented 1 year ago

Is there a recommended way to run data parallel inference (i.e. a copy of the model on each GPU)? It's possible by hacking CUDA_VISIBLE_DEVICES, but I was wondering if there's a cleaner method.

import multiprocessing
import os

from vllm import LLM, SamplingParams

def worker(worker_idx):
    # Pin this worker to a single GPU before the model is created.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_idx)
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)
    # Return plain strings so the results can be pickled back to the parent.
    return [output.outputs[0].text for output in outputs]

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
viktor-ferenczi commented 1 year ago

This approach should result in a more scalable (and maybe also cleaner) architecture:

Run a vLLM API server for each GPU, each serving on a different port. Then use those API endpoints to schedule generations on the vLLM backends from a centralized process. If you want to do this from Python, try the vllm-client package (also installable with pip); it supports async, which keeps your logic simpler.

It also allows restarting your "control" process during development (or on upgrades) without having to reload the model into the vLLM backends.

There is a slight overhead from going through the API, but it is amortized by running generations in parallel.

Make sure to keep the vLLM backend processes alive. Restart them if they crash or if they repeatedly fail client requests.
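
For concreteness, here is a minimal sketch of launching one server per GPU from Python. The entrypoint module and flags are assumptions based on the standard vLLM API server; adapt them to your installed version:

import os
import subprocess

MODEL = "facebook/opt-125m"
NUM_GPUS = 4
BASE_PORT = 8000

# Start one vLLM API server per GPU, each bound to its own port.
servers = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    servers.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.api_server",
         "--model", MODEL, "--port", str(BASE_PORT + gpu)],
        env=env,
    ))

# A central "control" process can now round-robin requests across
# http://localhost:8000 ... http://localhost:8003 (e.g. via vllm-client)
# and restart any backend that stops responding.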

viktor-ferenczi commented 1 year ago

Feature request: Allow for data-parallel execution on multiple (sets of) GPUs with the same model, served from the same API, so no external scheduler is required.

brucechin commented 1 year ago

Quoting @viktor-ferenczi above: "Run a vLLM API server for each GPU, serving on different ports. Then use those API endpoints to schedule generations on the vLLM backends from a centralized process. [...] Make sure to keep the vLLM backend processes alive. Restart them if they crash or if they repeatedly fail client requests."

Hi @viktor-ferenczi, @LiuXiaoxuanPKU assigned this issue to me to offload some work. After reading your comments, I think I can implement data-parallel inference on multiple GPUs with the same model along the lines of your suggestion above:

  1. A centralized scheduler process starts multiple local vLLM API servers, one per GPU, each on a different port.
  2. Implement scheduling policies, and restart an API server whenever the scheduler detects a failure.
  3. If possible, support a multi-server setup where each server uses multiple GPUs, to further improve scalability.

I plan to add a new class, DataParallelScheduler, which can start multiple vLLM API servers, manage them, and schedule incoming requests behind the same generate interface. In vllm/entrypoints/api_server.py, I would add an option for data-parallel inference; when enabled, instead of starting engine = AsyncLLMEngine.from_engine_args(engine_args), we would initialize a DataParallelScheduler instance to serve requests.
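
As a rough illustration only (the class and its methods are hypothetical, not existing vLLM APIs), the scheduler could look something like this, assuming each backend exposes the API server's /generate endpoint:

import itertools

import aiohttp


class DataParallelScheduler:
    """Hypothetical sketch: round-robin requests over per-GPU vLLM API servers."""

    def __init__(self, server_urls):
        # e.g. ["http://localhost:8000", "http://localhost:8001", ...]
        self._urls = itertools.cycle(server_urls)

    async def generate(self, prompt, **sampling_kwargs):
        # A real implementation would add health checks, retries,
        # and restarting of failed backends.
        url = next(self._urls)
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{url}/generate", json={"prompt": prompt, **sampling_kwargs}
            ) as resp:
                return await resp.json()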

I will ensure that my change does not affect the existing execution flow when the data-parallel inference option is disabled. I will also add tests to check the robustness of the new scheduler.

Please let me know if I am missing anything here. I would like to add support for this feature in my free time.

viktor-ferenczi commented 1 year ago

I'm not 100% sure this functionality belongs in the vLLM engine project itself, because it is only a layer on top of it. Using an existing external tool/framework to verify service health, configured to restart the vLLM instances when required, might be enough. All it needs is to run a short generation as a health check once a minute (for example), so broken or frozen processes can be identified and restarted automatically.
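
As a sketch of that idea (the endpoint and request parameters are assumptions; the actual restart mechanism is left to whatever supervisor you use):

import time

import requests

SERVERS = [f"http://localhost:{8000 + i}" for i in range(4)]

def is_healthy(url):
    # Run a tiny generation as a liveness probe.
    try:
        resp = requests.post(
            f"{url}/generate",
            json={"prompt": "ping", "max_tokens": 1},
            timeout=30,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False

while True:
    for url in SERVERS:
        if not is_healthy(url):
            # Restart the corresponding vLLM process here
            # (e.g. via systemd, supervisord, or your own launcher).
            print(f"{url} failed the health check; restart it")
    time.sleep(60)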

Feel free to go ahead and make a PR for the solution you described.

SunLemuria commented 11 months ago

I think fastchat supports this feature: fastchat scalability

anisingh1 commented 9 months ago

Hi @brucechin, are you working on implementing this request, or has it been deferred?

AjayP13 commented 9 months ago

This is possible with our DataDreamer package, which can load vLLM in parallel (different models on different GPUs). It does this by always instantiating vLLM in a background process and communicating with it. See ParallelLLM in the package for wrapping multiple VLLM objects under a single LLM object.

andakai commented 8 months ago

Quoting @viktor-ferenczi's suggestion and @brucechin's plan above (a DataParallelScheduler that starts, manages, and load-balances across multiple per-GPU vLLM API servers).

Hi @brucechin, how is this work going? I am fascinated by this idea.

AmoghM commented 6 months ago

+1 for this feature.

zemerov commented 6 months ago

+1 for this feature.

WanBenLe commented 6 months ago

+1 for this feature. DataDreamer does not seem to improve the inference speed of a single model replicated (model copies) across multiple GPUs.

ifromeast commented 5 months ago

+1 for this feature. @WoosukKwon

GritLs commented 5 months ago

+1 for this feature

mangomatrix commented 5 months ago

need this feature too.

kota-iizuka commented 5 months ago

There is an example of data-parallel inference in https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_distributed.py (changing num_instances also controls the CUDA_VISIBLE_DEVICES environment variable appropriately).

On the other hand, since the above example covers batch inference, I think there is still a need for an online inference method (with proper load balancing) and a simple method for running multiple models in parallel. (It is probably possible to achieve this with a single script in the examples/ directory, but it is important to make it easy to use.)
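
For reference, a simplified sketch of the Ray Data pattern that example uses (a rough approximation only; the actual script and the exact map_batches signature may differ depending on your Ray version):

import ray
from vllm import LLM, SamplingParams


class LLMPredictor:
    def __init__(self):
        # One replica per GPU; Ray assigns CUDA_VISIBLE_DEVICES to each actor.
        self.llm = LLM(model="facebook/opt-125m")
        self.params = SamplingParams(temperature=0.8, top_p=0.95)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["text"]), self.params)
        batch["generated"] = [o.outputs[0].text for o in outputs]
        return batch


ds = ray.data.from_items([{"text": "The future of AI is"}] * 100)
ds = ds.map_batches(
    LLMPredictor,
    concurrency=4,  # number of model replicas (the "num_instances" knob)
    num_gpus=1,     # GPUs per replica
    batch_size=16,
)
ds.show(3)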

shizhediao commented 2 months ago

+1 for this feature.

zhaochenyang20 commented 2 months ago

https://github.com/zhaochenyang20/ModelServer

Could you please check this? It is one I wrote locally.

youkaichao commented 2 months ago

I'm going to close this issue, as vLLM does not plan to support this.

Users should seek third-party support (which should be pretty easy to set up), e.g.:

https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

or the solutions mentioned in the above discussions.