vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

pipeline parallel support in the future? #387

Closed: irasin closed this issue 1 month ago

irasin commented 1 year ago

I wonder, will you support pipeline parallelism in the future? If the answer is yes, maybe the whole system needs to be redesigned?

KimmiShi commented 1 year ago

Marking this. Is pipeline parallelism more efficient than tensor parallelism for inference?

irasin commented 1 year ago

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency, since the GEMM sizes in the model become smaller than in the full-model version, but it needs the communication cost to be small enough. For devices with NVLink, I do think tensor parallelism is more efficient than pipeline parallelism.

KimmiShi commented 1 year ago

@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency, since the GEMM sizes in the model become smaller than in the full-model version

Thanks, the e2e latency point of view is interesting.

esaliya commented 1 year ago

parallel_state.py shows pipeline groups being created, but is pipeline scheduling not supported yet?

irasin commented 1 year ago

Is there any progress on pipeline parallelism now?

irasin commented 10 months ago

Hi @WoosukKwon, I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes (a rough sketch follows the list):

  1. The load-weight and forward functions of each model need to support different pipeline stages.
  2. The worker needs to determine its inputs and outputs according to its pipeline stage.
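
As a rough illustration of what those two changes mean (a minimal sketch with hypothetical class and dimension names, not the code in the fork): each worker builds only the layers of its own pipeline stage, the first stage additionally owns the embedding, and the last stage owns the LM head.

    import torch
    import torch.nn as nn

    class PipelineStageModel(nn.Module):
        """Toy decoder split across pipeline stages (illustration only)."""

        def __init__(self, num_layers: int, stage: int, num_stages: int, hidden: int = 4096):
            super().__init__()
            # Each stage owns a contiguous slice of the transformer layers,
            # so load_weights() only needs to load this slice.
            per_stage = num_layers // num_stages
            self.first = stage == 0
            self.last = stage == num_stages - 1
            self.embed = nn.Embedding(32000, hidden) if self.first else None
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(hidden, 32, batch_first=True)
                for _ in range(per_stage)
            )
            self.lm_head = nn.Linear(hidden, 32000, bias=False) if self.last else None

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The first stage takes token ids; later stages take the previous
            # stage's hidden states (sent/received by the worker between stages).
            if self.first:
                x = self.embed(x)
            for layer in self.layers:
                x = layer(x)
            if self.last:
                x = self.lm_head(x)  # logits are produced only on the last stage
            return x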

But a big problem currently is that, in a forward step, if we have three pipeline stages, the workers of stage 1 and stage 2 must block until the workers of stage 3 complete inference, which wastes a lot of time.

Can you take a look at the code and give some comments?

learninmou commented 10 months ago

Can vllm support pipeline parallelism with multiple nodes?

irasin commented 10 months ago

Can vllm support pipeline parallelism with multiple nodes?

Hi @learninmou, I'm not familiar with Ray's multi-node support, but I think it should be easy to add multi-node, multi-GPU TP/PP support. A common practice for very large models is to use inter-node PP and intra-node TP.

lapp0 commented 10 months ago

Could someone please help me understand what is missing for pipeline parallel support? There is apparently dead code in parallel_state.py, which is blocked by an exception in config.py:

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/parallel_utils/parallel_state.py

        pipeline_model_parallel_size: number of GPUs used for pipeline model
            parallelism.

    Let's say we have a total of 8 GPUs denoted by g0 ... g7 and we
    use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
    the model pipeline. The present function will
    create 4 tensor model-parallel groups and 2 pipeline model-parallel groups:
        4 tensor model-parallel groups:
            [g0, g1], [g2, g3], [g4, g5], [g6, g7]
        2 pipeline model-parallel groups:
            [g0, g2, g4, g6], [g1, g3, g5, g7]
    Note that for efficiency, the caller should make sure adjacent ranks
    are on the same DGX box. For example if we are using 2 DGX-1 boxes
    with a total of 16 GPUs, rank 0 to 7 belong to the first box and
    ranks 8 to 15 belong to the second box.
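
For reference, the grouping in that docstring example can be reproduced with a few lines (a standalone sketch, not vLLM's actual implementation; it assumes world_size == tp_size * pp_size, i.e. no data parallelism):

    # Reproduce the layout from the docstring: 8 GPUs, TP=2, PP=4.
    world_size, tp_size, pp_size = 8, 2, 4

    # Tensor-parallel groups: consecutive ranks share a group.
    tp_groups = [list(range(i, i + tp_size)) for i in range(0, world_size, tp_size)]
    # Pipeline-parallel groups: ranks with the same offset within a TP group.
    pp_groups = [list(range(i, world_size, tp_size)) for i in range(tp_size)]

    print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(pp_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]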

https://github.com/vllm-project/vllm/blob/cb3f30c600169210f9715f084e34adf2afc4f7d7/vllm/config.py#L340

        if self.pipeline_parallel_size > 1:
            raise NotImplementedError(
                "Pipeline parallelism is not supported yet.")

lapp0 commented 10 months ago

https://github.com/huggingface/transformers/issues/13690

Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models because they include many features that interfere with implementing PP - due to the requirements:

  • for the model to be nn.Sequential and
  • inputs/outputs to be simple tensors with the first dimension of batch size.

So to implement PP we will most likely have to fork each model, strip the unnecessary for scalability features and only then be able to implement PP.

https://huggingface.co/docs/transformers/v4.15.0/parallelism

Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. ...

Problems with traditional Pipeline API solutions:

  • have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a nn.Sequential sequence of the same, which may require changes to the design of the model.
  • currently the Pipeline API is very restricted. If you had a bunch of python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693
  • conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
  • have to arrange each layer so that the output of one model becomes an input to the other model. ...

🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive PP support. The main obstacle is being unable to convert the models to nn.Sequential and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that.

Other approaches:

DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline

Additionally, TensorRT-LLM has a pipeline parallel implementation (for their C++ backend).
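
To make the micro-batching point above concrete (and why the blocking approach discussed earlier wastes GPU time), here is a toy schedule, not tied to any particular framework: with one monolithic batch only one stage is busy at a time, while with micro-batches the stages overlap once the pipeline fills.

    # Toy forward-only pipeline schedule: 3 stages, 3 micro-batches.
    num_stages, num_microbatches = 3, 3
    timeline = {}
    for m in range(num_microbatches):
        for s in range(num_stages):
            # Micro-batch m reaches stage s at time step m + s.
            timeline.setdefault(m + s, []).append(f"stage{s}:m{m}")

    for t in sorted(timeline):
        print(t, timeline[t])
    # 0 ['stage0:m0']
    # 1 ['stage1:m0', 'stage0:m1']
    # 2 ['stage2:m0', 'stage1:m1', 'stage0:m2']
    # 3 ['stage2:m1', 'stage1:m2']
    # 4 ['stage2:m2']
    # With a single unchunked batch, the same work takes 3 fully serial steps,
    # with two of the three stages idle at every step.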

Lvjinhong commented 9 months ago

Hi @WoosukKwon, I have added blocking-style pipeline parallelism for LLaMA in my personal fork: https://github.com/irasin/vllm/tree/support_pp

Supporting a model with pipeline parallelism requires the following changes:

  1. The load-weight and forward functions of each model need to support different pipeline stages.
  2. The worker needs to determine its inputs and outputs according to its pipeline stage.

But a big problem currently is that, in a forward step, if we have three pipeline stages, the workers of stage 1 and stage 2 must block until the workers of stage 3 complete inference, which wastes a lot of time.

Can you take a look at the code and give some comments?

I wanted to ask about the current state of your personal fork. Is it working correctly at the moment? Have the issues you encountered been resolved? Additionally, I'm curious whether you've run any tests to assess the actual effectiveness of pipeline parallelism.

For your information, my setup consists of 8 A800 PCIe GPUs, and I am running the Llama 70B model.

Additionally, in my tensor-parallelism tests I observed that throughput is higher with eight GPUs than with four. This outcome puzzles me, since communication over PCIe is generally quite expensive, and I believe pipeline parallelism would be more efficient for my needs.

irasin commented 9 months ago

@Lvjinhong, the PP only works for LLaMA, because I haven't had time to adapt the other models. My implementation is blocking PP across workers, so its performance is worse than TP.

For your setup, it is possible for TP=8 to perform better than TP=4: with a larger TP size, the GEMM size on each device is smaller, and if the time saved on the smaller GEMMs exceeds the extra all-reduce time, the final latency goes down.

I do not recommend using PP here; my original goal was the case where the device count is odd, e.g. 3 GPUs, which cannot run TP.
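
A back-of-envelope way to see the TP=4 vs TP=8 tradeoff described above (a rough model with made-up numbers, not measurements; it also ignores that all-reduce cost grows with participant count and message size):

    # Rough per-layer latency model for tensor parallelism (illustrative only).
    def per_layer_latency_us(tp: int, gemm_us_tp1: float, allreduce_us: float) -> float:
        # GEMM work is split across tp GPUs; each TP layer then needs an all-reduce.
        return gemm_us_tp1 / tp + (allreduce_us if tp > 1 else 0.0)

    # Hypothetical numbers: 400 us of GEMM per layer on one GPU,
    # ~80 us per all-reduce over PCIe, ~15 us over NVLink.
    for tp in (1, 2, 4, 8):
        pcie = per_layer_latency_us(tp, 400.0, 80.0)
        nvlink = per_layer_latency_us(tp, 400.0, 15.0)
        print(f"tp={tp}: PCIe ~{pcie:.0f} us, NVLink ~{nvlink:.0f} us")
    # tp=1: PCIe ~400 us, NVLink ~400 us
    # tp=2: PCIe ~280 us, NVLink ~215 us
    # tp=4: PCIe ~180 us, NVLink ~115 us
    # tp=8: PCIe ~130 us, NVLink ~65 us

As long as the per-GPU GEMM time saved exceeds the extra communication time, the larger TP degree still wins even over PCIe, which matches the observation above.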

lapp0 commented 9 months ago

Did a bit more digging into reference pipeline parallel implementations and tried to interpret how each works.

The DeepSpeed option seems much cleaner and more generic to me.

DeepSpeed (there are a few examples using deepspeed.pipe.PipelineModule)

Method

Docs: https://deepspeed.readthedocs.io/en/latest/pipeline.html

Modules to be parallelized with pipeline parallelism.

The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers; the key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential.

class deepspeed.pipe.LayerSpec(typename, *module_args, **module_kwargs): Building block for specifying pipeline-parallel modules.

LayerSpec stores the type information and parameters for each stage in a PipelineModule. For example:

    nn.Sequential(
        torch.nn.Linear(self.in_dim, self.hidden_dim, bias=False),
        torch.nn.Linear(self.hidden_dim, self.out_dim),
    )

becomes

    layer_specs = [
        LayerSpec(torch.nn.Linear, self.in_dim, self.hidden_dim, bias=False),
        LayerSpec(torch.nn.Linear, self.hidden_dim, self.out_dim),
    ]
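
For context, the LayerSpec list is then handed to a deepspeed.pipe.PipelineModule, roughly as below (a minimal sketch based on the DeepSpeed docs; it assumes the distributed environment can be initialized, and the dimensions are placeholders):

    import torch
    import deepspeed
    from deepspeed.pipe import LayerSpec, PipelineModule

    deepspeed.init_distributed()  # a process group is required before building a PipelineModule

    in_dim, hidden_dim, out_dim = 1024, 4096, 1024
    layer_specs = [
        LayerSpec(torch.nn.Linear, in_dim, hidden_dim, bias=False),
        LayerSpec(torch.nn.ReLU),
        LayerSpec(torch.nn.Linear, hidden_dim, out_dim),
    ]

    # Each pipeline stage instantiates only the layers assigned to it,
    # so the full model never has to be materialized on a single GPU.
    model = PipelineModule(layers=layer_specs, num_stages=2)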

Alternatively there is Together.AI's OpenChatKit

Method:

hmellor commented 1 month ago

Pipeline parallelism is supported now: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
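
For anyone landing on this issue now, TP and PP can be combined through the engine arguments; a minimal offline example follows (the model name and GPU counts are placeholders, and exact behavior depends on your vLLM version, so check the linked docs):

    from vllm import LLM, SamplingParams

    # Example: 2 pipeline stages x 4-way tensor parallel = 8 GPUs total.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        tensor_parallel_size=4,
        pipeline_parallel_size=2,
    )

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

The OpenAI-compatible server exposes the same knobs as --tensor-parallel-size and --pipeline-parallel-size.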