Closed irasin closed 1 month ago
Mark. Is pipeline parallel more efficient than Tensor Parallel in inference?
@KimmiShi, it depends on the GPU environment you use. Tensor parallelism can reduce the total end-to-end latency because each GPU runs smaller GEMMs than it would with the full model, but this only pays off if the communication cost is small enough. For devices that support NVLink, I do think tensor parallelism is more efficient than pipeline parallelism.
Thanks, the e2e latency point of view is interesting.
parallel_state.py shows pipeline groups being created, but is pipeline scheduling not supported yet?
Is there any progress on pipeline parallelism now?
Can vLLM support pipeline parallelism across multiple nodes?
Hi, @learninmou, I'm not familiar with Ray's multi-node support, but I think it should be easy to add multi-node, multi-device TP/PP support. A common practice for very large models is to use PP across nodes and TP within each node.
Could someone please help me understand what is missing for pipeline parallelism? There appears to be dead code in parallel_state.py, which is blocked by an exception in config.py.
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/parallel_utils/parallel_state.py
pipeline_model_parallel_size: number of GPUs used for pipeline model
parallelism.
Let's say we have a total of 8 GPUs denoted by g0 ... g7 and we
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
the model pipeline. The present function will
create 4 tensor model-parallel groups and 2 pipeline model-parallel groups:
4 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7]
2 pipeline model-parallel groups:
[g0, g2, g4, g6], [g1, g3, g5, g7]
Note that for efficiency, the caller should make sure adjacent ranks
are on the same DGX box. For example if we are using 2 DGX-1 boxes
with a total of 16 GPUs, rank 0 to 7 belong to the first box and
ranks 8 to 15 belong to the second box.
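To make the layout in that docstring concrete, here is a minimal sketch that reproduces the 8-GPU example in plain Python. The function name `build_parallel_groups` is hypothetical and this is not the code from parallel_state.py; it only assumes that tensor groups are contiguous rank blocks and pipeline groups are strided, which is what the example above shows.

```python
def build_parallel_groups(world_size, tensor_parallel_size, pipeline_parallel_size):
    """Return (tensor_groups, pipeline_groups) as lists of rank lists.

    Hypothetical helper for illustration only: it mirrors the layout described
    in the docstring, not the actual vLLM implementation.
    """
    if world_size != tensor_parallel_size * pipeline_parallel_size:
        raise ValueError("world_size must equal tp_size * pp_size")

    num_tensor_groups = world_size // tensor_parallel_size
    num_pipeline_groups = world_size // pipeline_parallel_size

    # Tensor-parallel groups are blocks of adjacent ranks.
    tensor_groups = [
        list(range(i * tensor_parallel_size, (i + 1) * tensor_parallel_size))
        for i in range(num_tensor_groups)
    ]
    # Pipeline-parallel groups take every num_pipeline_groups-th rank.
    pipeline_groups = [
        list(range(i, world_size, num_pipeline_groups))
        for i in range(num_pipeline_groups)
    ]
    return tensor_groups, pipeline_groups


tensor_groups, pipeline_groups = build_parallel_groups(8, 2, 4)
print(tensor_groups)    # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(pipeline_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Keeping tensor groups on adjacent ranks is what lets the caller place each tensor group inside one box, as the note about DGX boxes suggests.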
    if self.pipeline_parallel_size > 1:
        raise NotImplementedError(
            "Pipeline parallelism is not supported yet.")
https://github.com/huggingface/transformers/issues/13690
Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models because they include many features that interfere with implementing PP - due to the requirements:
- for the model to be nn.Sequential and
- inputs/outputs to be simple tensors with the first dimension of batch size.
So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.
https://huggingface.co/docs/transformers/v4.15.0/parallelism
Pipeline Parallel (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. ...
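As a rough illustration of the micro-batching idea in that paragraph, the sketch below splits a batch into micro-batches and prints a forward-only schedule. The helper names `chunk` and `forward_schedule` are made up for this example, and it assumes every stage takes one tick per micro-batch, which real models will not match exactly.

```python
def chunk(batch, num_micro_batches):
    """Split a list into num_micro_batches contiguous, nearly equal chunks."""
    size, extra = divmod(len(batch), num_micro_batches)
    chunks, start = [], 0
    for i in range(num_micro_batches):
        end = start + size + (1 if i < extra else 0)
        chunks.append(batch[start:end])
        start = end
    return chunks


def forward_schedule(num_stages, num_micro_batches):
    """schedule[t][s] is the micro-batch index stage s runs at tick t, or None if idle."""
    total_ticks = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_ticks):
        row = []
        for s in range(num_stages):
            m = t - s  # stage s sees micro-batch m exactly s ticks after stage 0
            row.append(m if 0 <= m < num_micro_batches else None)
        schedule.append(row)
    return schedule


micro_batches = chunk(list(range(8)), 4)
print(micro_batches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

for tick, row in enumerate(forward_schedule(num_stages=4, num_micro_batches=4)):
    print(tick, row)
```

With 4 stages and 4 micro-batches the schedule has 4 + 4 - 1 = 7 ticks, and the `None` entries at the start and end are the remaining idle slots (the pipeline "bubble").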
Problems with traditional Pipeline API solutions:
- have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a nn.Sequential sequence of the same, which may require changes to the design of the model.
- currently the Pipeline API is very restricted. If you had a bunch of python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693
- conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
- have to arrange each layer so that the output of one model becomes an input to the other model. ...
🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive PP support. The main obstacle is being unable to convert the models to nn.Sequential and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that.
Other approaches:
DeepSpeed, Varuna and SageMaker use the concept of an Interleaved Pipeline
Additionally, TensorRT-LLM has a pipeline parallel implementation (for their C++ backend).
Hi, @WoosukKwon, I have added blocking-type pipeline parallelism for llama in my personal fork, https://github.com/irasin/vllm/tree/support_pp
Supporting a model with pipeline parallelism requires the following changes:
- The load weight and forward functions of each model need to support different pipeline stages.
- The worker needs to determine the input and output according to the pipeline stage.
But there is currently a big problem: in a forward step, if we have three pipeline stages, the workers of stage1 and stage2 need to block until the workers of stage3 complete the inference, which wastes a lot of time.
Can you take a look at the code and give some comments?
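For readers following along, the two changes listed above could look roughly like the sketch below. It is not taken from the fork; `layers_for_stage` and `stage_io` are hypothetical names, and the even split of layers across stages is an assumption.

```python
def layers_for_stage(num_layers, num_stages, stage):
    """Return the range of layer indices a given pipeline stage would own.

    Hypothetical helper: layers are split as evenly as possible, with the
    earlier stages taking one extra layer when the split is uneven.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage out of range")
    base, extra = divmod(num_layers, num_stages)
    start = stage * base + min(stage, extra)
    end = start + base + (1 if stage < extra else 0)
    return range(start, end)


def stage_io(num_stages, stage):
    """Describe what a worker at this stage receives and what it emits."""
    is_first = stage == 0
    is_last = stage == num_stages - 1
    received = "token_ids" if is_first else "hidden_states"
    emitted = "logits" if is_last else "hidden_states"
    return received, emitted


# An 80-layer model on 3 stages (the odd-GPU-count case mentioned in this thread).
for stage in range(3):
    layers = layers_for_stage(80, 3, stage)
    print(stage, (layers.start, layers.stop), stage_io(3, stage))
```

Each stage would then load only the weights for its own layer range, and only the first and last stages need the embedding and output head respectively.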
I wanted to ask about the current state of your personal fork. Is it functioning correctly at the moment? Have the issues you encountered been resolved? Additionally, I'm curious whether you've run any tests to assess the actual effectiveness of pipeline parallelism.
For your information, my setup consists of 8 A800 PCIE GPUs, and I am running the llama 70b model.
Additionally, in my tests involving tensor parallelism, I observed that the throughput is higher with eight GPUs compared to four. This outcome puzzles me, as generally, the communication cost over PCIe is quite high. And I believe that pipeline parallelism would be more efficient for my needs.
@Lvjinhong, the PP works only for llama because I have had no time to adapt other models. My implementation is blocking PP across workers, so the performance is worse than TP.
For your setup, it's possible that tp8 performs better than tp4. With a larger TP size, the GEMM size on each device is smaller. If the time saved by the smaller GEMMs is greater than the time added by all-reduce, the final latency becomes smaller.
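A toy latency model makes the trade-off easier to see. Everything below is an assumption for illustration: the function name `step_latency` is hypothetical, GEMM time is assumed to scale perfectly with TP size, and the millisecond values are made-up numbers, not measurements from any GPU.

```python
def step_latency(gemm_time_single_gpu, tp_size, allreduce_time):
    """Toy model: per-step latency = GEMM time split across GPUs + all-reduce cost."""
    return gemm_time_single_gpu / tp_size + allreduce_time


# Illustrative numbers only (milliseconds), not benchmark results.
gemm_ms = 80.0
latency_tp4 = step_latency(gemm_ms, tp_size=4, allreduce_time=4.0)
latency_tp8 = step_latency(gemm_ms, tp_size=8, allreduce_time=6.0)

print(latency_tp4)  # 24.0
print(latency_tp8)  # 16.0

# tp8 wins in this toy case because the GEMM time saved (10 ms)
# exceeds the extra all-reduce time (2 ms).
print(latency_tp8 < latency_tp4)  # True
```

If the all-reduce cost grew by more than the GEMM time saved, for example on a slow interconnect, the same model would favor tp4.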
I do not recommend using PP here, since my original goal was the case where the number of devices is odd, such as 3 GPUs, which cannot run TP.
Did a bit more digging for some more reference pipeline parallel implementations, and tried to interpret how each works.
The deepspeed option seems much cleaner and more generic to me.
DeepSpeed (deepspeed.pipe.PipelineModule)
Method:
- PipelineModule automatically manages data flows
- LayerSpecs
Docs: https://deepspeed.readthedocs.io/en/latest/pipeline.html
Modules to be parallelized with pipeline parallelism.
The key constraint that enables pipeline parallelism is the representation of the forward pass as a sequence of layers and the enforcement of a simple interface between them. The forward pass is implicitly defined by the module layers. The key assumption is that the output of each layer can be directly fed as input to the next, like a torch.nn.Sequential. The forward pass is implicitly:
class deepspeed.pipe.LayerSpec(typename, *module_args, **module_kwargs)
Building block for specifying pipeline-parallel modules.
LayerSpec stores the type information and parameters for each stage in a PipelineModule. For example:
    nn.Sequential(
        torch.nn.Linear(self.in_dim, self.hidden_dim, bias=False),
        torch.nn.Linear(self.hidden_dim, self.out_dim)
    )
becomes
    layer_specs = [
        LayerSpec(torch.nn.Linear, self.in_dim, self.hidden_dim, bias=False),
        LayerSpec(torch.nn.Linear, self.hidden_dim, self.out_dim)
    ]
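To show why deferring construction is useful, here is a small stand-in written in plain Python. It is not DeepSpeed's LayerSpec; `LayerSpecSketch`, `Scale`, `Shift`, `build_stage` and `run_layers` are hypothetical names, and the toy "layers" are simple callables rather than torch modules.

```python
class LayerSpecSketch:
    """Hypothetical stand-in: records a layer type and its arguments, builds it later."""

    def __init__(self, typename, *module_args, **module_kwargs):
        self.typename = typename
        self.module_args = module_args
        self.module_kwargs = module_kwargs

    def build(self):
        # The layer object only comes into existence when a stage asks for it.
        return self.typename(*self.module_args, **self.module_kwargs)


class Scale:
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor


class Shift:
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, x):
        return x + self.offset


def build_stage(specs, start, end):
    """Instantiate only the layers that belong to one pipeline stage."""
    return [spec.build() for spec in specs[start:end]]


def run_layers(layers, x):
    """Sequential forward: each layer's output feeds the next layer."""
    for layer in layers:
        x = layer(x)
    return x


specs = [
    LayerSpecSketch(Scale, 2),
    LayerSpecSketch(Shift, 1),
    LayerSpecSketch(Scale, factor=3),
    LayerSpecSketch(Shift, offset=-4),
]

stage0 = build_stage(specs, 0, 2)  # would live on the first device
stage1 = build_stage(specs, 2, 4)  # would live on the second device

hidden = run_layers(stage0, 5)       # (5 * 2) + 1 = 11
output = run_layers(stage1, hidden)  # (11 * 3) - 4 = 29
print(output)  # 29
```

Because the specs hold only types and arguments, each process can build just its own slice of the model instead of materializing every layer first.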
Method:
- pp_rank
- torch_recv_stream / torch_send_stream, which allows queueing up new workloads asynchronously
- pre_node_rank and post_node_rank
Pipeline parallel is supported now https://docs.vllm.ai/en/latest/serving/distributed_serving.html
I wonder whether you will support pipeline parallelism in the future? If the answer is yes, maybe the whole system needs to be redesigned?