vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

What's up with Pipeline Parallelism? #3314

Open duanzhaol opened 4 months ago

duanzhaol commented 4 months ago

Hey vllm team,

Hope you're all doing great! I'm focusing on pipeline-parallel inference and I hope it can be supported in vLLM.

I noticed that pipeline parallelism was on the old roadmap (#244), but it's not on the new roadmap (#2681). Just curious, was there a specific reason you guys decided to skip it for now? Challenges with the implementation, or maybe it just didn't fit into the grand scheme of things at the moment?

Would love to get any insights or thoughts you have on this. I'm really looking forward to seeing where you take vllm next!

simon-mo commented 4 months ago

Currently we observe that the performance of tensor parallelism is more desirable than pipeline parallelism. Due to a lack of bandwidth on our side, we dropped it from the current roadmap. We still welcome contributions!

duanzhaol commented 4 months ago

> Currently we observe that the performance of tensor parallelism is more desirable than pipeline parallelism. Due to a lack of bandwidth on our side, we dropped it from the current roadmap. We still welcome contributions!

Thanks, I believe that pipeline parallelism may offer improved throughput compared to tensor parallelism, albeit with a trade-off in latency. In certain situations, this approach could indeed be more practical. Additionally, I am currently working on implementing an asynchronous version of pipeline parallelism, and I can open a PR upon completion.
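
To illustrate the idea, here is a rough sketch of the kind of asynchronous schedule I have in mind: each pipeline stage pulls micro-batches from a queue, runs its layers, and immediately hands activations to the next stage, so all stages stay busy on different micro-batches. All names here are hypothetical, not vLLM internals.

```python
import asyncio

async def stage_worker(stage_id, run_stage, in_q, out_q):
    # Pull micro-batches, run this stage's layers, hand activations onward.
    while True:
        micro_batch = await in_q.get()
        if micro_batch is None:          # sentinel: shut down and propagate
            await out_q.put(None)
            break
        await out_q.put(run_stage(stage_id, micro_batch))

async def run_pipeline(run_stage, num_stages, micro_batches):
    queues = [asyncio.Queue() for _ in range(num_stages + 1)]
    workers = [
        asyncio.create_task(stage_worker(s, run_stage, queues[s], queues[s + 1]))
        for s in range(num_stages)
    ]
    for mb in micro_batches:             # keep several micro-batches in flight
        await queues[0].put(mb)
    await queues[0].put(None)
    outputs = []
    while (out := await queues[-1].get()) is not None:
        outputs.append(out)
    for w in workers:
        await w
    return outputs

# Example: 4 stages, identity "compute", 8 micro-batches
# print(asyncio.run(run_pipeline(lambda s, x: x, 4, list(range(8)))))
```

This only shows the scheduling structure; in a real engine each stage would live on its own GPU/process and communication would overlap with compute. It also makes the trade-off visible: a single request still traverses every stage (higher latency), but aggregate throughput stays high because stages work on different micro-batches concurrently.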

rkooo567 commented 4 months ago

Our internal work shows that PP actually helps improve throughput in the prefill stage because of its low communication cost. I am excited to see the proposal!
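
For intuition on the communication-cost point, here is a rough back-of-envelope comparison (my own simplified model, not a vLLM measurement): Megatron-style TP pays roughly two all-reduces per layer on the full activation, while PP only ships activations once per stage boundary.

```python
# Rough back-of-envelope: bytes moved per token during prefill for TP vs PP,
# assuming a generic decoder-only transformer, fp16 activations, and a ring
# all-reduce that moves ~2*(t-1)/t of the message per GPU. Not vLLM numbers.

def tp_bytes_per_token(hidden_size, num_layers, tp_degree, dtype_bytes=2):
    # ~2 all-reduces per layer (after attention and after the MLP).
    msg = hidden_size * dtype_bytes
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * msg
    return 2 * num_layers * per_allreduce

def pp_bytes_per_token(hidden_size, num_stages, dtype_bytes=2):
    # One point-to-point activation send per stage boundary.
    return (num_stages - 1) * hidden_size * dtype_bytes

# Example: Llama-2-70B-ish shapes (hidden 8192, 80 layers) split across 4 GPUs.
print(tp_bytes_per_token(8192, 80, 4))   # ~3.9 MB of traffic per token
print(pp_bytes_per_token(8192, 4))       # ~48 KB of traffic per token
```

Under these assumptions PP moves orders of magnitude less data per prefill token, which is why it can pay off when interconnect bandwidth, not compute, is the bottleneck.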