vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model #4632

Closed — cadedaniel closed this issue 3 months ago

cadedaniel commented 5 months ago

Overview

Speculative decoding speeds up memory-bound LLMs by using a fast proposal method to propose tokens that the larger LLM then verifies in a single forward pass. Papers report 2-3x speedups at bs=1; in Anyscale's fork we see up to a 2x speedup with a small draft model at bs=8 (30% at bs=16). We can improve this! See https://github.com/vllm-project/vllm/issues/4630 if you want to help.
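The accept/reject step described above can be sketched as a toy loop (this is an illustration of the general greedy-verification technique, not vLLM's actual implementation; `verify` and its arguments are hypothetical names):

```python
# Toy sketch of the speculative-decoding accept loop: the draft
# proposes k tokens, the target scores the whole window in one
# forward pass, and we keep the longest agreeing prefix plus one
# "bonus" token from the target.
def verify(proposed, target_greedy):
    """proposed: k draft tokens.
    target_greedy: k+1 tokens the target would emit at each position
    of the window (the last one is the bonus token)."""
    accepted = []
    for i, tok in enumerate(proposed):
        if tok != target_greedy[i]:
            break
        accepted.append(tok)
    # The target's own token at the first disagreement (or the bonus
    # token if everything matched) is always valid, so emit it too.
    accepted.append(target_greedy[len(accepted)])
    return accepted
```

Because every speculative step emits at least one target-verified token, quality is unchanged; the speedup comes from emitting several tokens per target forward pass when the draft agrees often.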

A key optimization for small draft models (in the 68m/160m range) is to run them at tensor-parallel degree 1, even when the target model uses tensor-parallel degree 4 or 8. In our fork, this reduces proposal time from 5ms/tok to 1.5ms/tok. This lets a well-aligned 68m draft model deliver a 2x per-user throughput improvement on a 70B target model.

Furthermore, a 1B/7B proposer model may be best placed on TP=2 or TP=4 while the larger model runs on TP=8. vLLM should support these configurations so the community can pick whichever one best fits their draft model.

Design suggestions

In our fork I implemented a Worker that patches the tensor-parallel group to TP=1. The code is dumped here. We should take the same approach in vLLM, but we can improve on it by using @youkaichao's tensor-parallel group improvements.
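The patching idea can be sketched roughly as follows. All class and method names here are hypothetical stand-ins, not vLLM's actual API: the draft worker swaps in a size-1 group so its collectives become no-ops, while the target model keeps the full TP group.

```python
# Hypothetical sketch of a draft worker whose tensor-parallel group
# is patched down to a single rank, so no cross-GPU communication
# happens per proposed token.
class SingleRankGroup:
    """Stands in for a tensor-parallel process group of size 1."""
    world_size = 1
    rank = 0

    def all_reduce(self, tensor):
        # With one rank there is nothing to reduce; return unchanged.
        return tensor


class DraftTP1Worker:
    def __init__(self, draft_model_fn):
        # Patched group used only while the draft model runs; the
        # target model's full TP group is left untouched.
        self.tp_group = SingleRankGroup()
        self.draft_model_fn = draft_model_fn

    def propose(self, hidden):
        # Run the draft model under the size-1 group, so the
        # "all_reduce" below is a local no-op rather than an NCCL call.
        out = self.draft_model_fn(hidden)
        return self.tp_group.all_reduce(out)
```

In the real system the patching would have to install and restore the group around each draft forward pass; the sketch only shows why TP=1 removes the per-token communication cost.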

youkaichao commented 5 months ago

I can work on this after a major refactor of distributed https://github.com/vllm-project/vllm/pull/4591 is landed.

wooyeonlee0 commented 4 months ago

@cadedaniel Can I contribute my code that already implements this feature on v0.4.2? I referred to your code in #2188.

I'm aware that #4933 is in progress, so I want to confirm that it's okay to proceed.

GeauxEric commented 4 months ago

@wooyeonlee0 pls go ahead.

cadedaniel commented 4 months ago

yep, my policy is to review the PRs in the order that they're initially ready for review. go ahead @wooyeonlee0 .

wooyeonlee0 commented 4 months ago

Thanks for the answer :) I'll send a PR maybe next week.