hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

Benchmarks comparing with Medusa #26

Open Rock-Anderson opened 12 months ago

Rock-Anderson commented 12 months ago

Thanks for the initial implementation. The speed-up results look great.

Just wondering, though: are there any stats or results comparing against Medusa? Medusa doesn't need a draft model either, and it also guesses/predicts future tokens at the current step. I understand that Medusa requires finetuning its extra heads to predict future tokens while Lookahead doesn't, but assuming the heads are already finetuned (a minimal cost given a frozen, quantized base model), does Lookahead still provide a large improvement over Medusa?

I'm especially trying to understand how Lookahead Decoding fares in terms of speed-up and memory consumption compared to Medusa. (Since Lookahead is exact decoding rather than an approximation, output quality should match the original base model, and may therefore be better than Medusa's.) Any info would help :)
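For concreteness, this is roughly the kind of measurement I have in mind (a minimal sketch; `generate_fn` is just a placeholder for whichever generation call is being tested, not the actual API of this repo or of Medusa):

```python
import time
import torch

def benchmark_decoding(generate_fn, prompts, warmup=2):
    """Measure throughput (tokens/s) and peak GPU memory for one decoding method.

    generate_fn(prompt) is a hypothetical callable: it runs one generation
    and returns the number of new tokens produced. Plug in the baseline,
    Lookahead, or Medusa generation here.
    """
    # Warm-up runs so one-time CUDA/kernel overhead doesn't skew the timing
    for p in prompts[:warmup]:
        generate_fn(p)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()

    total_tokens = 0
    for p in prompts:
        total_tokens += generate_fn(p)

    torch.cuda.synchronize()
    elapsed = time.time() - start

    return {
        "tokens_per_sec": total_tokens / elapsed,
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```

Running this with the same prompts, model, and sampling settings for both methods would give a directly comparable tokens/s and peak-memory number.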

Thanks in advance!