Thanks for the initial implementation. The speed-up results look great.
Just wondering though: are there any stats or results comparing with Medusa? I see that Medusa doesn't need a draft model either, and also guesses/predicts future tokens at the current step. I understand that Medusa needs finetuning for its heads to predict future tokens, while Lookahead doesn't, but assuming we already have finetuned heads (minimal cost, considering a frozen quantized base model), does Lookahead provide large improvements over Medusa?
In particular, I'm trying to understand how Lookahead Decoding fares in terms of speed-up and memory consumption compared to Medusa. (Since Lookahead is exact decoding rather than an approximation, its output quality should match the original base model's, and hence it might be better than Medusa in that respect.)
So any info can help :)
Thanks in advance!