jafioti / luminal

Deep learning at the speed of light.
https://luminalai.com
Apache License 2.0

[feature suggestion] self speculative decoding #61

Open NewBornRustacean opened 1 month ago

NewBornRustacean commented 1 month ago

Good morning (or afternoon/evening)!

Among the techniques for speeding up LLM inference, there is a method called self-speculative decoding. Would it be possible to implement this feature in Luminal? If it aligns with Luminal's philosophy, I believe this kind of work could contribute a lot to inference speed! Even though it's not on the v0.3 roadmap, I'd like to start working on it slowly, if that's alright.

Summary of the abstract: This paper introduces self-speculative decoding, a novel inference scheme that accelerates Large Language Model (LLM) inference without relying on auxiliary models. It operates in two stages: drafting, which quickly generates draft tokens by selectively skipping intermediate layers, and verification, which validates the draft tokens with the original LLM in a single forward pass. The approach keeps output quality identical to that of the unaltered LLM, requires no additional neural network training or extra memory, and offers a plug-and-play, cost-effective solution for inference acceleration, with benchmarks showing speedups of up to 1.73× on LLaMA-2 and its fine-tuned models.

jafioti commented 1 month ago

Yes, self-speculative decoding is something I've been interested in for a while. I think it's entirely possible, though I'm a little fuzzy on how the layers are actually chosen for the draft pass. If the skipped layers are fixed, i.e. they don't change between draft passes, then I think it's very straightforward: you can essentially just have a forward() and a forward_draft() on the module to do each pass.
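To sketch what I mean, one speculative step could look roughly like this in plain Rust (greedy sampling; `Model`, `argmax`, and the token handling are placeholders I made up, not Luminal API):

```rust
// Sketch of one speculative step with greedy sampling. `Model` and its
// methods are stand-ins for whatever the Luminal module ends up being.
struct Model;

impl Model {
    /// Full pass: logits for every position of `tokens`.
    fn forward(&self, tokens: &[u32]) -> Vec<Vec<f32>> {
        unimplemented!()
    }
    /// Draft pass: same shape, but with the chosen layers skipped.
    fn forward_draft(&self, tokens: &[u32]) -> Vec<Vec<f32>> {
        unimplemented!()
    }
}

fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

/// Draft `k` tokens cheaply, then verify them all with one full pass,
/// keeping the longest accepted prefix. Assumes `context` is non-empty.
fn speculative_step(model: &Model, context: &mut Vec<u32>, k: usize) {
    // Drafting: autoregressively extend the context with the cheap pass.
    let mut draft = context.clone();
    for _ in 0..k {
        let logits = model.forward_draft(&draft);
        draft.push(argmax(logits.last().unwrap()));
    }

    // Verification: a single full forward pass over context + drafts.
    let logits = model.forward(&draft);
    let mut all_accepted = true;
    for i in context.len()..draft.len() {
        // The full model predicts token i from the logits at position i - 1.
        let verified = argmax(&logits[i - 1]);
        context.push(verified);
        if verified != draft[i] {
            all_accepted = false; // keep the corrected token, drop the rest
            break;
        }
    }
    if all_accepted {
        // Every draft token matched, so the last logits give a bonus token.
        context.push(argmax(logits.last().unwrap()));
    }
}
```

The key point is that verification is a single forward() call no matter how many draft tokens there are.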

I would suggest using two graphs, one for the normal forward pass and one for the draft pass. During inference you can quickly move weights between graphs with the transfer_data() function.
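Very roughly, the two-graph setup could look like this (I haven't pinned down the exact transfer_data() signature here, so treat that call as pseudocode):

```rust
// Rough shape of the two-graph setup; signatures are assumed, not
// checked against the current Luminal API.
use luminal::prelude::*;

fn setup() {
    // One graph for the full model, one for the layer-skipping draft model.
    let mut full_cx = Graph::new();
    let mut draft_cx = Graph::new();

    // ... build the full model on `full_cx` and the draft variant
    // (same weights, fewer layers) on `draft_cx`, load weights once ...

    // At inference time, move the shared weights across with
    // transfer_data() instead of loading them twice, e.g.:
    // transfer_data(weight_ids, &mut full_cx, draft_weight_ids, &mut draft_cx);
}
```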

I'm super excited to see where this goes. Lmk if you have any questions! I'd be happy to help.

NewBornRustacean commented 1 month ago

Thanks! @jafioti

I'm gonna start with two graphs as you suggested.

Btw, where do you think the right place for this feature to live would be? Creating a generation.rs either in luminal/crates/luminal_nn/src/ or in luminal/crates/luminal_nn/src/transformer seems possible, I guess.

jafioti commented 1 month ago

I would suggest just making it an example for now (copy llama and rename it llama_speculative) and working off of that until it takes shape. Once it works, we can see how it fits into the whole ecosystem.

NewBornRustacean commented 1 month ago

Good morning! @jafioti

According to the original implementation of the paper (Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding), the skipped layers are chosen in advance (via a Gaussian process; I think this isn't suitable for runtime execution).

It seems the Bayesian optimization discussed in the paper is implemented with the Python library "bayes_opt". Once the skipped layers are determined, they don't seem to change during the draft pass. However, in my opinion, the best layers to skip could differ depending on the LLM and the prompt. So for now, I'm thinking of implementing a function that takes the skipped layers as input (maybe as a const generic).
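Here's a rough, standalone sketch of what I mean; `Layer` and `DraftModel` are placeholders I made up, nothing Luminal-specific:

```rust
// Placeholder sketch of a draft pass that takes the skipped layers as input.
struct Layer;

impl Layer {
    fn forward(&self, x: Vec<f32>) -> Vec<f32> {
        x // placeholder: a real block would do attention + MLP here
    }
}

struct DraftModel {
    layers: Vec<Layer>,
    /// Layer indices chosen offline (e.g. by the paper's Bayesian search).
    skipped: Vec<usize>,
}

impl DraftModel {
    fn forward_draft(&self, mut x: Vec<f32>) -> Vec<f32> {
        for (i, layer) in self.layers.iter().enumerate() {
            if self.skipped.contains(&i) {
                continue; // skip this layer during drafting
            }
            x = layer.forward(x);
        }
        x
    }
}
```

A const generic would fix the skipped set at compile time, while a plain Vec<usize> keeps it runtime-configurable, which fits the idea that the best set might differ per model or prompt.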

jafioti commented 1 month ago

Sure, sounds good 👍

jafioti commented 3 weeks ago

Hey @NewBornRustacean , how's it been going with speculative decoding? If you need any help, feel free to reach out. Happy to talk if you're stuck on anything.

NewBornRustacean commented 3 weeks ago

Thanks for your comment! Actually, I've been so busy with work for the past few weeks that I haven't been able to get much done. This week is a holiday in Korea, so I think I'll finally have some time! If I run into any difficulties, I'll ask for help right away.

Btw, the slides you shared (on Discord) were very helpful for understanding the concepts.

During my commute, I've been reading the Mirage paper you shared, and although it's difficult, I find it interesting. I want to learn more about this topic so I can contribute more to Luminal!