Closed: 0xdevalias closed this issue 2 months ago
They are doing interesting work with this, yes.
I replied in their Twitter thread:
> Super interesting work! Sounds like you use a strong model (opus/gpt-4o) to generate code changes and a weak model to "apply" them to the file? I've played with this for a ~year, but had concerns:
> - Adding a 2nd inference step adds latency. Your work helps here!
> - Can only apply edits to files that fit in the weak model's context & output token limits.
> - Do you reliably get working code at the end? Success now depends on 2 LLMs not goofing up.
>
> (2) is the biggest concern, since you can't edit large files. Did you evaluate (3)?
Figuring out (2) is a blocker for adopting something like this in aider, where editing large files is a huge benefit. And I'd want to evaluate (3) similar to the way all of aider's editing backends get benchmarked.
Beyond that is the need to actually fine tune and host such a model someplace and make it available to aider users. Since aider is an open source tool, that carries a lot of operational overhead and costs that would need to get figured out and weighed against the benefits.
Their reply on that thread (for context):
> - 2 is definitely a concern and we’re working on solving this with long context extensions
> - Our evals certainly could use work bc it depends on an LLM grader, rather than running the code/tests (like in aider’s benchmarks). We have a few ideas for improving things here
>
> But we’ve found that letting the models output code in the format they know best (a standard chat response) works very well for planning single-file edits. Better than having the model directly make the change to the entire full file (and faster!). Then it’s just a question of making the apply model’s accuracy close to 100%, which isn’t terrible bc it is such a simple task
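For concreteness, a rough sketch of what that plan-then-apply flow might look like. Everything here (the `.complete()` interface, both prompts, and the function names) is hypothetical, just to illustrate the shape of the pipeline, not cursor.sh's actual implementation:

```python
# Hypothetical sketch of the plan-then-apply pipeline described above.
# `chat_llm` is the strong model, `apply_llm` the small fine-tuned one;
# the `.complete()` interface and both prompts are made up for illustration.

def plan_edit(chat_llm, file_text: str, request: str) -> str:
    """Strong model plans the edit as an ordinary chat response."""
    return chat_llm.complete(
        f"File:\n{file_text}\n\nChange request: {request}\n"
        "Reply with just the edited section, as a normal chat answer."
    )

def fast_apply(apply_llm, file_text: str, edit_snippet: str) -> str:
    """Weak model rewrites the whole file with the edit merged in.
    The full file must fit in its context/output limits: concern (2) above."""
    return apply_llm.complete(
        f"Original file:\n{file_text}\n\nEdit to apply:\n{edit_snippet}\n\n"
        "Output the complete updated file."
    )
```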
--
> Figuring out (2) is a blocker for adopting something like this in aider, where editing large files is a huge benefit. And I'd want to evaluate (3) similar to the way all of aider's editing backends get benchmarked.
@paul-gauthier nods, yeah, that makes sense.
> Beyond that is the need to actually fine tune and host such a model someplace and make it available to aider users. Since aider is an open source tool, that carries a lot of operational overhead and costs that would need to get figured out and weighed against the benefits.
@paul-gauthier nods, yeah, true. I think I originally didn't realise that it required a finetuned model/etc, as that seemed to just be for applying the diffs (which aider already handles in its own way); and thought that maybe the bulk of the benefits could be realised just through the 'speculative decoding' aspect; but to be fair, my knowledge in that space is super limited at best. A few resources I found on it:
In the Fast Lane! Speculative Decoding — 10x Larger Model, No Extra Cost
Speculative Decoding — Make LLM Inference Faster
In this blog, we’ll discuss Speculative Decoding in detail, which is a method to improve LLM inference speed by around 2–3X without degrading accuracy. We’ll also look into implementing Speculative Decoding and see how fast it is compared to a naive transformer implementation.
Fast inference from transformers via speculative decoding
This repository implements speculative sampling for large language model (LLM) decoding. It utilizes two models during the decoding process: a target model and an approximation model. The approximation model is a smaller model, while the target model is a larger one. The approximation model generates token guesses, and the target model corrects these guesses. This approach allows for decoding by running the target model in parallel on the outputs of the approximation models, resulting in improved efficiency compared to decoding with the target model alone.
Speculative sampling was proposed by Google and DeepMind independently, so this repository implements two slightly different versions of speculative sampling: Google's and DeepMind's.
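To make the mechanics concrete, here is a minimal sketch of the greedy variant of that loop. It assumes `draft` and `target` are callables mapping a token sequence to its argmax next token; a real implementation would score all the drafted prefixes with the target model in a single batched forward pass:

```python
# Minimal sketch of greedy speculative decoding. `draft` and `target`
# each map a token sequence to the next token (argmax decoding). For
# clarity the target verifies guesses in a loop; in practice this is
# one parallel forward pass over all k prefixes.
def speculative_decode_greedy(target, draft, prompt, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:  # may overshoot by a few tokens
        # 1. Draft model cheaply guesses the next k tokens serially.
        guesses = []
        for _ in range(k):
            guesses.append(draft(tokens + guesses))
        # 2. Target model checks each guess; keep the longest agreeing prefix.
        accepted = 0
        for i in range(k):
            if target(tokens + guesses[:i]) == guesses[i]:
                accepted += 1
            else:
                break
        tokens += guesses[:accepted]
        # 3. At the first mismatch (or after a full accept), emit the
        #    target's own token, so output matches target-only decoding.
        tokens.append(target(tokens))
    return tokens
```

When the draft agrees with the target most of the time, each iteration advances up to k+1 tokens for roughly the cost of one target forward pass, which is where the reported 2-3x speedups come from.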
Fast Inference from Transformers via Speculative Decoding
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
Accelerating Large Language Model Decoding with Speculative Sampling
We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
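The "modified rejection sampling scheme" they mention reduces, per drafted token, to a simple accept-or-resample rule. A minimal sketch, assuming `p` and `q` are the target's and draft's full next-token distributions as numpy arrays and `x` is the token the draft actually sampled:

```python
import numpy as np

# Sketch of the accept/reject step from speculative sampling: accept the
# draft's token x with probability min(1, p[x]/q[x]); otherwise resample
# from the normalized residual max(p - q, 0). This leaves the emitted
# token distributed exactly according to the target distribution p.
def accept_or_resample(x: int, p: np.ndarray, q: np.ndarray,
                       rng: np.random.Generator) -> tuple[int, bool]:
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True          # keep the draft token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()  # renormalize the leftover probability mass
    return int(rng.choice(len(p), p=residual)), False
```

Accepted tokens keep the draft's guess; a rejection costs one resample but still yields a token distributed exactly according to the target model, which is why the outputs are indistinguishable from decoding the target directly.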
Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding heavily depends on the choice of the draft model. In this work, we perform a detailed study comprising over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate the factors that affect the performance gain provided by speculative decoding. Our experiments indicate that the performance of speculative decoding depends heavily on the latency of the draft model, and the draft model's capability in language modeling does not correlate strongly with its performance in speculative decoding. Based on these insights we explore a new design space for draft models and design hardware-efficient draft models for speculative decoding. Our newly designed draft model for LLaMA-65B can provide 60% higher throughput than existing draft models and can generalize further to the LLaMA-2 model family and supervised fine-tuned models.
I'm going to close this issue for now, but feel free to add a comment here and I will re-open or file a new issue any time.
> and thought that maybe the bulk of the benefits could be realised just through the 'speculative decoding' aspect;
@paul-gauthier Just to confirm, in closing this, is it because you don't believe there are benefits to be gained from the speculative decoding aspect (regardless of finetunes); or that it doesn't seem a good fit with how aider is currently set up to work; etc?
I believe speculative decoding would only be used by the actual code that is directly doing model inference. Aider calls out to other systems for inference, even with local models.
@paul-gauthier Yeah ok, that definitely makes sense. Thanks for clarifying :)
Issue
There was a recent thread/blog post about cursor.sh's 'fast apply' changes:
In the thread, they made this comment related to aider's diffs vs speculative edits:

Version and model info
N/A