paul-gauthier / aider


[FEATURE] add loop capability to enhance cheap model performance #926

Open · Teskh opened 1 month ago

Teskh commented 1 month ago

Issue

Cheap models (like 4o-mini) seem to get performance comparable to SOTA models at a lower price when n attempts are run in parallel, with each attempt fed back into itself m times with the prompt "make it better". An evaluator then picks the output it likes best, which is then applied.

Reference: https://www.youtube.com/watch?v=0Z2BQPuUY50&t=979s
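For concreteness, here is one way the loop could be sketched in Python, assuming the OpenAI client and placeholder prompts and model names (none of this is aider's actual implementation):

```python
# Hypothetical sketch of the proposed loop: n parallel attempts, each refined
# m times, then an evaluator pass that picks the winner. Model names and
# prompts are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def one_attempt(task: str, m: int = 2) -> str:
    draft = ask(task)
    for _ in range(m):  # feed each attempt back into itself m times
        draft = ask(f"{task}\n\nPrevious attempt:\n{draft}\n\nMake it better.")
    return draft

def best_of_n(task: str, n: int = 4) -> str:
    # Run the n attempts in parallel; a batching-capable server would
    # make this cheap locally as well.
    with ThreadPoolExecutor(max_workers=n) as pool:
        attempts = list(pool.map(lambda _: one_attempt(task), range(n)))
    numbered = "\n\n".join(f"[{i}]\n{a}" for i, a in enumerate(attempts))
    verdict = ask(
        f"Pick the best solution below. Reply with its number only.\n{numbered}"
    )
    digits = "".join(c for c in verdict if c.isdigit())
    idx = int(digits) if digits else 0
    return attempts[idx] if idx < len(attempts) else attempts[0]
```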

P.S.: loving aider. Thanks a lot.

Version and model info

Aider v0.45.1

randoentity commented 1 month ago

LLM-generated TL;DR:

Original rant: As a local-first user this looks very interesting. If I understood correctly, running multiple requests in parallel is very efficient. I wonder how well this would work with smaller models like codestral or even 8B coding models.

It shouldn't be too difficult to get a POC of this plugged into Aider. And it's an excuse to finally play with a batching-capable serving engine. Now to find some time...

Edit: just remembered Wilmer, which might be a good option to avoid making (too many) changes to Aider. Individual tools should be good at their one specific task and all that. https://github.com/SomeOddCodeGuy/WilmerAI/

Rabbit hole edit: I was thinking about using grammars to force the model to stick to the requested format better. I'm not sure how GGUF does it, but with exllamav2 it looks very feasible to implement this with schemas and output limiting. Aider already evaluates the output pretty well. https://github.com/turboderp/exllamav2/blob/master/examples/inference_json.py

Note how we could, for example, build a Literal list of possible file paths.
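As a rough illustration of that idea (not the linked example's exact API): a pydantic model with a `Literal` field produces a JSON schema whose path field is an enum, which a schema-constrained sampler such as exllamav2's could then enforce. The paths below are made up:

```python
# Hypothetical sketch: constrain edit output to known file paths via a JSON
# schema, in the spirit of exllamav2's inference_json.py example.
from typing import Literal

from pydantic import BaseModel

# Assume these were collected from the repo map; the names are illustrative.
KnownPath = Literal["src/main.py", "src/utils.py", "tests/test_main.py"]

class EditBlock(BaseModel):
    path: KnownPath  # the grammar can only emit one of the listed paths
    search: str      # text to find
    replace: str     # text to substitute

# The "path" property becomes an enum in the schema, so a schema-constrained
# sampler can guarantee the model never invents a file path.
print(EditBlock.model_json_schema()["properties"]["path"])
```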

Then during the coder phase we could ban phrases like "rest of code" (there's also an example for that). It might even be possible to loop whenever a banned phrase appears, banning more variations on each pass until every variant of the unwanted phrase is blocked. Unfortunately some (most, maybe?) models are so stubborn that they'll just output gibberish when there is no embedding available for what they intend to say. I can only think of two ways around this.
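As a crude illustration of the ban-and-loop idea, here is a string-level sketch (all names hypothetical; a real implementation would ban token sequences during sampling rather than regex-checking afterwards):

```python
# Illustrative loop: regenerate while the output contains a banned phrase.
# Checking at the string level sidesteps tokenizer variations of the same
# phrase, at the cost of wasted generations. Everything here is hypothetical.
import re

BANNED = [r"rest of (the )?code", r"\.\.\. ?unchanged"]

def generate_without_banned(generate, prompt, max_tries=4):
    """Call `generate(prompt)` until no banned phrase appears, or give up."""
    text = generate(prompt)
    for _ in range(max_tries - 1):
        hits = [p for p in BANNED if re.search(p, text, re.IGNORECASE)]
        if not hits:
            return text
        # Feed the violation back so the model can correct itself.
        prompt += (
            f"\nDo not elide code; you wrote a placeholder matching {hits[0]!r}."
        )
        text = generate(prompt)
    return text
```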

Exllamav2 also has some batching examples. I'm excited.