paul-gauthier / aider


[FEATURE] add loop capability to enhance cheap model performance #926

Open · Teskh opened 1 month ago

Teskh commented 1 month ago

Issue

Cheap models (like 4o-mini) seem to get performance comparable to SOTA models at a lower price when n attempts are run in parallel, with each attempt fed back into itself m times with the prompt "make it better". An evaluator then picks the output it likes best, which is then applied.

Reference: https://www.youtube.com/watch?v=0Z2BQPuUY50&t=979s
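For concreteness, here is one way the loop could be sketched in Python, assuming the OpenAI client and placeholder prompts and model names (none of this is aider's actual implementation):

```python
# Hypothetical sketch of the proposed loop: n parallel attempts, each refined
# m times, then an evaluator pass that picks the winner. Model names and
# prompts are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def one_attempt(task: str, m: int = 2) -> str:
    draft = ask(task)
    for _ in range(m):  # feed each attempt back into itself m times
        draft = ask(f"{task}\n\nPrevious attempt:\n{draft}\n\nMake it better.")
    return draft

def best_of_n(task: str, n: int = 4) -> str:
    # Run the n attempts in parallel; a batching-capable server would
    # make this cheap locally as well.
    with ThreadPoolExecutor(max_workers=n) as pool:
        attempts = list(pool.map(lambda _: one_attempt(task), range(n)))
    numbered = "\n\n".join(f"[{i}]\n{a}" for i, a in enumerate(attempts))
    verdict = ask(
        f"Pick the best solution below. Reply with its number only.\n{numbered}"
    )
    digits = "".join(c for c in verdict if c.isdigit())
    idx = int(digits) if digits else 0
    return attempts[idx] if idx < len(attempts) else attempts[0]
```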

P.S.: loving aider. Thanks a lot.

Version and model info

Aider v0.45.1

randoentity commented 1 month ago

LLM-generated TL;DR:

Original rant: As a local-first user this looks very interesting. If I understood correctly, running multiple requests in parallel is very efficient. I wonder how well this would work with smaller models like codestral or even 8B coding models.

It shouldn't be too difficult to get a POC of this plugged into Aider. And it's an excuse to finally play with a batching-capable serving engine. Now to find some time...

Edit: just remembered Wilmer, which might be a good option to avoid making (too many) changes to Aider. Individual tools should be good at their one specific task and all that. https://github.com/SomeOddCodeGuy/WilmerAI/

Rabbit hole edit: I was thinking about using grammars to force the model to stick to the requested format better. I'm not sure how GGUF does it, but with exllamav2 it looks very feasible to implement this with schemas and output limiting. Aider already evaluates the output pretty well. https://github.com/turboderp/exllamav2/blob/master/examples/inference_json.py

Note how we could, for example, build a Literal list of possible file paths.
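As a rough illustration of that idea (not the linked example's exact API): a pydantic model with a `Literal` field produces a JSON schema whose path field is an enum, which a schema-constrained sampler such as exllamav2's could then enforce. The paths below are made up:

```python
# Hypothetical sketch: constrain edit output to known file paths via a JSON
# schema, in the spirit of exllamav2's inference_json.py example.
from typing import Literal

from pydantic import BaseModel

# Assume these were collected from the repo map; the names are illustrative.
KnownPath = Literal["src/main.py", "src/utils.py", "tests/test_main.py"]

class EditBlock(BaseModel):
    path: KnownPath  # the grammar can only emit one of the listed paths
    search: str      # text to find
    replace: str     # text to substitute

# The "path" property becomes an enum in the schema, so a schema-constrained
# sampler can guarantee the model never invents a file path.
print(EditBlock.model_json_schema()["properties"]["path"])
```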

Then during the coder phase we could ban phrases like "rest of code" (there's also an example for that). It might even be possible to loop whenever a banned phrase appears, banning more variations on each pass until every variant of the unwanted phrase is blocked. Unfortunately some (most, maybe?) models are so stubborn that they'll just output gibberish when there is no embedding available for what they intend to say. I can only think of two ways around this.
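As a crude illustration of the ban-and-loop idea, here is a string-level sketch (all names hypothetical; a real implementation would ban token sequences during sampling rather than regex-checking afterwards):

```python
# Illustrative loop: regenerate while the output contains a banned phrase.
# Checking at the string level sidesteps tokenizer variations of the same
# phrase, at the cost of wasted generations. Everything here is hypothetical.
import re

BANNED = [r"rest of (the )?code", r"\.\.\. ?unchanged"]

def generate_without_banned(generate, prompt, max_tries=4):
    """Call `generate(prompt)` until no banned phrase appears, or give up."""
    text = generate(prompt)
    for _ in range(max_tries - 1):
        hits = [p for p in BANNED if re.search(p, text, re.IGNORECASE)]
        if not hits:
            return text
        # Feed the violation back so the model can correct itself.
        prompt += (
            f"\nDo not elide code; you wrote a placeholder matching {hits[0]!r}."
        )
        text = generate(prompt)
    return text
```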

Exllamav2 also has some batching examples. I'm excited.