Is your feature request related to a problem? Please describe.
When editing the beginning of a long file, prompt evaluation takes a lot of time.
The reason for that is explained in Additional context below.
Currently we send a similar number of lines from the top and from the bottom. I believe there are reasons to make the bottom part smaller:
- It takes a long time to reevaluate the bottom lines.
- The bottom lines often aren't as important (IMO), so shrinking them leaves more of the context window for the top lines.
Describe the solution you'd like
I want separate Context Length options for the 'before' and 'after' parts.
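A minimal sketch of what this could look like inside the extension, assuming hypothetical setting names twinny.contextLengthBefore and twinny.contextLengthAfter (these are not Twinny's actual configuration keys):

```typescript
import * as vscode from "vscode";

// Hypothetical sketch: gather the FIM context with independent line budgets
// above and below the cursor. The setting names and defaults are illustrative,
// not Twinny's actual configuration keys.
function getFimContext(
  document: vscode.TextDocument,
  position: vscode.Position
): { prefix: string; suffix: string } {
  const config = vscode.workspace.getConfiguration("twinny");
  const linesBefore = config.get<number>("contextLengthBefore", 100); // generous budget above the cursor
  const linesAfter = config.get<number>("contextLengthAfter", 20); // smaller budget below the cursor

  const firstLine = Math.max(0, position.line - linesBefore);
  const lastLine = Math.min(document.lineCount - 1, position.line + linesAfter);

  // 'before' part: from the top of the window up to the cursor.
  const prefix = document.getText(
    new vscode.Range(new vscode.Position(firstLine, 0), position)
  );
  // 'after' part: from the cursor down to the bottom of the window.
  const suffix = document.getText(
    new vscode.Range(
      position,
      new vscode.Position(lastLine, document.lineAt(lastLine).text.length)
    )
  );

  return { prefix, suffix };
}
```

With a smaller 'after' budget, the suffix that follows the FIM hole stays short, so less of the prompt has to be reevaluated when it changes (see Additional context below).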
Describe alternatives you've considered
Alternatively, leave the current Twinny: Context Length setting as is, but add an optional override for the bottom lines.
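A sketch of how the override variant could resolve the 'after' budget, assuming the existing setting maps to a key like twinny.contextLength and using a hypothetical override key:

```typescript
import * as vscode from "vscode";

// Hypothetical sketch: resolve the number of lines sent below the cursor,
// falling back to the shared context length when no override is configured.
// Setting names are illustrative, not Twinny's actual configuration keys.
function resolveLinesAfter(): number {
  const config = vscode.workspace.getConfiguration("twinny");
  const contextLength = config.get<number>("contextLength", 100);
  const override = config.get<number>("contextLengthAfterOverride", -1);
  return override >= 0 ? override : contextLength;
}
```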
Additional context
For context:
AFAIK (this is mostly based on my assumptions), llama.cpp doesn't have to reevaluate the prefix part of the prompt that hasn't changed since the last generation. But the moment it encounters a change, it reevaluates everything after that change.
So when we have 2 requests in a row with prompts:
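(The original example prompts aren't reproduced here; the pair below is a purely illustrative guess, assuming deepseek-style FIM tokens and made-up file contents.)

First request, with the cursor right after import numpy:

```
<|fim▁begin|>import numpy
<|fim▁hole|>
...hundreds of lines of the rest of the file...
<|fim▁end|>
```

Second request, after the user types np at the cursor:

```
<|fim▁begin|>import numpy
np<|fim▁hole|>
...the same hundreds of lines of the rest of the file...
<|fim▁end|>
```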
It won't have to spend time evaluating import numpy. However, it will still have to reevaluate everything after <|fim▁hole|> (because it only checks for a matching prefix of the prompt).
(Example of llama.cpp output, not for this exact case: Llama.generate: 2978 prefix-match hit, remaining 8 prompt tokens to eval)