karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0

[FR] Send other buffers as context #176

Closed RomeoV closed 1 week ago

RomeoV commented 5 months ago

Awesome library! I've been able to set it up successfully with a remote machine running llamafile.

I would love to use this as a coding assistant a la copilot etc. For this, it is usually really useful for the model if it has access to the rest of the buffer, and possibly other buffers, as context.

Legend has it that the VSCode copilot plugin sends the 20 most recent buffers as context (citation needed). Would it be possible to include something similar here as well?

karthink commented 5 months ago

I would love to use this as a coding assistant a la copilot etc.

gptel isn't good enough for this use case. It starts an external process for each request and only does http requests (no websockets or local process-filter support), so I suspect that it's not going to be fast enough to be a coding assistant.


For this, it is usually really useful for the model if it has access to the rest of the buffer,

This is doable.

Independent of copilot use, I'm not sure what the UI for this will look like. The current model is very simple: everything before the cursor is sent as context, or the region if it is selected. So you do get very basic top-to-cursor context right now.

Do you have any ideas for how to include the "full current buffer as context" option (as opposed to top-to-cursor) in a way that will not confuse the user?


and possibly other buffers, as context.

My understanding is copilot computes a Jaccard similarity score to determine which buffers to send, along with some extra statistical analysis (a logistic regression) to determine what data to send and when. I'm interested in this as a project, but it seems out of scope for gptel, which has "simple LLM client" in the tag line!

RomeoV commented 5 months ago

For speed, a good copilot-like solution will probably involve an LLM inference process that acts as an LSP server -- I can't find it right now but I think this kind of thing exists already? It was Python-based, as I recall.

Apparently lsp-bridge can interact with copilot (I haven't tried it though).

Perhaps, though, the more reasonable approach would be to develop some kind of LSP server binary that wraps your LLM of choice, instead of squeezing all kinds of logic into the Emacs plugin. (This is basically what copilot is as a product, if I understand correctly.)


I'm not sure what the UI for this will look like.

What I was thinking is that in the gptel-menu for code (e.g. where you also have a "refactor" option), you can display an option like "send last n buffers as context", and the buffers are then pasted before the prompt, or are inserted as history. An additional variable (not exposed via transient) could be whether only files of the same file type are provided as context or not.

Otherwise, there could also be a project-local list variable that you can add buffers to, e.g. by calling gptel-add-buffer-to-context-list. Then, those files would be provided as context as above.
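
To make that concrete, something like the sketch below is what I have in mind. All of the names are invented for illustration (nothing like this exists in gptel as far as I know), and the actual wiring into the request would of course look different:

```elisp
;; Hypothetical sketch of a per-buffer context list and a command to
;; populate it.  None of these names exist in gptel; they only
;; illustrate the idea.
(require 'cl-lib)

(defvar-local gptel-context-buffers nil
  "Buffers whose contents should be sent as additional context.")

(defun gptel-add-buffer-to-context-list (buffer)
  "Add BUFFER to the list of buffers sent as context with each request."
  (interactive "bAdd buffer as context: ")
  (cl-pushnew (get-buffer buffer) gptel-context-buffers))

(defun gptel--context-string ()
  "Concatenate the contents of all context buffers into one string."
  (mapconcat (lambda (buf)
               (with-current-buffer buf
                 (buffer-substring-no-properties (point-min) (point-max))))
             gptel-context-buffers
             "\n\n"))
```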


I suspect that it's not going to be fast enough to be a coding assistant.

Probably not as in "instant completion", but I think when refactoring or asking questions about a code snippet it could definitely help, and it doesn't need to be super fast. IIRC, at least llama.cpp also has some token caching, so that the other files would only be encoded once (like 60% sure on that one).


My understanding is copilot computes a Jaccard similarity score [...]

I agree that this would be overkill for the current project. Tbh I haven't looked too much into the details of what copilot does, and it's probably not trivial to find out which files should be provided in a language-agnostic way. Again, perhaps something for another binary, instead of this package.

karthink commented 5 months ago

Perhaps, though, the more reasonable approach would be to develop some kind of lsp-server binary that wraps your LLM of choice

This is exactly what the Python project in question does. I'll try to find it.


IIRC, at least llama.cpp also has some token caching, so that the other files would only be encoded once (like 60% sure on that one).

I'll have to look into this. The Ollama API returns an embedding of the chat so far with each request, and you can send buffers just once and send the embedding vectors from then onward. This is not the case for Llama.cpp's OpenAI-compatible API, which takes only text with no other state/context.


What I was thinking is that in the gptel-menu for code (e.g. where you also have a "refactor" option), you can display an option like "send last n buffers as context"

I'm thinking the transient menu can include a "set context" sub-menu (like the refactor sub-menu), where the user can choose between a bunch of existing and new options: the buffer up to point (the current default), the full buffer, or a set of other buffers. A rough sketch follows below.

It's going to take some experimentation to see if this menu can work in an intuitive way. It's also tricky to get this to work with different LLM APIs that return/don't return embeddings. Should the files be sent once, or repeatedly with every subsequent request in this buffer?
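
As a rough sketch of the shape such a menu could take (everything below is hypothetical, invented names and all, not existing gptel code), the sub-menu could be built with transient like this:

```elisp
;; Hypothetical "set context" sub-menu built with transient.
;; Only illustrative; gptel has no such menu or variable.
(require 'transient)

(defvar gptel--context-scope 'to-point
  "What gptel sends as context: `to-point' or `whole-buffer'.")

(defun gptel-context-to-point ()
  "Send the buffer up to point as context (the current behavior)."
  (interactive)
  (setq gptel--context-scope 'to-point))

(defun gptel-context-whole-buffer ()
  "Send the full current buffer as context."
  (interactive)
  (setq gptel--context-scope 'whole-buffer))

(transient-define-prefix gptel-set-context-menu ()
  "Choose how much context gptel sends with each request."
  ["Context scope"
   ("p" "Buffer up to point (default)" gptel-context-to-point)
   ("b" "Whole buffer" gptel-context-whole-buffer)])
```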

RomeoV commented 5 months ago

I'll have to look into this.

I think this might be a pointer. Tbh I'm not too familiar with the internals, but I have played around with --prompt-cache-all: https://github.com/ggerganov/llama.cpp/issues/64

Should the files be sent once, or repeatedly with every subsequent request in this buffer?

Just like the model has access to the previous parts of the conversation in a chat, I would imagine that the code just has to be sent once (and then hopefully cached in some way).

(PS: Need to step away from laptop now for today.)

karthink commented 5 months ago

I think this might be a pointer. Tbh I'm not too familiar with the internals, but I have played around with --prompt-cache-all: https://github.com/ggerganov/llama.cpp/issues/64

I'll take a look.

Just like the model has access to the previous parts of the conversation in a chat, I would imagine that the code just has to be sent once (and then hopefully cached in some way).

The model does not have access to previous parts of the conversation. All LLM APIs I've seen are stateless, as you'd expect from REST. Either the entire chat is sent with every request (OpenAI-compatible APIs), or the model returns an embedding vector representing the conversation that has to be sent with every request (Ollama).
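
To illustrate, here is a rough sketch of the two request shapes (not gptel's actual request code; the field names follow the OpenAI chat API and Ollama's /api/generate endpoint, and the values are placeholders):

```elisp
(require 'json)

;; OpenAI-compatible endpoint: stateless, so the whole conversation is
;; included in the body of every request.
(json-encode
 '(("model"    . "some-local-model")
   ("messages" . [(("role" . "user")      ("content" . "First question"))
                  (("role" . "assistant") ("content" . "First answer"))
                  (("role" . "user")      ("content" . "Follow-up"))])))

;; Ollama's /api/generate: send only the new prompt plus the "context"
;; vector returned with the previous response, which stands in for the
;; chat history.
(json-encode
 '(("model"   . "llama2")
   ("prompt"  . "Follow-up")
   ("context" . [128 1034 2207])))  ; placeholder vector from the previous response
```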

jacereda commented 5 months ago
  • For speed, a good copilot-like solution will probably involve an LLM inference process that acts as an LSP server -- I can't find it right now but I think this kind of thing exists already? It was Python-based, as I recall.

Could this one be https://github.com/freckletonj/uniteai ?

jacereda commented 5 months ago

And the Emacs plugin seems to live here: https://github.com/emacs-openai/lsp-uniteai

karthink commented 5 months ago

@jacereda That was the one, thanks! It looks like a better (and more universal) solution for copilot-style usage.

Setting the context manually is still on the table for gptel, I'll work on it when I have a chunk of free time.

karthink commented 5 months ago

@RomeoV a preliminary attempt at a copilot-style workflow.

karthink commented 1 week ago

Support for sending other regions/buffers/files as context has been added in #256 by @daedsidog, so I'm closing this now.

The copilot feature can be addressed in a separate issue.