Practical example: when switching to the `gemma:7b-instruct-q8_0` model, things break:

Ollama error: (HTTP/1.1 500 Internal Server Error) exception [json.exception.type_error.316] invalid UTF-8 byte at index 21: 0x69

But resetting the context in that buffer with `(setq gptel--ollama-context nil)` makes it work again.
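For convenience, the reset can be wrapped in a small command. A minimal sketch; `my/gptel-reset-ollama-context` is my own name, and it assumes `gptel--ollama-context` is still the variable gptel uses for this:

```emacs-lisp
;; Hypothetical workaround command: clear the saved Ollama context
;; vector in the current buffer so the next request starts fresh.
(defun my/gptel-reset-ollama-context ()
  "Reset gptel's saved Ollama context in the current buffer."
  (interactive)
  (setq gptel--ollama-context nil))
```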
I tried simply commenting out the `(setq gptel--ollama-context ...)` in `gptel-curl--parse-stream` in the `gptel-ollama` backend. For my usage this hugely improves the user experience. I get predictable and consistent results in my text buffers, and I know exactly what is being sent.
This is related to the discussion in #272.
> I tried simply commenting out the `(setq gptel--ollama-context ...)` in `gptel-curl--parse-stream` in the `gptel-ollama` backend. For my usage this hugely improves the user experience.
This is a bad idea: you cannot have a stateful conversation (i.e. more than one response) with Ollama if you remove the context vector.
> I get predictable and consistent results in my text buffers, and I know exactly what is being sent.
Only the latest prompt is being sent, so this is probably not what you want.
I do need to address the original issue, which is to reset the context after switching Ollama models; I will get to it when I next have time for gptel.
> This is a bad idea: you cannot have a stateful conversation (i.e. more than one response) with Ollama if you remove the context vector.
Well, you arguably can, by sending the whole conversation back as a block of text, not divided into prompt/response pairs; this is exactly what I'm doing and it works great.
I do not want any hidden context when working with text buffers. What I'm looking for is predictability: I want to send only what I see on the screen.
Resetting the context when switching models is definitely necessary, but at least for me, I do additionally want to disable any hidden state. This might be different for strictly conversational gptel buffers.
> Well, you arguably can, by sending the whole conversation back as a block of text, not divided into prompt/response pairs; this is exactly what I'm doing and it works great.
If you are using gptel in any buffer with Ollama, and not with a custom function using the lower-level `gptel-request`, you cannot be doing this. gptel only collects the latest user prompt when interacting with Ollama.
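If you want that behavior, a custom function along these lines should work. A rough sketch that passes a plain string prompt to `gptel-request`, bypassing gptel's prompt collection; the function name is illustrative:

```emacs-lisp
(require 'gptel)

;; Hypothetical helper: send everything visible in the buffer as a
;; single user prompt, with no hidden per-buffer state involved.
(defun my/gptel-send-whole-buffer ()
  "Send the entire buffer contents as a single prompt."
  (interactive)
  (gptel-request
   (buffer-substring-no-properties (point-min) (point-max))))
```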
> Resetting the context when switching models is definitely necessary, but at least for me, I do additionally want to disable any hidden state. This might be different for strictly conversational gptel buffers.
As mentioned in #249, it looks like we can avoid this error-prone API by using a newly added, stateless Ollama endpoint.
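For context, the difference roughly looks like this. A sketch of the two payload shapes as documented for Ollama at the time; the model name and token values below are placeholders:

```emacs-lisp
(require 'json)

;; Stateful generate-style payload: the opaque "context" token vector
;; from the previous response must be echoed back on the next request.
(json-encode '(:model "gemma:7b-instruct-q8_0"
               :prompt "And in French?"
               :context [123 456 789]))

;; Stateless chat-style payload: the whole conversation is resent as
;; explicit messages each turn, so there is no hidden per-buffer state.
(json-encode '(:model "gemma:7b-instruct-q8_0"
               :messages [(:role "user" :content "Say hi in Spanish.")
                          (:role "assistant" :content "¡Hola!")
                          (:role "user" :content "And in French?")]))
```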
This might merit opening a new issue, but I'm worried about opening too many issues anyway, so I'll mention it here instead. The context information seems not to be doing its intended job even in conversational (gptel) buffers.

Here are the results in two newly created gptel buffers. Note how the conversation gets stuck on the first response and does not recover, even though the model is capable of generating the response to the second question. To test this, I created a gptel buffer, asked the questions, then killed the buffer, created a new one, and asked the questions again in a different order. From my point of view, the Ollama context does not work correctly, or at least not for all models, which makes it unpredictable.
@jwr Could you switch to the `ollama-chat` branch and try using Ollama? I moved gptel-ollama over to the (new-ish) Ollama chat API, so all issues in this thread should be fixed. There is no longer a `gptel--ollama-context` variable. It should now be stateless and function exactly like the OpenAI API does.
Please check with the dry run options to be sure that the prompt looks like what you would expect. I can merge it into master after some testing.
The only disadvantage is that you need a recent version of Ollama installed (0.25 or higher), and gptel won't work with older versions any more.
The context-related problems are gone 👍 and I can also switch between models without them breaking or doing unexpected things. Big improvement there! 🙂
I still struggle with sending exactly what I want to send. I generated some text into `*scratch*` and wanted to use it as input. But gptel fights me there, and not very consistently either: notice, for example, where the directive is shown when I don't have a region active versus where it is when I select the text I'd like to send (see screenshots). And even with the region active I can't get a meaningful answer from the LLM, because the region gets sent as an "assistant" message.
The assumption that whatever an LLM generated is invisibly marked and can't be considered part of my input in subsequent queries does not hold for me, and I would argue that it doesn't make much sense in non-conversational buffers. Also, even if this functionality were useful, the current UI does not indicate which text gets sent and how. In practical terms, this means that when I work with gptel, I have to regularly kill my buffers and re-create them to get rid of the invisible annotations. But I guess that's off-topic for this issue. Most importantly, the context problems are gone!
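A possible workaround, instead of recreating buffers, might be to strip the marking in place. This sketch assumes the `gptel` text property is what tags LLM output; the command name is mine:

```emacs-lisp
;; Hypothetical workaround: remove gptel's response marking from a
;; region so the text is treated as user input again.
(defun my/gptel-unmark-region (beg end)
  "Strip gptel's response text property between BEG and END."
  (interactive "r")
  (remove-text-properties beg end '(gptel nil)))
```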
@jwr thank you for testing! I've merged it into master, you can switch back now.
The context/conversation details can be discussed in #291; I'll close this issue now.
Following up on this:
I am doing plenty of testing with various Ollama models, and it occurred to me that the context information isn't meaningful anyway if you switch to a different model; yet gptel sends the old context information again after switching. I think at the very least it should be deleted when switching models.
Personally, I would still much rather have no state kept between invocations from my buffers; I only expect gptel buffers to be stateful.
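Something along these lines is what I have in mind for the reset. A rough sketch with my own names; it assumes the `gptel--ollama-context` variable and the `gptel-model` option from before the chat-API change:

```emacs-lisp
(defvar-local my/gptel-last-model nil
  "Model used for the previous gptel send in this buffer.")

;; Hypothetical advice: clear the saved Ollama context whenever the
;; model has changed since the last send in this buffer.
(defun my/gptel-reset-context-on-model-switch (&rest _)
  "Drop the saved Ollama context if `gptel-model' changed."
  (unless (equal gptel-model my/gptel-last-model)
    (setq gptel--ollama-context nil
          my/gptel-last-model gptel-model)))

(advice-add 'gptel-send :before #'my/gptel-reset-context-on-model-switch)
```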