karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0

(Ollama) Old context information being sent after switching to a different model #279

Closed jwr closed 2 months ago

jwr commented 3 months ago

Following up on this:

This is not possible with Ollama. Unlike the OpenAI-inspired APIs, the Ollama API is stateful and works by passing a growing context vector back and forth along with the latest user prompt (and nothing else). Coupled with the previous point, you can see how only the last chunk is sent.

Hmm, OK. I thought it was possible to simply initiate an entirely new session with Ollama every time, with no context, or to drop the context and get a stateless API.
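
For anyone following along, this is roughly what that round trip looks like against Ollama's /api/generate endpoint. The field names are from the Ollama API docs; the model, prompts and context values are made up, and the plists are only a sketch of the payloads involved:

```elisp
(require 'json)

;; First request: only the prompt, no context yet.
(json-encode '(:model "mistral:latest"
               :prompt "What is a closure?"
               :stream :json-false))
;; => {"model":"mistral:latest","prompt":"What is a closure?","stream":false}

;; The final response chunk carries an opaque context vector, e.g.
;;   {"response": "...", "done": true, "context": [1, 563, 2099, ...]}
;; To continue the conversation, that vector has to be echoed back
;; together with only the *new* prompt:
(json-encode '(:model "mistral:latest"
               :prompt "Show me an example in Elisp."
               :context [1 563 2099]
               :stream :json-false))
```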

I have been doing a lot of testing with various Ollama models, and it occurred to me that the context information isn't meaningful anyway once you switch to a different model, yet gptel still sends the old context after switching. I think it should at the very least be cleared when switching models.

Personally, I would still much rather have no state kept between invocations from my buffers; it is only gptel buffers that I expect to be stateful at all.

jwr commented 2 months ago

Practical example: when switching to a gemma:7b-instruct-q8_0 model, things will break:

Ollama error: (HTTP/1.1 500 Internal Server Error) exception [json.exception.type_error.316] invalid UTF-8 byte at index 21: 0x69

But resetting the context in that buffer with (setq gptel--ollama-context nil) will make it work again.
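
For now I have wrapped that in a throwaway command (my own helper, not part of gptel) so I can clear it quickly; the variable is buffer-local, so this only affects the buffer it is called from:

```elisp
(defun my/gptel-ollama-reset-context ()
  "Drop the saved Ollama context vector for the current buffer."
  (interactive)
  (when (boundp 'gptel--ollama-context)
    (setq gptel--ollama-context nil)))
```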

jwr commented 2 months ago

I tried simply commenting out the (setq gptel--ollama-context ...) form in gptel-curl--parse-stream in the gptel-ollama backend. For my usage this hugely improves the user experience. I get predictable and consistent results in my text buffers, and I know exactly what is being sent.
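
For anyone who would rather not edit gptel-ollama.el, an untested sketch of the same effect is to clear the context with a piece of advice before every send:

```elisp
;; Untested sketch: reset the Ollama context before each request so
;; that only the newly collected prompt is ever sent.
(advice-add 'gptel-send :before
            (lambda (&rest _)
              (when (boundp 'gptel--ollama-context)
                (setq gptel--ollama-context nil))))
```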

This is related to the discussion in #272.

karthink commented 2 months ago

I tried simply commenting out the (setq gptel--ollama-context ...) form in gptel-curl--parse-stream in the gptel-ollama backend. For my usage this hugely improves the user experience.

This is a bad idea: you cannot have a stateful conversation (i.e. more than one response) with Ollama if you remove the context vector.

I get predictable and consistent results in my text buffers, and I know exactly what is being sent.

Only the latest prompt is being sent, so this is probably not what you want.


I do need to address the original issue, which is to reset the context after switching Ollama models; I will get to it when I next have time for gptel.

jwr commented 2 months ago

This is a bad idea: you cannot have a stateful conversation (i.e. more than one response) with Ollama if you remove the context vector.

Well, you arguably can, just by sending the whole conversation back, as a block of text, not divided into prompt/response pairs — this is exactly what I'm doing and it works great.
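
A sketch of what I mean, using the lower-level gptel-request: send everything above point as one block of text, with no per-response bookkeeping. The callback arguments are as documented for gptel-request; here it just echoes the reply:

```elisp
(defun my/gptel-send-visible-text ()
  "Send the buffer text up to point as a single prompt."
  (interactive)
  (gptel-request
      (buffer-substring-no-properties (point-min) (point))
    :callback (lambda (response _info)
                (if (stringp response)
                    (message "%s" response)
                  (message "gptel: request failed")))))
```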

I do not want any hidden context when working with text buffers. What I'm looking for is predictability: I want to send only what I see on the screen.

Resetting the context when switching models is definitely necessary, but at least for me, I do additionally want to disable any hidden state. This might be different for strictly conversational gptel buffers.

karthink commented 2 months ago

Well, you arguably can, just by sending the whole conversation back, as a block of text, not divided into prompt/response pairs — this is exactly what I'm doing and it works great.

If you are using gptel in any buffer with Ollama, and not with a custom function using the lower-level gptel-request, you cannot be doing this. gptel only collects the latest user prompt when interacting with Ollama.

Resetting the context when switching models is definitely necessary, but at least for me, I do additionally want to disable any hidden state. This might be different for strictly conversational gptel buffers.

As mentioned in #249, it looks like we can avoid this error-prone API by using a newly added, stateless Ollama endpoint.
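
Roughly, the chat endpoint takes the full message history on every request, OpenAI-style, and there is no context vector involved. A sketch of the payload, with field names per the Ollama docs and made-up content:

```elisp
(require 'json)

;; Every request to /api/chat carries the whole conversation so far;
;; the server keeps no state between requests.
(json-encode
 '(:model "mistral:latest"
   :stream :json-false
   :messages [(:role "user"      :content "What is a closure?")
              (:role "assistant" :content "A function plus its captured environment.")
              (:role "user"      :content "Show me an example in Elisp.")]))
```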

jwr commented 2 months ago

This might merit opening a new issue, but I'm wary of opening too many issues, so I'll post it here. The context information does not seem to be doing its intended job even in conversational (gptel) buffers.

Below are the results in two newly created gptel buffers: I created a gptel buffer, asked two questions, then killed the buffer, created a new one, and asked the same questions in the opposite order. Note how the conversation gets stuck on the first response and does not recover, even though the model is capable of answering the second question on its own. From my point of view, the Ollama context does not work correctly, or at least not for all models, which makes it unpredictable.

(Screenshots: SCR-20240421-khuw, SCR-20240421-kicy)

karthink commented 2 months ago

@jwr Could you switch to the ollama-chat branch and try using Ollama? I moved gptel-ollama over to the (new-ish) Ollama chat API, so all issues in this thread should be fixed. There is no longer a gptel--ollama-context variable. It should now be stateless and function exactly like the OpenAI API does.

Please check with the dry run options to be sure that the prompt looks like what you would expect. I can merge it into master after some testing.
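
For reference, a backend definition along these lines should work on the branch; the host and model names here are placeholders for whatever you run locally:

```elisp
;; Register an Ollama backend and make it the default. Host and model
;; names are examples; list whatever models you have pulled.
(setq gptel-backend
      (gptel-make-ollama "Ollama"
        :host "localhost:11434"
        :stream t
        :models '("gemma:7b-instruct-q8_0" "mistral:latest")))
```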

karthink commented 2 months ago

The only disadvantage is that you need a recent version of Ollama installed (0.25 or higher), and gptel won't work with older versions any more.

jwr commented 2 months ago

The context-related problems are gone 👍 and I can also switch between models without anything breaking or doing unexpected things. Big improvement there! 🙂

I still struggle with sending what I actually want to send. I generated some text into *scratch* and wanted to use it as input, but gptel fights me there, and not very consistently either: notice, for example, where the directive is shown when I don't have a region active versus where it is shown when I have selected the text I'd like to send (see screenshots). And even with the region active I can't get a meaningful answer from the LLM, because the region gets sent as an "assistant" message.

The assumption that whatever an LLM generated is invisibly marked as a response, and can't be treated as part of my input in subsequent queries, doesn't hold for me, and I would argue it doesn't make much sense in non-conversational buffers. Also, even if this functionality were useful, the current UI does not indicate which text gets sent and how. In practical terms, this means that when I work with gptel I regularly have to kill my buffers and re-create them to get rid of the invisible annotations. But I guess that's off-topic for this issue. Most importantly, the context problems are gone!

(Screenshots: SCR-20240423-hyhz, SCR-20240423-hyfg)

karthink commented 2 months ago

@jwr thank you for testing! I've merged it into master, you can switch back now.

The context/conversation details can be discussed in #291; I'll close this issue now.