s-kostyaev / ellama

Ellama is a tool for interacting with large language models from Emacs.

some words get lost when using a remote server #8

Closed · werelax closed this issue 8 months ago

werelax commented 9 months ago

First of all, thanks for this awesome package! I've been using it and enjoying it heavily :)

But I'm having a weird issue since the migration to using llm as a backend. Here is a full report:

Issue: Missing Words in Generations with a Remote Server after the Migration to the llm Backend

Description: After migrating to using llm as a backend, I noticed that when I change the host to a remote server, the generation output appears incomplete with some words missing.

Steps to Reproduce:

  1. Change the host to a remote server.
  2. Execute the generation.

Observed Behavior: The generated output on the remote server is missing some words as shown in this example:

[screenshot: generated output with words missing]

Additional Information:

Thank you for your attention to this matter and for the great package you've provided!

s-kostyaev commented 9 months ago

@ahyatt need your help here.

werelax commented 9 months ago

As a follow-up, I tested ellama locally on the remote server and observed a slightly different behavior. Instead of omitting random words, it consistently misses one or two words at the beginning of the generation:

[screenshot: generation missing the first word or two]

However, the generations work perfectly when executed through the ollama CLI client.

[screenshot: correct output from the ollama CLI]

ahyatt commented 9 months ago

I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.

ahyatt commented 9 months ago

BTW, @werelax when you say remote server, what precisely do you mean?

s-kostyaev commented 9 months ago

> I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.

Why do you think the problem is on the server side?

werelax commented 9 months ago

@ahyatt remote server = a server with an RTX 4090 running ollama.

I tried two scenarios:

  1. On my laptop, I changed the llm-ollama-host in ellama-provider to point to the server's address (roughly as sketched below).
  2. I logged into the server via SSH and ran ellama in the server's Emacs installation.
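
For concreteness, the change in scenario 1 was roughly along these lines (the address below is just a placeholder for the server's real one, and the model name is the one from my logs):

    ;; point the llm-ollama provider at the remote machine (placeholder address)
    (require 'llm-ollama)
    (setopt ellama-provider
            (make-llm-ollama :host "192.168.1.50"            ; placeholder
                             :chat-model "dolphin2.1-mistral"))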

Both scenarios resulted in corrupted generations. I believe the issue isn't with the server since everything worked fine with the previous version of ellama.

s-kostyaev commented 9 months ago

@ahyatt I can set up a test stand as a Docker container for you. It will work without ollama and will quickly reply to any request with the same ollama-formatted message. I bet the problem is in the code that parses ollama replies, but I can't check it right now.

werelax commented 9 months ago

@ahyatt some more info, in case it helps:

I logged the responses received from the server when generating. In this generation:

[screenshot: an ellama generation]

The server is sending all the words:

[screenshot: logged server responses containing all the words]

My guess is that llm-ollama--get-partial-chat-response is missing some of the JSON objects.

ahyatt commented 9 months ago

I figured out the issue - basically we aren't dealing with partial JSON responses correctly. I have a fix for Ollama, but I need to see if the same issue can hit my other providers. Once everything is working (should be tonight), I'll check everything in and create a new release.

ahyatt commented 9 months ago

Well, actually, I may have only figured out one issue; I can still reproduce the problem even after my fix. I also need to work more on my similar change to OpenAI. For this reason, although I checked in a change, I'm not cutting a release tonight, but will resume work on this tomorrow.

werelax commented 9 months ago

@ahyatt I've spent some time investigating, and I believe I've found the root of the issue. At its core, unexpected changes to the response parameter of llm-ollama--get-partial-chat-response are causing the trouble. The code assumes that this parameter only grows by having new content appended at its end, but that's not the case in reality. Let's break down the issues:

Problem 1

The initial issue lies in how ollama terminates its lines with ^M. This line ending prevents llm-request--content from stripping the headers after the initial call to llm-ollama--get-partial-chat-response. When you log the values of response, the first call appears as:

Content-Type: application/x-ndjson^M
Date: Tue, 31 Oct 2023 17:00:48 GMT^M
Transfer-Encoding: chunked^M
^M
6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M

This is because the expression in llm-request--content

    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            (search-forward "\n\n" nil t)
            (forward-line)
            (point))))

executes before url-http-end-of-headers has been bound, and search-forward doesn't find a match because the blank line separating the headers from the body ends in \r\n\r\n rather than \n\n. The net effect is that only the HTTP status line is removed from the server response, leaving the headers untouched.

Upon the second call to llm-ollama--get-partial-chat-response, url-http-end-of-headers is indeed bound, so the or expression yields a different result. Consequently, the value of the response parameter gets trimmed "from the left", and llm-ollama-last-position now points at the wrong place. Here's the response value from the second call:

6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}^M
^M
6e^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}^M

This leftward shift in the response content means llm-ollama-last-position now lands in the middle of the second or third JSON object. The fallout is a noticeable gap of a few missing words right after the first word in my ellama generations.

A potential solution might involve modifying the (search-forward) expression in llm-request--content. It doesn't seem to be a major change.
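
For instance, something along these lines might do it (just a sketch of the idea, not a tested patch):

    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            ;; also match a header block whose lines end in ^M (i.e. \r\n\r\n)
            (re-search-forward "\r?\n\r?\n" nil t)
            (point))))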

Problem 2

The subsequent problem is more complex. Responses from ollama arrive with Transfer-Encoding: chunked, as evidenced by the chunk-size numbers (6a, 6d, 6e, ...) interspersed among the JSON lines. That's understandable. However, after processing each chunk, url-http purges these numbers from the buffer. Watching the value of response in llm-ollama--get-partial-chat-response, the modification is evident:

{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.576940246Z","response":" from","done":false}^M
^M
69^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.583854197Z","response":" ","done":false}^M
^M

As a result, the initial set of numbers is stripped away. This leaves llm-ollama-last-position pointing a few characters too far ahead, past the opening curly brace { of the next object, and, as a consequence, the following generated word is skipped altogether.

If my understanding is correct, the entire streaming framework of llm hinges on the after-change-functions hook attached to the buffer returned by url-request. This setup is robust as long as url-request only appends content to the buffer. However, it breaks when the buffer content is altered in other ways, as is the case here.
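
To make the append-only assumption concrete, a consumer built on that hook looks conceptually like this (a simplified illustration, not llm's actual code):

    ;; Simplified illustration: remember how far we've read and, on every
    ;; change, hand everything after that point to the consumer.
    (defvar-local my/stream-pos 1)

    (defun my/stream-after-change (_beg _end _len)
      (let ((new-text (buffer-substring my/stream-pos (point-max))))
        (setq my/stream-pos (point-max))
        (message "received: %s" new-text)))
    ;; (add-hook 'after-change-functions #'my/stream-after-change nil t)

This only works while text is strictly appended. When url-http later deletes the chunk-length numbers, the unread text shifts to lower buffer positions and my/stream-pos skips right over it, which is exactly the missing-words symptom.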

I'm definitely not an expert (and I'm probably wrong), but I think the only solution to this is to use an external process with curl or similar to handle the streaming like the previous version of ellama was doing.

werelax commented 9 months ago

As a side comment, I've also noticed that llm-ollama--get-partial-chat-response gets called multiple times with the exact same response. I don't understand why.

ahyatt commented 9 months ago

@werelax thank you very much for this investigation - I noticed those numbers too, but didn't realize that they could be causing us to miss things. I think there are some simple things we can try, notably using markers in the response buffer. I'll try those today. About using curl: I believe curl is actually used by the url-http library already, but I haven't investigated the details.

As to why it gets called multiple times - I fixed one cause in the commit above. It's because, as you mention, the streamed request is doing more than just appending. This seems unique to ollama. I've changed the code to only pay attention to appends. The interesting thing I noticed last night is that now we're only called once with the same request, except when we miss a response, in which case we are called twice. Your problem statement doesn't seem to capture this aspect, but I bet some behavior related to what you describe is causing the same request to be sent twice. Perhaps once before the number is purged, and once after? But then why does it not happen more often?

werelax commented 9 months ago

@ahyatt from what I understand, url-http first inserts the raw response from the server into the response buffer, and then url-http-chunked-encoding-after-change-function does a second pass that processes the chunks (and removes the chunk-length numbers). The only way I can see to solve the issue using markers would be to move the marker only after the second pass is done. But the code of url-http-chunked-encoding-after-change-function is not easy to follow (at least not for me). This approach might end up being too complicated and brittle.

IMHO, using (make-process) to run curl and using a :filter function to process the output would be a faster and more robust way to handle the requests.
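
Something in this spirit, for instance (a hypothetical sketch, not ellama or llm code; the endpoint, model and function names are mine):

    ;; Hypothetical sketch of streaming through curl with a process filter.
    (require 'json)

    (defun my/ollama-curl-stream (prompt on-token)
      "Stream PROMPT to an ollama server via curl; call ON-TOKEN per token."
      (make-process
       :name "ollama-curl"
       :command (list "curl" "-s" "-N" "http://localhost:11434/api/generate"
                      "-d" (json-encode `(("model" . "dolphin2.1-mistral")
                                          ("prompt" . ,prompt))))
       :filter (lambda (_proc output)
                 ;; each complete line is one JSON object; a real implementation
                 ;; would also buffer lines that get split across chunks
                 (dolist (line (split-string output "\r?\n" t))
                   (let ((msg (ignore-errors
                                (json-parse-string line :object-type 'alist))))
                     (when msg
                       (funcall on-token (alist-get 'response msg))))))))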

As a side note, I don't think url-http is using curl; it relies on the Elisp-native url library. That's probably why all the low-level HTTP processing is done inside a buffer.

ahyatt commented 9 months ago

I think you probably are right about using curl. If possible, I'd like to use it via the plz GNU ELPA library, which should allow me to get rid of llm-request entirely. Whatever I do, this is a bigger change that I'll leave until later.

I've fixed the issue by dealing with things on a line-by-line basis: lines that aren't valid JSON are thrown away, and instead of keeping a buffer position around, we keep the number of the last parsed message. This seems to work well for me, but please try the latest commits yourself and see if they solve your problem. If they do, I'll release the fix.
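
Conceptually, the new parsing works something like this (a simplified sketch, not the committed llm-ollama code):

    (defvar-local my/last-parsed-count 0)

    (defun my/partial-chat-response (response)
      "Return the text of messages in RESPONSE that haven't been processed yet."
      (let ((count 0) (parts nil))
        (dolist (line (split-string response "\r?\n" t))
          ;; lines that aren't valid JSON (chunk sizes, blanks) are discarded
          (let ((msg (ignore-errors
                       (json-parse-string line :object-type 'alist))))
            (when msg
              (setq count (1+ count))
              (when (> count my/last-parsed-count)
                (push (alist-get 'response msg) parts)))))
        (setq my/last-parsed-count count)
        (apply #'concat (nreverse parts))))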

werelax commented 9 months ago

@ahyatt I've done some testing and everything worked fine. Thank you very much for the fix!

ahyatt commented 9 months ago

Cool, I'm going to make a new release and push it now. Thank you for reporting this and your excellent debugging!

s-kostyaev commented 9 months ago

@werelax thank you for the report and the debugging. @ahyatt thank you for the fast fix.