Multi-byte char handling with url-retrieve is broken

karthink commented 1 year ago

The proposed solution for Emacs bug 23750 is to call encoding-coding-string on the JSON request and send utf-8 text, but this does not seem to work. This bug and workaround is also explained here.

(let (((url-request-method "POST")
       (url-request-extra-headers
        '(("Content-Type" . "application/json")))
       (url-request-data
        (encode-coding-string
         (json-encode
          '(:messages [(:role "user"
                        :content "qual é o melhor editor, vim ou emacs?")]))
         'utf-8))))
  (url-retrieve-synchronously "https://api.openai.com/v1/chat/completions"))

url-http-create-request still complains about the message length and message size (in bytes) not lining up.

d1egoaz commented 1 year ago

try using us-ascii instead of utf-8, that fixed the issue for my chatgpt request function. https://github.com/d1egoaz/dotemacs/blob/master/modules/diego-chatgpt.el

         (url-request-data (encode-coding-string (json-encode
                                                  `(:model ,model
                                                           :messages [(:role "system" :content ,sys-content)
                                                                      (:role "user" :content ,input)]
                                                           :temperature 0.7)) 'us-ascii)))

BTW, the answer to your question is: Emacs

d1egoaz commented 1 year ago

I'm still not sure why utf-8 is not working, I only use it for english content so us-ascii might work, I've used accents and ñ, and it still works with 'us-ascii

algal commented 1 year ago

If I wanted to work on this, what’s the easiest way to reproduce the bug? Just make any request that includes multibyte Unicode characters in the prompt?

I am particularly interested in being able to use Modern Greek characters in a prompt, and I believe but am not sure that those are multibyte.

I’m curious for instance if the answer might be as simple as to set the Content-Type to "application/json;charset=UTF-8" when making the request.

karthink commented 1 year ago

Thanks for taking a crack at it! Just so we're on the same page, this package now defaults to using curl (if available) so this error should not be an issue for most users.

If I wanted to work on this, what’s the easiest way to reproduce the bug?

The let block in my opening post above is a good start.

Just make any request that includes multibyte Unicode characters in the prompt?

Yup, that'll do it. You can use the message in the opening post above as a test ("qual é o melhor editor, vim ou emacs?")

I’m curious for instance if the answer might be as simple as to set the Content-Type to "application/json;charset=UTF-8" when making the request.

This won't work. The problem is with the Emacs function url-http-create-request: (find-function 'url-http-create-request). The url library checks whether the length of the request string and the size in bytes of the string are the same. If these don't match up an error is thrown and the request is never sent. This is irrespective of the HTTP headers. See also Emacs Bug#23750.

algal commented 1 year ago

Had a closer look at this. Was easily able to reproduce the problem. Did some basic log debugging and I’m pretty sure your code is preparing the request data payload correctly. At least, the functions length and string-bytes return the same number, and the request data payload string itself does look like what Unicode characters forced into byte string would look like. This suggests the problem is somehow due to passing in header values that cause the overall request to have some improperly prepared values, which causes the overall request to be wrong even if the request data payload is fine. If I can figure out the emacs debugger I’ll see if I can spot where this is happening.

CyberShadow commented 1 year ago

is to call encoding-coding-string on the JSON request and send utf-8 text, but this does not seem to work

FWIW, it works here with Emacs 28.2. Full self-contained test (code is verbose for illustration):

(require 'json)
(let* ((url-request-method "POST")
       (url-request-extra-headers
        '(("Content-Type" . "application/json")
          ("Authorization" . "Bearer sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")))
       (request-content "qual é o melhor editor, vim ou emacs?")
       (request-object `(:model "gpt-3.5-turbo"
                    :messages [(:role "user"
                              :content ,request-content)]))
       (request-string (json-encode request-object))
       (request-bytes (encode-coding-string request-string 'utf-8))
       (url-request-data request-bytes)
       (response-buffer (url-retrieve-synchronously "https://api.openai.com/v1/chat/completions"))
       (response-bytes (with-current-buffer response-buffer
                 (goto-char 1)
                 (search-forward "\n\n")
                 (buffer-substring (point) (point-max))))
       (response-string (decode-coding-string response-bytes 'utf-8))
       (response-object (json-read-from-string response-string))
       (response-choices (cdr (assoc 'choices response-object)))
       (response-first-choice (elt response-choices 0))
       (response-message (cdr (assoc 'message response-first-choice)))
       (response-content (cdr (assoc 'content response-message))))
  (message "%s" response-content))

try using us-ascii instead of utf-8,

That just converts non-ASCII characters to question marks. The model may be able to fill in the gaps with Latin-based languages but will fail completely with other languages.

algal commented 1 year ago

That self-contained test snippet (updated with a valid API key of course) still does not work for me on Emacs 28.1 or Emacs 28.2 on Linux.

It prints nil, rather than failing with any error associated with multibyte encoding.

I notice it calls url-retrieve-synchronously while the code we are debugging called url-retrieve, which is asynchronous (https://github.com/karthink/gptel/blob/master/gptel.el#L400).

Moreover, from adding some logging to it, I suspect it's not passing the authorization information at all, since I get an error saying none were passed. This may be because the bound value url-request-extra-headers, which is supposed to be passed through dynamic scope, is not being seen for one reason or another. I'm not sure why not.

CyberShadow commented 1 year ago

It prints nil, rather than failing with any error associated with multibyte encoding.

Interesting. Just to confirm, is it the same if you run it with emacs -Q --batch -l test.el?

d1egoaz commented 1 year ago

I've tested https://github.com/karthink/gptel/issues/4#issuecomment-1488187219 with emacs -Q --batch -l test.el and it works fine for me.

❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443
Como uma IA, não tenho opinião e não sou capaz de determinar qual o melhor editor, pois cada um tem seus prós e contras e depende das necessidades e preferências de cada usuário. Uma pesquisa sobre as funcionalidades de cada um e experimentá-los pode ajudar a decidir qual é o melhor para sua rotina de trabalho.

However, when you change the code to use url-retrieve async (as @algal mentioned) it doesn't work:

(require 'json)
(let* ((url-request-method "POST")
       (url-request-extra-headers
        '(("Content-Type" . "application/json")
          ("Authorization" . "Bearer <your-token>")))
       (request-content "qual é o melhor editor, vim ou emacs?")
       (request-object `(:model "gpt-3.5-turbo"
                                :messages [(:role "user"
                                                  :content ,request-content)]))
       (request-string (json-encode request-object))
       (request-bytes (encode-coding-string request-string 'utf-8))
       (url-request-data request-bytes))
  (url-retrieve "https://api.openai.com/v1/chat/completions"
                (lambda (_status callback &rest args)
                  (message (buffer-substring-no-properties (1+ url-http-end-of-headers) (point-max))))))

❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443

I've changed the code like below, and it works:

(require 'url)
(require 'json)

(defun my-response-handler (_status callback &rest args)
  (message "🤖: %s" (buffer-substring-no-properties (1+ url-http-end-of-headers) (point-max))))

(let* ((api-key "<api-key>")
       (url "https://api.openai.com/v1/chat/completions")
       (model "gpt-3.5-turbo")
       (url-request-method "POST")
       (url-request-extra-headers `(("Content-Type" . "application/json")
                                    ("Authorization" . ,(format "Bearer %s" api-key))))
       (request-content "qual é o melhor editor, vim ou emacs?")
       (url-request-data (encode-coding-string
                          (json-encode `(:model ,model
                                                :messages [(:role "user" :content ,request-content)]))
                          'utf-8)))
  (url-retrieve url #'my-response-handler
                '((lambda (result &rest args)
                    (message result))
                  nil)))
(sleep-for 10)

❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443
🤖: {"id":"xxxx","object":"chat.completion","created":1680113895,"model":"gpt-3.5-turbo-0301","usage":{"prompt_tokens":18,"completion_tokens":266,"total_tokens":284},"choices":[{"message":{"role":"assistant","content":"Como uma IA, eu não tenho preferência pessoal, mas posso lhe dizer que ambos os editores são populares e poderosos e têm suas próprias vantagens e desvantagens.\n\nVim é um editor de texto baseado em linha de comando e é conhecido por sua velocidade e eficiência. Ele é altamente configurável e pode ser personalizado para atender às suas necessidades, permitindo que você insira, exiba e edite o conteúdo do arquivo sem precisar tocar no mouse.\n\nPor outro lado, Emacs é um editor de texto e ambiente de desenvolvimento integrado (IDE) de código aberto e personalizável. É conhecido por sua flexibilidade e capacidade de extensão, tornando-o ideal para desenvolvedores que desejam personalizar completamente seu ambiente de trabalho. Emacs tem uma curva de aprendizado íngreme, mas um grande potencial para se tornar um ambiente de trabalho produtivo e personalizado.\n\nDito isto, a escolha entre Vim e Emacs é pessoal e dependerá em grande parte das suas necessidades e preferências pessoais. Ambos são excelentes editores de texto e podem ser usados para uma variedade de tarefas e projetos."},"finish_reason":"stop","index":0}]}

I wonder now if the issue is with some specific emacs version, or environment state 🤷

This is my emacs version:

GNU Emacs 30.0.50 (build 1, aarch64-apple-darwin22.3.0, NS appkit-2299.40 Version 13.2.1 (Build 22D68)) of 2023-03-26

d1egoaz commented 1 year ago

Something interesting I found in my environment is that I have an issue retrieving the key via .authinfo.gpg.

When I do that, I need to pass the bearer token encoded:

("Authorization" . ,(encode-coding-string(format "Bearer %s" api-key) 'utf-8))))

If I hardcode the key, I only need to do:

("Authorization" . ,(format "Bearer %s" api-key))))

I've pushed these changes to my (yes, another ChatGPT package) https://github.com/d1egoaz/c3po.el

d1egoaz commented 1 year ago

I wonder if you can try that @algal ^

algal commented 1 year ago

@d1egoaz This produced some progress! I am using a .authinfo fille (not .authinfo.gpg). And when I modified in gptel.el the function gptel-api-key-from-auth-source, so that instead of returning (funcall secret) it returns (encode-coding-string (funcall secret) 'utf-8), then I became able to make a request successfully without relying on external curl. Hooray!

Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.

Why is this happening? I suspect it's because the lambda which is returned by auth-source itself returns a multibyte string. (I verified this.) This is odd since that string contains only ASCII characters, and its length equals its strings-bytes count.

But that string needs to be be processed by encode-coding-string, just like our payload data, or else the request string that is concatenated together by url-http-create-request ends up being multibyte as well, and also ends up having its length not equal its string-bytes count, perhaps because of some subtlety of how concat mixes these types.

algal commented 1 year ago

@CyberShadow

It prints nil, rather than failing with any error associated with multibyte encoding.

Interesting. Just to confirm, is it the same if you run it with emacs -Q --batch -l test.el?

I tried again and this time your snippet worked, both in ielm and when run in batch mode. It's possible it's an intermittent issue but I'd bet I just made some kind of typo the first time when trying your snippet. Sorry for wasting your time!

d1egoaz commented 1 year ago

Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.

@algal I had the same issue, and I needed to do: https://github.com/d1egoaz/c3po.el/blob/b81f4c48616932f375104419b3f2d1a74b5db28e/c3po.el#L88

I've been able to use emojies (utf-8 stress test) without issues, I'm pretty sure it should work now for any other case.

algal commented 1 year ago

Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.

@algal I had the same issue, and I needed to do: https://github.com/d1egoaz/c3po.el/blob/b81f4c48616932f375104419b3f2d1a74b5db28e/c3po.el#L88

I've been able to use emojies (utf-8 stress test) without issues, I'm pretty sure it should work now for any other case.

@d1egoaz You may be onto something. Your code reads the buffer to a data string, decodes the data strong to utf8, and then parses it as json. The GPTel code, which doesn't handle the response properly, goes in the other order: it parses the headers of the buffer, reads the rest as json, and then later decodes as utf8.

algal commented 1 year ago

I've got a fix. PR incoming in a few hours.

karthink commented 1 year ago

@algal and @d1egoaz Thank you for debugging this issue and for the fix! Figuring this out is nontrivial -- may I add you to an acknowledgments section in the README?

algal commented 1 year ago

@karthink Of course, feel free!

karthink commented 1 year ago

Updated README with acknowledgments.

hraban commented 5 months ago

By god this was hard to debug but I think I've found the actual culprit:

(dolist (x (list
            "x"
            (shell-command-to-string "printf x")
            (encode-coding-string (shell-command-to-string "printf x") 'utf-8)))
  (let ((s (concat x (encode-coding-string "é" 'utf-8))))
    (message
     "%S: %s(%s) %s, %s"
     s
     (multibyte-string-p s)
     (multibyte-string-p x)
     (string-bytes s)
     (length s))))

Output:

"x\303\251": nil(nil) 3, 3
"x\303\251": t(t) 5, 3
"x\303\251": nil(nil) 3, 3

[!note] url-request will fail if (length request) ≠ (string-bytes request)

Emacs has two kinds of strings: multibyte and unibyte. If you concat multiple strings, and any single one of them is multibyte, the entire result will be multibyte. The body of url-http-create-request is basically one giant (concat ...) call, including the HTTP headers. So if any header is multibyte, no matter its contents, it spoils the effect of encoding the request body. In my case I was getting an API key from a call to shell-command-to-string, and the same was happening to @d1egoaz further up. Encoding that output, even if it's plain ASCII, ensures the actual string type is unibyte, which in turn ensures the final HTTP request is unibyte. This is why you have to encode the API key.

String literals without multibyte characters are automatically unibyte, which is why you won't trigger this bug if you paste your API key literally into the REPL. This can be quite confusing while debugging! Haha.

d1egoaz commented 4 months ago

hraban

Thank you, @hraban! 🏆 Now I can finally put this behind me and move on. It's always been bothering me to not know the source of the problem.

Thank you so much!

AlessandroW commented 3 months ago

Wow, thank you! This bug was driving me crazy! Had the same issue in ELAIZA (https://github.com/SFTtech/emacs-elaiza/commit/e0d088140d825bb7920f56443957c99c793ff9de)

karthink / gptel

Multi-byte char handling with url-retrieve is broken #4