Closed karthink closed 1 year ago
try using us-ascii
instead of utf-8
, that fixed the issue for my chatgpt request function. https://github.com/d1egoaz/dotemacs/blob/master/modules/diego-chatgpt.el
(url-request-data (encode-coding-string (json-encode
`(:model ,model
:messages [(:role "system" :content ,sys-content)
(:role "user" :content ,input)]
:temperature 0.7)) 'us-ascii)))
BTW, the answer to your question is: Emacs
I'm still not sure why utf-8 is not working, I only use it for english content so us-ascii might work, I've used accents and ñ, and it still works with 'us-ascii
If I wanted to work on this, what’s the easiest way to reproduce the bug? Just make any request that includes multibyte Unicode characters in the prompt?
I am particularly interested in being able to use Modern Greek characters in a prompt, and I believe but am not sure that those are multibyte.
I’m curious for instance if the answer might be as simple as to set the Content-Type to "application/json;charset=UTF-8" when making the request.
Thanks for taking a crack at it! Just so we're on the same page, this package now defaults to using curl (if available) so this error should not be an issue for most users.
If I wanted to work on this, what’s the easiest way to reproduce the bug?
The let
block in my opening post above is a good start.
Just make any request that includes multibyte Unicode characters in the prompt?
Yup, that'll do it. You can use the message in the opening post above as a test ("qual é o melhor editor, vim ou emacs?")
I’m curious for instance if the answer might be as simple as to set the Content-Type to "application/json;charset=UTF-8" when making the request.
This won't work. The problem is with the Emacs function url-http-create-request: (find-function 'url-http-create-request). The url library checks whether the length of the request string and the size in bytes of the string are the same. If these don't match up an error is thrown and the request is never sent. This is irrespective of the HTTP headers. See also Emacs Bug#23750.
Had a closer look at this. Was easily able to reproduce the problem. Did some basic log debugging and I’m pretty sure your code is preparing the request data payload correctly. At least, the functions length and string-bytes return the same number, and the request data payload string itself does look like what Unicode characters forced into byte string would look like. This suggests the problem is somehow due to passing in header values that cause the overall request to have some improperly prepared values, which causes the overall request to be wrong even if the request data payload is fine. If I can figure out the emacs debugger I’ll see if I can spot where this is happening.
is to call
encoding-coding-string
on the JSON request and sendutf-8
text, but this does not seem to work
FWIW, it works here with Emacs 28.2. Full self-contained test (code is verbose for illustration):
(require 'json)
(let* ((url-request-method "POST")
(url-request-extra-headers
'(("Content-Type" . "application/json")
("Authorization" . "Bearer sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")))
(request-content "qual é o melhor editor, vim ou emacs?")
(request-object `(:model "gpt-3.5-turbo"
:messages [(:role "user"
:content ,request-content)]))
(request-string (json-encode request-object))
(request-bytes (encode-coding-string request-string 'utf-8))
(url-request-data request-bytes)
(response-buffer (url-retrieve-synchronously "https://api.openai.com/v1/chat/completions"))
(response-bytes (with-current-buffer response-buffer
(goto-char 1)
(search-forward "\n\n")
(buffer-substring (point) (point-max))))
(response-string (decode-coding-string response-bytes 'utf-8))
(response-object (json-read-from-string response-string))
(response-choices (cdr (assoc 'choices response-object)))
(response-first-choice (elt response-choices 0))
(response-message (cdr (assoc 'message response-first-choice)))
(response-content (cdr (assoc 'content response-message))))
(message "%s" response-content))
try using
us-ascii
instead ofutf-8
,
That just converts non-ASCII characters to question marks. The model may be able to fill in the gaps with Latin-based languages but will fail completely with other languages.
That self-contained test snippet (updated with a valid API key of course) still does not work for me on Emacs 28.1 or Emacs 28.2 on Linux.
It prints nil, rather than failing with any error associated with multibyte encoding.
I notice it calls url-retrieve-synchronously
while the code we are debugging called url-retrieve
, which is asynchronous (https://github.com/karthink/gptel/blob/master/gptel.el#L400).
Moreover, from adding some logging to it, I suspect it's not passing the authorization information at all, since I get an error saying none were passed. This may be because the bound value url-request-extra-headers
, which is supposed to be passed through dynamic scope, is not being seen for one reason or another. I'm not sure why not.
It prints nil, rather than failing with any error associated with multibyte encoding.
Interesting. Just to confirm, is it the same if you run it with emacs -Q --batch -l test.el
?
I've tested https://github.com/karthink/gptel/issues/4#issuecomment-1488187219
with emacs -Q --batch -l test.el
and it works fine for me.
❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443
Como uma IA, não tenho opinião e não sou capaz de determinar qual o melhor editor, pois cada um tem seus prós e contras e depende das necessidades e preferências de cada usuário. Uma pesquisa sobre as funcionalidades de cada um e experimentá-los pode ajudar a decidir qual é o melhor para sua rotina de trabalho.
However, when you change the code to use url-retrieve
async (as @algal mentioned) it doesn't work:
(require 'json)
(let* ((url-request-method "POST")
(url-request-extra-headers
'(("Content-Type" . "application/json")
("Authorization" . "Bearer <your-token>")))
(request-content "qual é o melhor editor, vim ou emacs?")
(request-object `(:model "gpt-3.5-turbo"
:messages [(:role "user"
:content ,request-content)]))
(request-string (json-encode request-object))
(request-bytes (encode-coding-string request-string 'utf-8))
(url-request-data request-bytes))
(url-retrieve "https://api.openai.com/v1/chat/completions"
(lambda (_status callback &rest args)
(message (buffer-substring-no-properties (1+ url-http-end-of-headers) (point-max))))))
❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443
I've changed the code like below, and it works:
(require 'url)
(require 'json)
(defun my-response-handler (_status callback &rest args)
(message "🤖: %s" (buffer-substring-no-properties (1+ url-http-end-of-headers) (point-max))))
(let* ((api-key "<api-key>")
(url "https://api.openai.com/v1/chat/completions")
(model "gpt-3.5-turbo")
(url-request-method "POST")
(url-request-extra-headers `(("Content-Type" . "application/json")
("Authorization" . ,(format "Bearer %s" api-key))))
(request-content "qual é o melhor editor, vim ou emacs?")
(url-request-data (encode-coding-string
(json-encode `(:model ,model
:messages [(:role "user" :content ,request-content)]))
'utf-8)))
(url-retrieve url #'my-response-handler
'((lambda (result &rest args)
(message result))
nil)))
(sleep-for 10)
❯ emacs -Q --batch -l /tmp/test.el
Contacting host: api.openai.com:443
🤖: {"id":"xxxx","object":"chat.completion","created":1680113895,"model":"gpt-3.5-turbo-0301","usage":{"prompt_tokens":18,"completion_tokens":266,"total_tokens":284},"choices":[{"message":{"role":"assistant","content":"Como uma IA, eu não tenho preferência pessoal, mas posso lhe dizer que ambos os editores são populares e poderosos e têm suas próprias vantagens e desvantagens.\n\nVim é um editor de texto baseado em linha de comando e é conhecido por sua velocidade e eficiência. Ele é altamente configurável e pode ser personalizado para atender às suas necessidades, permitindo que você insira, exiba e edite o conteúdo do arquivo sem precisar tocar no mouse.\n\nPor outro lado, Emacs é um editor de texto e ambiente de desenvolvimento integrado (IDE) de código aberto e personalizável. É conhecido por sua flexibilidade e capacidade de extensão, tornando-o ideal para desenvolvedores que desejam personalizar completamente seu ambiente de trabalho. Emacs tem uma curva de aprendizado íngreme, mas um grande potencial para se tornar um ambiente de trabalho produtivo e personalizado.\n\nDito isto, a escolha entre Vim e Emacs é pessoal e dependerá em grande parte das suas necessidades e preferências pessoais. Ambos são excelentes editores de texto e podem ser usados para uma variedade de tarefas e projetos."},"finish_reason":"stop","index":0}]}
I wonder now if the issue is with some specific emacs version, or environment state 🤷
This is my emacs version:
GNU Emacs 30.0.50 (build 1, aarch64-apple-darwin22.3.0, NS appkit-2299.40 Version 13.2.1 (Build 22D68)) of 2023-03-26
Something interesting I found in my environment is that I have an issue retrieving the key via .authinfo.gpg
.
When I do that, I need to pass the bearer token encoded:
("Authorization" . ,(encode-coding-string(format "Bearer %s" api-key) 'utf-8))))
If I hardcode the key, I only need to do:
("Authorization" . ,(format "Bearer %s" api-key))))
I've pushed these changes to my (yes, another ChatGPT package) https://github.com/d1egoaz/c3po.el
I wonder if you can try that @algal ^
@d1egoaz This produced some progress! I am using a .authinfo
fille (not .authinfo.gpg
). And when I modified in gptel.el
the function gptel-api-key-from-auth-source
, so that instead of returning (funcall secret)
it returns (encode-coding-string (funcall secret) 'utf-8)
, then I became able to make a request successfully without relying on external curl. Hooray!
Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.
Why is this happening? I suspect it's because the lambda which is returned by auth-source itself returns a multibyte string. (I verified this.) This is odd since that string contains only ASCII characters, and its length equals its strings-bytes
count.
But that string needs to be be processed by encode-coding-string
, just like our payload data, or else the request string that is concatenated together by url-http-create-request
ends up being multibyte as well, and also ends up having its length not equal its string-bytes
count, perhaps because of some subtlety of how concat mixes these types.
@CyberShadow
It prints nil, rather than failing with any error associated with multibyte encoding.
Interesting. Just to confirm, is it the same if you run it with
emacs -Q --batch -l test.el
?
I tried again and this time your snippet worked, both in ielm and when run in batch mode. It's possible it's an intermittent issue but I'd bet I just made some kind of typo the first time when trying your snippet. Sorry for wasting your time!
Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.
@algal I had the same issue, and I needed to do: https://github.com/d1egoaz/c3po.el/blob/b81f4c48616932f375104419b3f2d1a74b5db28e/c3po.el#L88
I've been able to use emojies (utf-8 stress test) without issues, I'm pretty sure it should work now for any other case.
Unfortunately, the response is now decoded incorrectly. But there are no errors at the HTTP transport layer anymore.
@algal I had the same issue, and I needed to do: https://github.com/d1egoaz/c3po.el/blob/b81f4c48616932f375104419b3f2d1a74b5db28e/c3po.el#L88
I've been able to use emojies (utf-8 stress test) without issues, I'm pretty sure it should work now for any other case.
@d1egoaz You may be onto something. Your code reads the buffer to a data string, decodes the data strong to utf8, and then parses it as json. The GPTel code, which doesn't handle the response properly, goes in the other order: it parses the headers of the buffer, reads the rest as json, and then later decodes as utf8.
I've got a fix. PR incoming in a few hours.
@algal and @d1egoaz Thank you for debugging this issue and for the fix! Figuring this out is nontrivial -- may I add you to an acknowledgments section in the README?
@karthink Of course, feel free!
Updated README with acknowledgments.
By god this was hard to debug but I think I've found the actual culprit:
(dolist (x (list
"x"
(shell-command-to-string "printf x")
(encode-coding-string (shell-command-to-string "printf x") 'utf-8)))
(let ((s (concat x (encode-coding-string "é" 'utf-8))))
(message
"%S: %s(%s) %s, %s"
s
(multibyte-string-p s)
(multibyte-string-p x)
(string-bytes s)
(length s))))
Output:
"x\303\251": nil(nil) 3, 3
"x\303\251": t(t) 5, 3
"x\303\251": nil(nil) 3, 3
[!note]
url-request
will fail if(length request)
≠(string-bytes request)
Emacs has two kinds of strings: multibyte and unibyte. If you concat multiple strings, and any single one of them is multibyte, the entire result will be multibyte. The body of url-http-create-request
is basically one giant (concat ...)
call, including the HTTP headers. So if any header is multibyte, no matter its contents, it spoils the effect of encoding the request body. In my case I was getting an API key from a call to shell-command-to-string
, and the same was happening to @d1egoaz further up. Encoding that output, even if it's plain ASCII, ensures the actual string type is unibyte, which in turn ensures the final HTTP request is unibyte. This is why you have to encode the API key.
String literals without multibyte characters are automatically unibyte, which is why you won't trigger this bug if you paste your API key literally into the REPL. This can be quite confusing while debugging! Haha.
Thank you, @hraban! 🏆 Now I can finally put this behind me and move on. It's always been bothering me to not know the source of the problem.
Thank you so much!
Wow, thank you! This bug was driving me crazy! Had the same issue in ELAIZA (https://github.com/SFTtech/emacs-elaiza/commit/e0d088140d825bb7920f56443957c99c793ff9de)
The proposed solution for Emacs bug 23750 is to call
encoding-coding-string
on the JSON request and sendutf-8
text, but this does not seem to work. This bug and workaround is also explained here.url-http-create-request
still complains about the message length and message size (in bytes) not lining up.