Chinese input will lead an error

YijiaZHONG commented 1 year ago

System Info

MAC OS 13.1 13.1 22C65 Python3.11

Information

[X] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[X] backend
[ ] bindings
[ ] python-bindings
[ ] chat-ui
[ ] models
[ ] circleci
[ ] docker
[ ] api

Reproduction

By using below code:

import gpt4all gptj = gpt4all.GPT4All("ggml-gpt4all-j-v1.3-groovy") messages = [{"role": "user", "content": "你好"}] gptj.chat_completion(messages)

It returns error like:

...python3.11/site-packages/gpt4all/pyllmodel.py", line 204, in _response_callback print(response.decode('utf-8')) ^^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1: unexpected end of data

Expected behavior

Should work properly

cosmic-snow commented 1 year ago

My guess is that error means the model itself didn't encode it properly in UTF-8. I don't really know about the capabilities for Chinese in the models, although previously, I've seen mpt-7b-chat output some Chinese (maybe that one is better?).

Now if you want to experiment with it, since this is Python, you can just go and edit the line where the error happens. You can try:

using a different encoding, such as UTF-16
making the UTF-8 decoding process more tolerant by changing it to:
```
print(response.decode('utf-8', errors='ignore'))  # alternatively: errors='replace'
```
see: https://docs.python.org/3.11/library/stdtypes.html#bytes.decode

cosmic-snow commented 1 year ago

Is it to change gptj = GPT4All ("ggml-gpt4all-j-v1.3-groovy") to gptj = GPT4All("mpt-7b-chat", model_type="mpt")?

I haven't used the Python bindings myself, just the GUI, but yes that looks about right. Of course, you'll have to download that model separately.

doudou-itachi commented 1 year ago

是否要将 gptj = GPT4All （“ggml-gpt4all-j-v1.3-groovy”）更改为 gptj = GPT4All（“mpt-7b-chat”， model_type=“mpt”）？

我自己没有使用过 Python 绑定，只是使用 GUI，但是是的，这看起来是正确的。当然，您必须单独下载该模型。

ok，I see some model names by list_models() this function

cosmic-snow commented 1 year ago

Ah, actually, when looking in my file browser the file name is: ggml-mpt-7b-chat.bin

doudou-itachi commented 1 year ago

啊，实际上，在我的文件浏览器中查找时，文件名是：ggml-mpt-7b-chat.bin You can take a look at it based on the official example, with .bin removed from the code ... GPT4All("ggml-gpt4all-j-v1.3-groovy").list_models()

YijiaZHONG commented 1 year ago

tried to errors='ignore' and get the same error

print(response.decode('utf-8'), errors='replace') ^^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1: unexpected end of data

==================================== Not sure whether the response send out the data in a right encoding. If I change the code to print response, it returns:

Prompt:

你好

Response:

b' \xe6'b'\x82'b'\xa8'b'\xe5\xa5'b'\xbd'b'\xe3\x80\x82'b'\xe6\x88'b'\x91'b'\xe4\xb9'b'\x9f'b'\xe6\x98\xaf'b'\xe8\xbf'b'\x99'b'\xe6\xa0'b'\xb7'b'\xe7\x9a\x84'b'\xe4\xba\xba'b'\xe3\x80\x82'

cosmic-snow commented 1 year ago

print(response.decode('utf-8'), errors='replace')

is incorrect. try:

print(response.decode('utf-8', errors='replace'))

cosmic-snow commented 1 year ago

I get for the first bytes string you posted with:

errors= 'replace': 您好。我也是这�
errors= 'ignore': 您好。我也是这

So that works for me.

YijiaZHONG commented 1 year ago

Yes, now it works

cosmic-snow commented 1 year ago

Oh, I didn't notice it earlier and thought this was the first bytes string: b' \xe6'b'\x82'b'\xa8'b'\xe5\xa5'b'\xbd'b'\xe3\x80\x82'b'\xe6\x88'b'\x91'b'\xe4\xb9'b'\x9f'b'\xe6\x98\xaf'b'\xe8\xbf'b'\x99'b'\xe6\xa0'

But it's actually many individual ones: b' \xe6' b'\xa8' b'\xe5\xa5' ... I just pasted that into a Python console to see, and Python automatically concatenates them together when doing that.

So what you probably want to do instead of printing the individual responses right away is to collect the full response into one big bytes string and then decode that. That should help with when individual Unicode code points are "cut in the middle".

doudou-itachi commented 1 year ago

是的，现在它可以工作了

But the results seemed to be different from what I expected

cosmic-snow commented 1 year ago

But the results seemed to be different from what I expected

What do you mean? Also, did you just use errors='replace' or errors='ignore'?

doudou-itachi commented 1 year ago

The model I use is "ggml-gpt4all-j-v1.3-groovy", and under the premise of changing print(response.decode('utf-8')) to print(response.decode('utf-8', errors='ignore')), I ask a question about python The answer is Python people. Python "Hello World!". ![Uploading QQ截图20230523191937.png…]()

cosmic-snow commented 1 year ago

Try a different model then. It depends a lot on what input a model was trained on. I don't know how much Chinese went into groovy. Also, I'm not a Chinese speaker, either, so I can't really tell.

Maybe try mpt-7b-chat.

Or maybe even wizardLM-7B.q4_2. That one says it was created by people from Microsoft and the University of Beijing. I haven't tried that one myself yet, though.

doudou-itachi commented 1 year ago

OK thank you Model "mpt-7b-chat" I also tried, the problem is the same, there is garbled characters

cosmic-snow commented 1 year ago

As I said before:

Also, did you just use errors='replace' or errors='ignore'?

And what I meant in https://github.com/nomic-ai/gpt4all/issues/695#issuecomment-1559057008

It just gives back raw bytes in chunks, but not all raw bytes are valid Unicode characters. So for example, 是 encoded in UTF-8 bytes is b'\xe6\x98\xaf'. But if one response is b'\xe6\x98' and the second one is b'\xaf', you won't get the right result when using decode() on them individually. You first have to put everything back together again, so that you have b'\xe6\x98\xaf'.decode('utf-8')

doudou-itachi commented 1 year ago

Yes, I also saw it when I was debugging, he is divided, so even if it is applicable in some cases, but if the content of the question is changed, there may still be a problem, and if the model is changed, it may not be compatible, so I haven't thought of a better solution yet

cosmic-snow commented 1 year ago

I'll have to play around with the Python bindings myself (but not right now, haven't set it up yet).

But basically what you have to do is wait for it to be finished with the response, then concatenate all the bytes, and only then decode and print it. Not sure if there is an easy way to do that. Maybe the API needs a callback "done with the response", I don't know.

doudou-itachi commented 1 year ago

For the time being, I want to understand all the models first, and then choose the right model to debug and take a look

cosmic-snow commented 1 year ago

Actually, having a closer look with the example and the code, I think you can do something like:

Replace the DualStreamProcessor with an io.BytesIO so we just collect bytes without trying to convert to Unicode: https://github.com/nomic-ai/gpt4all/blob/8e705d730d6240e4519e4a090f459a471443458f/gpt4all-bindings/python/gpt4all/pyllmodel.py#L198 replace with:

# stream_processor = DualStreamProcessor() 
import io
stream_processor = io.BytesIO()

Replace the line in _response_callback that would cause an error without errors='...' and just use raw bytes:

# print(response.decode('utf-8', errors='replace'))
print(response)  # now just writes to the BytesIO buffer

Then instead of returning stream_processor.output here (BytesIO doesn't have that): https://github.com/nomic-ai/gpt4all/blob/8e705d730d6240e4519e4a090f459a471443458f/gpt4all-bindings/python/gpt4all/pyllmodel.py#L232 do this:

# return stream_processor.output
stream_processor.seek(0)
return stream_processor.read()  # read all the bytes from the start

Finally, disable streaming in the example code, or what you're using to call the API and decode the bytes yourself:

messages = [{"role": "user", "content": "Name 3 colors"}]
response = gptj.chat_completion(messages, streaming=False)
print(response.decode('utf-8', errors='replace'))

Haven't tested that yet, but I'll install the Python bindings here and see if it works.

cosmic-snow commented 1 year ago

Edit 2023-06-03: This was done on an older version of the project (2023-05-23). Things have changed a bit since then, you might have to adapt some parts now or check out an old version instead.

Alright, I've tested it and this turned into quite a bit of a hack, but at least I made it work in the end. Here is my own example chat client:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from gpt4all import GPT4All

def main():
    # Retrieve model
    gptj = GPT4All("ggml-mpt-7b-chat.bin")

    # Run model on prompt
    messages = [{"role": "user", "content": "from now on, respond only in Chinese.\nHello"}]
    response = gptj.chat_completion(messages, streaming=False)
    print(response['choices'][0]['message']['content'].decode('utf-8', errors='replace'))

if __name__ == '__main__':
    main()

here are the changes I made in pyllmodel.py:

diff --git a/gpt4all-bindings/python/gpt4all/pyllmodel.py b/gpt4all-bindings/python/gpt4all/pyllmodel.py
index 6117c9f..8319fda 100644
--- a/gpt4all-bindings/python/gpt4all/pyllmodel.py
+++ b/gpt4all-bindings/python/gpt4all/pyllmodel.py
@@ -195,7 +203,9 @@ class LLModel:

         old_stdout = sys.stdout 

-        stream_processor = DualStreamProcessor()
+        #stream_processor = DualStreamProcessor()
+        import io
+        stream_processor = io.BytesIO()

         if streaming:
             stream_processor.stream = sys.stdout
@@ -229,7 +239,9 @@ class LLModel:
         # Force new line
         print()

-        return stream_processor.output
+        #return stream_processor.output
+        stream_processor.seek(0)
+        return stream_processor.read()  # read all the bytes from the start

     # Empty prompt callback
     @staticmethod
@@ -239,7 +251,8 @@ class LLModel:
     # Empty response callback method that just prints response to be collected
     @staticmethod
     def _response_callback(token_id, response):
-        print(response.decode('utf-8'))
+        #print(response.decode('utf-8', errors='replace'))
+        sys.stdout.write(response)  # now just writes to the BytesIO buffer
         return True

     # Empty recalculate callback

and I also had to comment out two lines in gpt4all.py:

diff --git a/gpt4all-bindings/python/gpt4all/gpt4all.py b/gpt4all-bindings/python/gpt4all/gpt4all.py
index f24ee22..243f197 100644
--- a/gpt4all-bindings/python/gpt4all/gpt4all.py
+++ b/gpt4all-bindings/python/gpt4all/gpt4all.py
@@ -211,8 +212,8 @@ class GPT4All():

         response = self.model.generate(full_prompt, streaming=streaming, **generate_kwargs)

-        if verbose and not streaming:
-            print(response)
+        #if verbose and not streaming:
+        #    print(response)

         response_dict = {
             "model": self.model.model_name,

That's of course anything but user friendly, but for now it's better than nothing.

If you want to try yourself but don't know what to do with those things:

Save the first one as example.py
For the other two, I'm assuming you cloned the repository with git. Save them in separate files, for example as changes1.diff and changes2.diff. Then go to the repository's base directory and run git apply changes1.diff changes2.diff. See this StackOverflow question for more information.
then run python3 example.py