h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

Does not support multilingual output #703

Closed babytdream closed 1 year ago

babytdream commented 1 year ago

image

Hello! Is there any improvement plan?

pseudotensor commented 1 year ago

This depends only on the model and your local encoding setup. Multilingual output works fine; e.g., llama2 models do quite well despite being trained mostly on English.

See for example: https://github.com/h2oai/h2ogpt/issues/552

You can see it work on https://gpt.h2o.ai just fine.

There are some subtle notes on prompting at https://github.com/h2oai/h2ogpt/issues/546, but this isn't usually required.

I think it's more likely that the locale settings on your system are not quite right for multilingual support.
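For anyone who wants to check the local encoding setup from Python, a small report like the one below may help. It is only a sketch using the standard library; the function name `encoding_report` is made up for illustration, and nothing here is specific to h2oGPT.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses for text I/O on this machine."""
    return {
        # Encoding Python uses by default when opening text files
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding of the terminal stream, if the stream exposes one
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Default str <-> bytes encoding (always utf-8 on Python 3)
        "default_encoding": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

If any of these values is not a UTF-8 variant, that would be a reasonable first thing to look at, although it does not prove the locale is the cause.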

babytdream commented 1 year ago

@pseudotensor This is my command: "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate.py --base_model=/data/model/h2ogpt-4096-llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True". How do I modify the locale settings? Can you explain it in detail? Thanks!

pseudotensor commented 1 year ago

I see. 70b is even better than the rest, but even 7b llama2 GGML gets this right:

image

pseudotensor commented 1 year ago

Both the original poster and you have the same issue. I'm unclear on what the problem is exactly.

I'm unsure if it's browser related etc.

What if you use this website: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat

and put in: "Translate this sentence to Chinese and Japanese: "Who are you?" and then answer it in those languages."

I get:

image

Does it work for you?

babytdream commented 1 year ago

At "https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat", it works.

image

pseudotensor commented 1 year ago

At least then your browser is ok, but something is off with your installation or settings locally. I'm unclear what it is.

What if you try that exact question in your local install of h2oGPT? Does it work or not? Does it only sometimes happen, or maybe only when doing query on docs, etc.?

lesong36 commented 1 year ago

I had the same issue on my local installation.

babytdream commented 1 year ago

I didn't upload any PDF. This is in my local h2oGPT, using the command "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate.py --base_model=/data/model/h2ogpt-4096-llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True": image

pseudotensor commented 1 year ago

Ok so some characters are rendered correctly, but others not.

pseudotensor commented 1 year ago

Are you using docker or bare metal when you run?

pseudotensor commented 1 year ago

@achraf-mer @ChathurindaRanasinghe Any thoughts? Not docker related, but you guys may have some insights. Thanks!

babytdream commented 1 year ago

Not docker. I use miniconda to create an environment and install "requirements.txt" in that environment.

pseudotensor commented 1 year ago

Linux, Windows, or Mac? I presume Linux?

In case it helps, here are all the installed packages on gpt.h2o.ai. Maybe something will ring a bell.

apt list --installed > apt_installed.log

gives: apt_installed.log

Font related:

fontconfig-config/focal,now 2.13.1-2ubuntu3 all [installed,automatic]
fontconfig/focal,now 2.13.1-2ubuntu3 amd64 [installed,automatic]
fonts-crosextra-caladea/focal,now 20130214-2 all [installed,automatic]
fonts-crosextra-carlito/focal,now 20130920-1 all [installed,automatic]
fonts-dejavu-core/focal,now 2.37-1 all [installed,automatic]
fonts-dejavu-extra/focal,now 2.37-1 all [installed,automatic]
fonts-dejavu/focal,now 2.37-1 all [installed,automatic]
fonts-liberation2/focal,now 2.1.0-1 all [installed,automatic]
fonts-liberation/focal,now 1:1.07.4-11 all [installed,automatic]
fonts-linuxlibertine/focal,now 5.3.0-4 all [installed,automatic]
fonts-lyx/focal,now 2.3.4.2-2 all [installed,automatic]
fonts-noto-core/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-extra/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-mono/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-ui-core/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-opensymbol/focal-updates,focal-security,now 2:102.11+LibO6.4.7-0ubuntu0.20.04.8 all [installed,automatic]
fonts-sil-gentium-basic/focal,now 1.102-1 all [installed,automatic]
fonts-sil-gentium/focal,now 20081126:1.03-2 all [installed,automatic]
fonts-ubuntu-console/focal,now 0.83-4ubuntu1 all [installed,automatic]
libfont-afm-perl/focal,now 1.20-2 all [installed,automatic]
libfontconfig1/focal,now 2.13.1-2ubuntu3 amd64 [installed,automatic]
libfontenc1/focal,now 1:1.1.4-0ubuntu1 amd64 [installed,automatic]
libxfont2/focal,now 1:2.0.3-1 amd64 [installed,automatic]
xfonts-base/focal,now 1:1.0.5 all [installed,automatic]
xfonts-encodings/focal,now 1:1.0.5-0ubuntu1 all [installed,automatic]
xfonts-utils/focal,now 1:7.7+6 amd64 [installed,automatic]

pseudotensor commented 1 year ago

And here's gpt.h2o.ai's locale setting, which doesn't seem helpful or related since it is all English:

(h2ollm) ubuntu@cloudvm:~/h2ogpt$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

babytdream commented 1 year ago

use in linux--ubuntu image

image

lesong36 commented 1 year ago

I tested on both Linux and Windows 11 with several models and got the same result as above.

pseudotensor commented 1 year ago

I was able to get this model to fail on my linux box:

python generate.py --base_model=TheBloke/Llama-2-7b-Chat-GPTQ --load_gptq="gptq_model-4bit-128g" --use_safetensors=True --prompt_type=llama2 --save_dir='7bgptq4bit'

image

babytdream commented 1 year ago

apt list --installed > apt_installed.log result: apt_installed.log

pseudotensor commented 1 year ago

I get the same problem with the 7b HF model:

image

So it seems gpt.h2o.ai has something installed that I don't have either.

pseudotensor commented 1 year ago

Shows related problem with same odd char: https://unix.stackexchange.com/questions/149348/how-to-display-chinese-characters-correctly-on-remote-red-hat-machine

pseudotensor commented 1 year ago

Also related: https://stackoverflow.com/questions/52101602/ubuntu-display-chinese-characters-encoding-issue

pseudotensor commented 1 year ago

I tried going to a fully default theme, just demo = gr.Blocks(), and it still happens. I also tried gradio_offline_level 1 and 2, and it still happens.

pseudotensor commented 1 year ago

\ufffd is the Unicode replacement character, which a decoder emits when the input cannot be decoded into a valid character.

On the gpt.h2o.ai system I can see normal-looking Unicode being generated, like:
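As a minimal illustration (not taken from h2oGPT's code), the Python sketch below shows how \ufffd can appear when a multi-byte UTF-8 character is decoded before all of its bytes are present. The sample character is arbitrary.

```python
# Illustration only: decode a UTF-8 character before all of its bytes are present.
REPLACEMENT = "\ufffd"

full_bytes = "你".encode("utf-8")   # three bytes: e4 bd a0
first_byte_only = full_bytes[:1]    # an incomplete sequence

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
broken = first_byte_only.decode("utf-8", errors="replace")
whole = full_bytes.decode("utf-8")

# Escaped forms, comparable to the \uXXXX sequences in JSON logs
print(broken.encode("unicode_escape"))
print(whole.encode("unicode_escape"))
```

The same bytes decode correctly once they are all available, which suggests looking at where the text is split rather than at fonts or rendering.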

Translate this sentence to Chinese and Japanese: \"Who are you?\" and then answer it in those languages.<|endoftext|><|answer|>", "text": "Sure, here are the translations:\n\n1. \"\u4f60\u662f\u8c01\uff1f\" (Chinese)\n2. \"\u304a\u308c\u304c\u8ab0\uff1f\" (Japanese)\n\nAnd here are the answers:\n\n1. \"\u6211\u662f\u4e9a\u9a6c\u900a\u8bed\u97f3\u63a7\u5236\u7cfb\u7edf\u3002\" (Chinese)\n2. \"\u79c1\u306f\u30a2\u30de\u30be\u30f3\u97f3\u58f0\u5236\u5fa1\u30b7\u30b9\u30c6\u30e0\u3067\u3059\u3002\

whereas on my local system I get those replacement codes:

Translate this sentence to Chinese and Japanese: \"Who are you?\" and then answer it in those languages. [/INST]", "text": " Sure, I'd be happy to help! Here are the translations of \"Who are you?\" into Chinese and Japanese:\n\nChinese: \ufffd\ufffd\u4f60 (zh\u00e8ng sh\u00ec n\u01d0)\nJapanese: \ufffd\ufffd\u3089\u306a\u308a (dare ka naru)\n\nNow, here are the answers in the same languages:\n\nChinese: \u6211\ufffd\ufffd.. (w\u01d2 ji\u00e0o...) (I am... )\nJapanese: \ufffd\ufffd... (Watashi wa...) (My name is... )

pseudotensor commented 1 year ago

I've tried installing tons of fonts and other packages in Ubuntu, still with no luck after rebooting etc.

The main other difference between https://gpt.h2o.ai and local is that gpt.h2o.ai uses a TGI server, while locally I'm using torch or GGML. From history.json it seems that the problem is not the UI but the actual generation.

pseudotensor commented 1 year ago

Seems torch related, because GGML does not have issues locally:

image

pseudotensor commented 1 year ago

openai locally also is fine:

image

pseudotensor commented 1 year ago

OK, fixed now. The issue was just the streamer's handling of char-by-char output:

image

DuuuDik commented 1 year ago

Great job with problem solving, @pseudotensor! But what would be the general recommendation for solving this issue? Avoid torch, use ggml?

FreshLucas-git commented 1 year ago

@pseudotensor I also came across the same problem. Could you describe in more detail how to fix this issue?

pseudotensor commented 1 year ago

@DuuuDik @FreshLucas-git It's fixed in h2ogpt; the commit is shown above. There's nothing you have to do except use the latest h2ogpt.

Mins0o commented 1 year ago

Hello, I am still having an output problem with Unicode characters. I am using an exllama model: TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GPTQ

I inserted a print line in utils_langchain.py#L31-L33 to observe the output from the chain:

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""
        # Debug print: show each token next to its escaped form
        print(f"streamer token input, unicode: {token}, {token.encode('unicode_escape')}")
        self.text_queue.put(token)

streamer token input, unicode:  ", b' "'

streamer token input, unicode: �, b'\\ufffd'
 "
streamer token input, unicode: �, b'\\ufffd'
 "�
streamer token input, unicode: , b''
 "��
streamer token input, unicode: 무, b'\\ubb34'
 "��
streamer token input, unicode:  , b' '
 "��무
streamer token input, unicode: �, b'\\ufffd'
 "��무
streamer token input, unicode: �, b'\\ufffd'

These are some of the internal variables:

stream_output: True
inference_server:
use_openai_model: False
model.is_exlama: True hasattr(model, 'is_exlama'): True
tokenizer: <src.llm_exllama.H2OExLlamaTokenizer object at 0x7f9153000100>
streamer: <utils_langchain.StreamingGradioCallbackHandler object at 0x7f913c3bd270>
langchaing_action: Query
use_template: True
not use_docs_planned: True 

The question was "한국어로 대답해줘" which means "Answer me in Korean"

Do you think this is because of the tokenizer? <src.llm_exllama.H2OExLlamaTokenizer object at 0x7f9153000100>

Mins0o commented 1 year ago

And when I use a GGML (llama-2-13b-chat.ggmlv3.q5_K_M.bin), it just omits the characters.

"��무 ��은"을 의미하는 "too many"은 "��무 ��이"라는 ����니다.

in ggml

"무 은"을 의미하는 "too many"은 "무 이"라는 니다.

Seems torch related, because GGML does not have issues locally:

image

Your comment above about GGML is also incorrect: the output is missing some characters.
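One way both symptoms could arise, offered here as an assumption rather than a confirmed diagnosis of either backend, is decoding each token's bytes separately: an error handler that replaces bad bytes produces the garbage marks, while one that ignores them silently drops the character. The token split below is hypothetical.

```python
# Hypothetical token stream: "너" split into three single-byte tokens, "무" as one token.
tokens = [bytes([b]) for b in "너".encode("utf-8")] + ["무".encode("utf-8")]


def decode_per_token(token_bytes, errors):
    """Decode every token on its own, as a naive streamer might."""
    return "".join(t.decode("utf-8", errors=errors) for t in token_bytes)


with_replace = decode_per_token(tokens, "replace")  # undecodable bytes become U+FFFD
with_ignore = decode_per_token(tokens, "ignore")    # undecodable bytes are dropped
joined = b"".join(tokens).decode("utf-8")           # decoding all bytes together works

print(with_replace.encode("unicode_escape"))
print(with_ignore)
print(joined)
```

In this sketch the character that survives is the one that arrived as a single complete token, which resembles the pattern in the two outputs above.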

Mins0o commented 1 year ago

h2oai/h2ogpt-gm-oasst1-multilang-2048-falcon-7b  streamer: <gen.H2OTextIteratorStreamer object at 0x7f8ebe3a8070>

This one works well.

韓国語で話しましょう!

こんにちは。私の名前はOpen Assistantです。何かお手元できるので、お試しください。

pseudotensor commented 1 year ago

This is gpt.h2o.ai. The answers are not good, but no problems rendering the characters.

image

pseudotensor commented 1 year ago

@Mins0o So your conclusion is that it's not actually rendering correctly but just dropping some characters?

pseudotensor commented 1 year ago

@Mins0o The issue I found was that the streamer was dumping the token even if it actually needed to wait and combine that token with another to render the correct character. I thought this was fixed, but if you have a well-defined case on gpt.h2o.ai that is incorrect, I can take a look.

Thanks!
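The "wait and combine" behavior described above can be sketched as follows. This is a minimal illustration, not h2oGPT's actual streamer: it assumes tokens arrive as raw UTF-8 byte chunks, and the name `stream_text` is hypothetical. Python's incremental decoder buffers an incomplete multi-byte sequence until the remaining bytes arrive.

```python
import codecs
from typing import Iterable, Iterator


def stream_text(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text only once the bytes form complete UTF-8 characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # returns "" while a character is incomplete
        if text:
            yield text
    # Flush at end of stream; a truncated trailing sequence becomes U+FFFD here.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Feed "안녕" one byte at a time, the worst case for a naive streamer.
chunks = [bytes([b]) for b in "안녕".encode("utf-8")]
pieces = list(stream_text(chunks))
print(pieces)
```

Decoding each one-byte chunk on its own would give a replacement character every time, whereas the buffered version emits only whole characters.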

Mins0o commented 1 year ago

This is gpt.h2o.ai. The answers are not good, but no problems rendering the characters.

image

These all look good.
I was running mine locally inside a docker container.

Mins0o commented 1 year ago

@Mins0o So your conclusion is that it's not actually rendering correctly but just dropping some characters?

For the ggml, yes. For exLlama models, it outputs the garbage code.
But now that I think about it, I am starting to suspect that it's because it's a non-h2oai model...?

I'm sorry to bother you when you look quite busy...
I will look into this more, since it interests me and I know Python to some degree. Any information you can give me would be a great help 😀
(Though I cannot put too much spare time into this 😥)

Mins0o commented 1 year ago

I found out that the decode in llm_exllama.py actually outputs the correct characters. https://github.com/h2oai/h2ogpt/blob/3879f0ea6e31b8d11cfc4391615cc47483067ca9/src/llm_exllama.py#L316

prompt:  안녕!!

0
stuff: [system prompt] USER: 안녕!! ASSISTANT:
cursor_head:176
cursor_tail: 177
text_chunk:

1
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안
cursor_head:177
cursor_tail: 178
text_chunk: 안
text1:  안

2
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안�
cursor_head:178
cursor_tail: 179
text_chunk: �
text1:  안�

3
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안��
cursor_head:179
cursor_tail: 180
text_chunk: �
text1:  안��

4
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안녕
cursor_head:180
cursor_tail: 179
text_chunk:
text1:  안��

5
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안녕하
cursor_head:179
cursor_tail: 180
text_chunk: 하
text1:  안��하

Look at number 4. I was paying attention to this part of the code and noticed a strange inversion between cursor_head and cursor_tail when the malfunction occurs.
You can also see that the decoder output is correct, but text_chunk comes out wrong.
https://github.com/h2oai/h2ogpt/blob/3879f0ea6e31b8d11cfc4391615cc47483067ca9/src/llm_exllama.py#L315-L319
I am not sure why, but maybe the "half" of certain Unicode characters confuses len()? I don't know yet, but I will investigate more.
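A possible explanation for the inversion is that the decoded string gets shorter when the placeholder marks collapse into one real character, so the old head ends up past the new tail. The sketch below reproduces that pattern under stated assumptions: `decode_like_log` is a hypothetical stand-in that shows one U+FFFD per leftover byte, and the cursor logic is a simplification, not the code in llm_exllama.py.

```python
import codecs


def decode_like_log(data: bytes) -> str:
    """Decode complete characters and show one U+FFFD per leftover byte (assumption)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = decoder.decode(data)       # incomplete trailing bytes stay buffered
    pending, _ = decoder.getstate()   # the bytes still waiting for completion
    return text + "\ufffd" * len(pending)


raw = "안녕".encode("utf-8")  # six bytes, three per character
head = 0
steps = []
for n in range(3, len(raw) + 1):
    text = decode_like_log(raw[:n])
    tail = len(text)
    # When the text shrinks, head > tail and the slice is empty.
    steps.append((head, tail, text[head:tail]))
    head = tail

for step in steps:
    print(step)
```

In the last step the head is larger than the tail and the chunk is empty, so the completed character is never passed on as a chunk.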

Mins0o commented 1 year ago

Ah, I know why, it's the same thing as you have mentioned.

dumping the token even if it actually needed to wait and combine that token with another to render the correct character.

Mins0o commented 1 year ago

> 안녕!!

>> 안녕하세요! 무엇을 도와드릴까요?

> 안녕!!

>> 안녕하세요! 무엇을 도와드릴까요?

> 你中文

>> 你好! 您需要什么帮助呢?

> 你也会中文吗

>> 是的,我可以用中文和英语来回答您的问题。请问您有什么问题我可以帮您解答?

> 日本語もしゃべられるの?

>>はい、日本語でも話せます。お手をどうぞ。

> 太好了!

>> よくできました!ありがとうございます。何かお力になることがありましょうか?

>「日」もし、わたしが日本語と【中】中文 [한] 그리고 한국어를 섞어서 「日」使えばどうするの?

>>「日」は、日本語で一般的な表現法です。中国語や韓国語でも同じように使えます。例えば、「今日は美味しいねんだっけ?」と言うと、中国人や韓国人にも理解さibleです。ただ、注意する必要があります。日本語で「日」を使う場合、他の言語ではその言語に限定された意味を持つ場合があります。例えば、中国語で「日」は「日子」を意味します。

I think I fixed it (kind of). While streaming, it still shows the broken characters along with the correct ones, and it corrects them after the response finishes. (fixed)

Mins0o commented 1 year ago

790

Mins0o commented 1 year ago

Added another condition to block pushing to the queue when the text_chunk is a broken unicode character. f31394a Do not put broken characters into the queue

# llm_exllama.py (near lines 11-15)
import os, glob

BROKEN_UNICODE = b'\\ufffd'.decode('unicode_escape')

class H2OExLlamaTokenizer(ExLlamaTokenizer):

    ...

# llm_exllama.py (near lines 352-354)
            elif status == self.MatchStatus.NO_MATCH:
                if text_callback and not (text_chunk == BROKEN_UNICODE):  # Here
                    text_callback(text_chunk)
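The same idea can be tried in isolation with a small wrapper. This is a standalone sketch, not the code from the commit: `make_filtered_callback` and the sample chunk sequence are made up for illustration.

```python
from typing import Callable, List

BROKEN_UNICODE = "\ufffd"


def make_filtered_callback(sink: Callable[[str], None]) -> Callable[[str], None]:
    """Wrap a sink so that a chunk consisting only of U+FFFD is not forwarded."""
    def callback(text_chunk: str) -> None:
        if text_chunk == BROKEN_UNICODE:
            return  # skip the broken placeholder
        sink(text_chunk)
    return callback


received: List[str] = []
push = make_filtered_callback(received.append)

# Hypothetical chunk sequence: one good character, two placeholders, an empty chunk, one good character
for chunk in ["안", "\ufffd", "\ufffd", "", "하"]:
    push(chunk)

print(received)
```

An equality check only catches chunks that are exactly one replacement character; testing `BROKEN_UNICODE in text_chunk` would be stricter, at the cost of also holding back any valid text in the same chunk.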

Mins0o commented 1 year ago

It looks beautiful! image

Translate these Korean sentences:

"
안녕!!

안녕하세요! 무엇을 도와드릴까요?

드디어 한국말을 유창하게하는구나!

그렇습니다! 언제든지 도움이 필요하시면 답변해드리겠습니다.

너무 반가워 ㅠㅠㅠ

근데 좀 더 자신감을 가지고 해주셔도 될거야! 그래도 죄송합니다만, 언제든지 도움이 필요하시면 답변해드리겠습니다.

그래그래

좋은 일이 있길 바랍니다. 어떤 질문이나 문제가 있으신가요? "

image

Yeah... this model is not for multilingual interaction... lol

pseudotensor commented 1 year ago

Awesome thanks!

Mins0o commented 1 year ago

btw, I didn't look into the local llama ggml yet... Maybe some time later...