This depends only on the model and your local encoding setup. Multilingual works fine; e.g., llama2 models do quite well despite being mostly English-trained.
See for example: https://github.com/h2oai/h2ogpt/issues/552
You can see it work on https://gpt.h2o.ai just fine.
There are some subtle notes on prompting in https://github.com/h2oai/h2ogpt/issues/546, but those are not usually required.
More likely, your system's locale settings are not set up to support multilingual text.
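For example, a quick sanity check from inside Python (a minimal sketch; it just inspects the process environment, nothing h2ogpt-specific):

```python
# Confirm the Python process itself sees a UTF-8-capable locale before
# suspecting the model or the UI.
import locale
import sys

print(sys.stdout.encoding)            # expect 'utf-8'
print(locale.getpreferredencoding())  # expect 'UTF-8'
print(locale.getlocale())             # e.g. ('en_US', 'UTF-8')
```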
@pseudotensor This is my command: `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate.py --base_model=/data/model/h2ogpt-4096-llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True`. How do I modify the locale settings? Can you explain it in detail, thanks!
I see; 70B is even better than the rest, but even the 7B llama2 GGML gets this right.
Both the original poster and you have the same issue. I'm unclear on what the problem is exactly.
I'm unsure whether it's browser-related, etc.
What if you use this website: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat
and put in: "Translate this sentence to Chinese and Japanese: "Who are you?" and then answer it in those languages."
I get correctly rendered Chinese and Japanese there.
Does it work for you?
At least then your browser is ok, but something is off with your installation or settings locally. I'm unclear what it is.
What if you try that exact question in your local install of h2oGPT? Does it work or not? Does it only sometimes happen, or maybe only when doing a query on docs, etc.?
I had the same issue on my local installation.
I didn't upload any PDF. This is how I run my local h2ogpt: `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python generate.py --base_model=/data/model/h2ogpt-4096-llama2-70b-chat/ --prompt_type=llama2 --use_gpu_id=False --share=True`
OK, so some characters are rendered correctly, but others are not.
Are you using docker or bare metal when you run?
@achraf-mer @ChathurindaRanasinghe Any thoughts? Not docker related, but you guys may have some insights. Thanks!
Not Docker. I used miniconda to create an environment and installed requirements.txt into it.
Linux, Windows, or Mac? I presume Linux?
In case it helps, here are all the packages installed on gpt.h2o.ai. Maybe something will ring a bell.
```
apt list --installed > apt_installed.log
```
gives the attached apt_installed.log.
Font-related:

```
fontconfig-config/focal,now 2.13.1-2ubuntu3 all [installed,automatic]
fontconfig/focal,now 2.13.1-2ubuntu3 amd64 [installed,automatic]
fonts-crosextra-caladea/focal,now 20130214-2 all [installed,automatic]
fonts-crosextra-carlito/focal,now 20130920-1 all [installed,automatic]
fonts-dejavu-core/focal,now 2.37-1 all [installed,automatic]
fonts-dejavu-extra/focal,now 2.37-1 all [installed,automatic]
fonts-dejavu/focal,now 2.37-1 all [installed,automatic]
fonts-liberation2/focal,now 2.1.0-1 all [installed,automatic]
fonts-liberation/focal,now 1:1.07.4-11 all [installed,automatic]
fonts-linuxlibertine/focal,now 5.3.0-4 all [installed,automatic]
fonts-lyx/focal,now 2.3.4.2-2 all [installed,automatic]
fonts-noto-core/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-extra/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-mono/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-noto-ui-core/focal-updates,now 20200323-1build1~ubuntu20.04.1 all [installed,automatic]
fonts-opensymbol/focal-updates,focal-security,now 2:102.11+LibO6.4.7-0ubuntu0.20.04.8 all [installed,automatic]
fonts-sil-gentium-basic/focal,now 1.102-1 all [installed,automatic]
fonts-sil-gentium/focal,now 20081126:1.03-2 all [installed,automatic]
fonts-ubuntu-console/focal,now 0.83-4ubuntu1 all [installed,automatic]
libfont-afm-perl/focal,now 1.20-2 all [installed,automatic]
libfontconfig1/focal,now 2.13.1-2ubuntu3 amd64 [installed,automatic]
libfontenc1/focal,now 1:1.1.4-0ubuntu1 amd64 [installed,automatic]
libxfont2/focal,now 1:2.0.3-1 amd64 [installed,automatic]
xfonts-base/focal,now 1:1.0.5 all [installed,automatic]
xfonts-encodings/focal,now 1:1.0.5-0ubuntu1 all [installed,automatic]
xfonts-utils/focal,now 1:7.7+6 amd64 [installed,automatic]
```
And here is gpt.h2o.ai's locale setting, which seems unhelpful/unrelated since it is all English:
```
(h2ollm) ubuntu@cloudvm:~/h2ogpt$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
```
I use it on Linux (Ubuntu).
I tested on both Linux and Windows 11 with several models and got the same result as above.
I was able to get this model to fail on my Linux box:

```
python generate.py --base_model=TheBloke/Llama-2-7b-Chat-GPTQ --load_gptq="gptq_model-4bit-128g" --use_safetensors=True --prompt_type=llama2 --save_dir='7bgptq4bit'
```
`apt list --installed > apt_installed.log` result: the attached apt_installed.log.
I get the same problem with the 7B HF model.
So it seems gpt.h2o.ai has something installed that I don't.
This shows a related problem with the same odd character: https://unix.stackexchange.com/questions/149348/how-to-display-chinese-characters-correctly-on-remote-red-hat-machine
I tried going to the fully default theme, just `demo = gr.Blocks()`, and it still happens. I also tried gradio_offline_level 1 and 2; it still happens.
`\ufffd` is the Unicode replacement character, emitted when a byte sequence cannot be decoded into a valid character.
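A minimal illustration of how it arises (plain Python, nothing h2ogpt-specific): a multi-byte UTF-8 character split across token boundaries cannot be decoded until all of its bytes arrive.

```python
# '안' (U+C548) is three bytes in UTF-8; decoding any strict prefix of
# those bytes yields the U+FFFD replacement character.
data = "안".encode("utf-8")  # b'\xec\x95\x88'

for i in range(1, len(data) + 1):
    print(i, data[:i].decode("utf-8", errors="replace"))
# 1 �
# 2 �
# 3 안
```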
On the gpt.h2o.ai system I can see normal-looking Unicode generated, like:

```
Translate this sentence to Chinese and Japanese: \"Who are you?\" and then answer it in those languages.<|endoftext|><|answer|>", "text": "Sure, here are the translations:\n\n1. \"\u4f60\u662f\u8c01\uff1f\" (Chinese)\n2. \"\u304a\u308c\u304c\u8ab0\uff1f\" (Japanese)\n\nAnd here are the answers:\n\n1. \"\u6211\u662f\u4e9a\u9a6c\u900a\u8bed\u97f3\u63a7\u5236\u7cfb\u7edf\u3002\" (Chinese)\n2. \"\u79c1\u306f\u30a2\u30de\u30be\u30f3\u97f3\u58f0\u5236\u5fa1\u30b7\u30b9\u30c6\u30e0\u3067\u3059\u3002\
```

whereas on my local system I get those `\ufffd` replacement codes:

```
Translate this sentence to Chinese and Japanese: \"Who are you?\" and then answer it in those languages. [/INST]", "text": " Sure, I'd be happy to help! Here are the translations of \"Who are you?\" into Chinese and Japanese:\n\nChinese: \ufffd\ufffd\u4f60 (zh\u00e8ng sh\u00ec n\u01d0)\nJapanese: \ufffd\ufffd\u3089\u306a\u308a (dare ka naru)\n\nNow, here are the answers in the same languages:\n\nChinese: \u6211\ufffd\ufffd.. (w\u01d2 ji\u00e0o...) (I am... )\nJapanese: \ufffd\ufffd... (Watashi wa...) (My name is... )
```
I've tried installing tons of font and other packages in Ubuntu; still no luck, even after rebooting, etc.
The main other difference between https://gpt.h2o.ai and local is that gpt uses the TGI server, while locally I'm using torch or GGML. From history.json it seems the problem is not the UI but the actual generation.
Seems torch-related, because GGML does not have issues locally.
OpenAI locally is also fine.
OK, fixed now. The issue was just the streamer's handling of char-by-char output: it could emit a partial multi-byte character before it was complete.
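Roughly, the idea (a sketch of the approach, not the literal commit; `tokenizer` and `emit` are placeholders, not h2ogpt APIs): re-decode the whole generated sequence at each step and hold back a chunk while it still contains a replacement character, since the next token usually completes it.

```python
REPLACEMENT = "\ufffd"

def stream_tokens(tokenizer, token_ids, emit):
    ids, shown = [], ""
    for tid in token_ids:
        ids.append(tid)
        text = tokenizer.decode(ids)  # full re-decode each step
        chunk = text[len(shown):]
        if chunk and REPLACEMENT not in chunk:
            emit(chunk)               # safe: no partial characters
            shown = text
        # otherwise wait: the next token should complete the character
```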
Great job with the problem solving, @pseudotensor! But what would be the general recommendation for solving this issue? Avoid torch and use GGML?
@pseudotensor I also came across the same problem. Could you describe in more detail how to fix this issue?
@DuuuDik @FreshLucas-git it's fixed in h2ogpt; the commit is shown above. There's nothing you have to do except use the latest h2ogpt.
Hello, I am still having an output problem with Unicode characters. I am using an exllama model: TheBloke/WizardLM-1.0-Uncensored-Llama2-13B-GPTQ.
I inserted a print line in utils_langchain.py#L31-L33 to observe the output coming from the chain:

```python
def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
    """Run on new LLM token. Only available when streaming is enabled."""
    print(f"streamer token input, unicode: {token}, {token.encode('unicode_escape')}")
    self.text_queue.put(token)
```
```
streamer token input, unicode: ", b' "'
streamer token input, unicode: �, b'\\ufffd'
"
streamer token input, unicode: �, b'\\ufffd'
"�
streamer token input, unicode: , b''
"��
streamer token input, unicode: 무, b'\\ubb34'
"��
streamer token input, unicode:  , b' '
"��무
streamer token input, unicode: �, b'\\ufffd'
"��무
streamer token input, unicode: �, b'\\ufffd'
```
These are some of the internal variables:

```
stream_output: True
inference_server:
use_openai_model: False
model.is_exlama: True, hasattr(model, 'is_exlama'): True
tokenizer: <src.llm_exllama.H2OExLlamaTokenizer object at 0x7f9153000100>
streamer: <utils_langchain.StreamingGradioCallbackHandler object at 0x7f913c3bd270>
langchain_action: Query
use_template: True
not use_docs_planned: True
```
The question was "한국어로 대답해줘", which means "Answer me in Korean".
Do you think this is because of the tokenizer, `<src.llm_exllama.H2OExLlamaTokenizer object at 0x7f9153000100>`?
And when I use a GGML model (llama-2-13b-chat.ggmlv3.q5_K_M.bin), it just omits the characters.

With exLlama:

```
"��무 ��은"을 의미하는 "too many"은 "��무 ��이"라는 ����니다.
```

With GGML:

```
"무 은"을 의미하는 "too many"은 "무 이"라는 니다.
```
> Seems torch related, because GGML does not have issues locally.

Your comment above about the GGML is also incorrect: it is missing some characters.
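My guess (an assumption on my part, not verified against llama.cpp internals) is that this is just the difference between replacing and dropping undecodable bytes:

```python
# A truncated multi-byte character: 2 of the 3 UTF-8 bytes of '안'.
data = "안".encode("utf-8")[:2]

print(data.decode("utf-8", errors="replace"))  # '�' -> exLlama-style garbage
print(data.decode("utf-8", errors="ignore"))   # ''  -> GGML-style omission
```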
With h2oai/h2ogpt-gm-oasst1-multilang-2048-falcon-7b (streamer: `<gen.H2OTextIteratorStreamer object at 0x7f8ebe3a8070>`), this one works well:

```
韓国語で話しましょう！
こんにちは。私の名前はOpen Assistantです。何かお手元できるので、お試しください。
```

This is gpt.h2o.ai. The answers are not good, but there are no problems rendering the characters.
@Mins0o So your conclusion is that it's not actually rendering correctly but just dropping some characters?
@Mins0o The issue I found was that the streamer was dumping the token even if it actually needed to wait and combine that token with another to render the correct character. I thought this was fixed, but if you have a well-defined case you can show on gpt.h2o.ai that is incorrect, I can take a look.
Thanks!
> This is gpt.h2o.ai. The answers are not good, but no problems rendering the characters.

These all look good.
I was running mine locally inside a docker container.
> @Mins0o So your conclusion is that it's not actually rendering correctly but just dropping some characters?

For the GGML, yes. For exLlama models, it outputs garbage characters.
But now that I think about it, I'm starting to wonder whether it's because it's a non-h2oai model...?
I'm sorry to bother you while you look quite busy...
I will look into this more, since it interests me, and I am reasonably capable with Python. Any information you can give me would be a great help 😀
(Though I cannot put too much spare time into this 😥)
I found that the decode in llm_exllama.py actually outputs the correct characters: https://github.com/h2oai/h2ogpt/blob/3879f0ea6e31b8d11cfc4391615cc47483067ca9/src/llm_exllama.py#L316
```
prompt: 안녕!!

0
stuff: [system prompt] USER: 안녕!! ASSISTANT:
cursor_head: 176
cursor_tail: 177
text_chunk:

1
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안
cursor_head: 177
cursor_tail: 178
text_chunk: 안
text1: 안

2
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안�
cursor_head: 178
cursor_tail: 179
text_chunk: �
text1: 안�

3
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안��
cursor_head: 179
cursor_tail: 180
text_chunk: �
text1: 안��

4
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안녕
cursor_head: 180
cursor_tail: 179
text_chunk:
text1: 안��

5
stuff: [system prompt] USER: 안녕!! ASSISTANT: 안녕하
cursor_head: 179
cursor_tail: 180
text_chunk: 하
text1: 안��하
```
Look at step 4. I was paying attention to this part of the code and noticed there is a weird inversion between `cursor_head` and `cursor_tail` when there is a malfunction. You can also see that the decoder output is correct, but the `text_chunk` errors out.
https://github.com/h2oai/h2ogpt/blob/3879f0ea6e31b8d11cfc4391615cc47483067ca9/src/llm_exllama.py#L315-L319
I am not sure why, but maybe the "half" of certain Unicode characters confuses the `len()` computation? I don't know yet, but I will investigate more.
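Actually, the length arithmetic can legitimately go backwards here (a small illustration, assuming the cursors are character offsets into the decoded string):

```python
# Decoding one MORE token can make the decoded string SHORTER: the two
# U+FFFD placeholders collapse into one real character once its bytes
# are complete. That is exactly the cursor inversion at step 4 above.
after_3_tokens = "안\ufffd\ufffd"  # what decode showed at step 3: '안��'
after_4_tokens = "안녕"            # what decode shows at step 4

print(len(after_3_tokens))  # 3
print(len(after_4_tokens))  # 2 -> cursor_tail lands behind cursor_head
```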
Ah, I know why; it's the same thing you mentioned:

> dumping the token even if it actually needed to wait and combine that token with another to render the correct character.
A multilingual test chat (`>` is me, `>>` is the model):

```
> 안녕!!
>> 안녕하세요! 무엇을 도와드릴까요?
> 안녕!!
>> 안녕하세요! 무엇을 도와드릴까요?
> 你中文
>> 你好! 您需要什么帮助呢?
> 你也会中文吗
>> 是的，我可以用中文和英语来回答您的问题。请问您有什么问题我可以帮您解答?
> 日本語もしゃべられるの?
>> はい、日本語でも話せます。お手をどうぞ。
> 太好了!
>> よくできました！ありがとうございます。何かお力になることがありましょうか?
> 「日」もし、わたしが日本語と【中】中文 [한] 그리고 한국어를 섞어서 「日」使えばどうするの?
>> 「日」は、日本語で一般的な表現法です。中国語や韓国語でも同じように使えます。例えば、「今日は美味しいねんだっけ?」と言うと、中国人や韓国人にも理解さibleです。ただ、注意する必要があります。日本語で「日」を使う場合、他の言語ではその言語に限定された意味を持つ場合があります。例えば、中国語で「日」は「日子」を意味します。
```
I think I fixed it (kinda). While streaming, it still showed the broken characters along with the correct ones, and only fixed them after the response finished. (fixed)
I added another condition to block pushing to the queue when the `text_chunk` is a broken Unicode character: f31394a "Do not put broken characters into the queue".
```python
# llm_exllama.py

# around line 11
import os, glob

BROKEN_UNICODE = b'\\ufffd'.decode('unicode_escape')

# around line 15
class H2OExLlamaTokenizer(ExLlamaTokenizer):
    ...

# around line 352
        elif status == self.MatchStatus.NO_MATCH:
            if text_callback and not (text_chunk == BROKEN_UNICODE):  # Here
                text_callback(text_chunk)
```
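A tiny self-contained check of that guard (a sketch outside h2ogpt, just demonstrating the comparison):

```python
# BROKEN_UNICODE decodes the escape sequence to the actual U+FFFD char.
BROKEN_UNICODE = b'\\ufffd'.decode('unicode_escape')  # == '\ufffd'

for text_chunk in ["안", "\ufffd", "녕"]:
    if not (text_chunk == BROKEN_UNICODE):
        print(text_chunk)  # prints 안 and 녕; the lone '�' chunk is dropped
```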
It looks beautiful!
Here is the Korean test conversation (rough translations in parentheses):

```
안녕!!
  (Hi!!)
안녕하세요! 무엇을 도와드릴까요?
  (Hello! How can I help you?)
드디어 한국말을 유창하게하는구나!
  (You finally speak Korean fluently!)
그렇습니다! 언제든지 도움이 필요하시면 답변해드리겠습니다.
  (That's right! I'll answer whenever you need help.)
너무 반가워 ㅠㅠㅠ
  (So glad to see you ㅠㅠㅠ)
근데 좀 더 자신감을 가지고 해주셔도 될거야! 그래도 죄송합니다만, 언제든지 도움이 필요하시면 답변해드리겠습니다.
  (But you could be a bit more confident! -- Still, my apologies, but I'll answer whenever you need help.)
그래그래
  (Sure, sure.)
좋은 일이 있길 바랍니다. 어떤 질문이나 문제가 있으신가요?
  (I hope good things come your way. Do you have any questions or problems?)
```
Yeah... this model is not for multilingual interaction... lol
Awesome thanks!
btw, I didn't look into the local llama ggml yet... Maybe some time later...
Hello! Is there any plan for further improvement?