oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Ctransformers with gptneox unicode bug #3576

Closed: mykeehu closed this issue 11 months ago

mykeehu commented 1 year ago

Describe the bug

I loaded a Hungarian-language GGML model (https://huggingface.co/TheBloke/PULI-GPT-3SX-GGML) with ctransformers using the gptneox model type. The model loaded successfully, but the accented characters produced errors:

gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'í'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token '│'
[... the same "unknown token" pairs ('├' followed by 'ę', 'í', '│' or 'Â') repeat for every accented character in the prompt; roughly 190 further lines trimmed ...]
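
Note on the garbled pairs above: each Hungarian accented vowel encodes to two UTF-8 bytes, and gpt_tokenize rejects each byte separately; the Windows console then renders those raw bytes under code page 852, the default OEM code page on Hungarian Windows, which turns the lead byte into '├'. This is a reading of the log, not an upstream diagnosis. A minimal Python sketch (standard library only) that reproduces the exact pairs:

for ch in "éáóö":
    # Split the character into its UTF-8 bytes and render each byte under
    # CP852, the way the Windows console displays the per-byte error output.
    pair = [bytes([b]).decode("cp852") for b in ch.encode("utf-8")]
    print(ch, "->", pair)

Running this prints é -> ['├', 'ę'], á -> ['├', 'í'], ó -> ['├', '│'] and ö -> ['├', 'Â'], matching the pairs in the log above line for line.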

Is there an existing issue for this?

Reproduction

  1. Download any GGML model from this link: https://huggingface.co/TheBloke/PULI-GPT-3SX-GGML
  2. Load the model with ctransformers, setting the model type to gptneox
  3. Type this question: "Tudod mi az az árvíztűrő tükörfúrógép?" (the standard Hungarian encoding-test phrase, "Do you know what a flood-resistant mirror drilling machine is?", which contains all the accented vowels)
  4. See the errors in the console; the answer is missing some of the accented vowels (see the standalone script after this list)
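
The same warnings can be reproduced without the web UI. A minimal sketch, assuming the ctransformers Python package and its tokenize/detokenize helpers, using the quantization file named in the log below:

from ctransformers import AutoModelForCausalLM

# Load the GGML model with the gptneox model type, as the web UI does.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/PULI-GPT-3SX-GGML",
    model_file="puli-gpt-3sx.ggmlv1.q4_1.bin",
    model_type="gptneox",
)

prompt = "Tudod mi az az árvíztűrő tükörfúrógép?"
# gpt_tokenize prints "unknown token" warnings while tokenizing, and the
# round-tripped text comes back with the accented vowels missing.
print(llm.detokenize(llm.tokenize(prompt)))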

Screenshot

[two screenshots attached]

Logs

Ez egy beszélgetés az asszisztensével. Ez egy számítógépes program, amelyet arra terveztek, hogy segítsen Önnek különböző feladatokban, például kérdésekre válaszoljon, ajánlásokat adjon, és segítsen a döntéshozatalban. Bármit kérdezhet tőle, amit csak akar, és ő mindent megtesz, hogy pontos és releváns információkat adjon Önnek.
Miki: Szia! Tudod mi az az árvíztűrő tükörfúrógép?
Assistant:
--------------------

gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'ę'
gpt_tokenize: unknown token '├'
gpt_tokenize: unknown token 'í'
[... the same "unknown token" pairs repeat for every accented character, as in the excerpt above; roughly 165 further lines trimmed ...]
Output generated in 6.61 seconds (3.78 tokens/s, 25 tokens, context 146, seed 1737163314)
2023-08-14 21:01:33 INFO:Loading TheBloke_PULI-GPT-3SX-GGML-41...
2023-08-14 21:01:33 INFO:ctransformers weights detected: I:\Textmodels\TheBloke_PULI-GPT-3SX-GGML-41\puli-gpt-3sx.ggmlv1.q4_1.bin
2023-08-14 21:01:36 INFO:Using ctransformers model_type: gptneox for I:\Textmodels\TheBloke_PULI-GPT-3SX-GGML-41\puli-gpt-3sx.ggmlv1.q4_1.bin
2023-08-14 21:01:36 INFO:Replaced attention with sdp_attention
2023-08-14 21:01:36 INFO:Loaded the model in 2.45 seconds.

System Info

i9-13900K
64 GB RAM
RTX 3060 12 GB
Windows 10

github-actions[bot] commented 11 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.