oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Random words get dropped when using ggml model to generate Chinese output #3750

Closed: Touch-Night closed this issue 1 year ago

Touch-Night commented 1 year ago

Describe the bug

All of the models generate English text without problems. However, when a ggml model is loaded with the llama.cpp loader, the generated Chinese text is visibly missing characters. I have tried several different models and other front-ends, so I am confident this is a problem in text-generation-webui.
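
For reference, here is a minimal, self-contained sketch (not taken from the webui code) of one plausible cause of this kind of symptom: each Chinese character is three bytes in UTF-8, and if a token boundary falls inside a character, decoding each token's bytes separately silently drops it. The byte chunks below are hypothetical:

```python
# -*- coding: utf-8 -*-
# Illustration only: how per-token UTF-8 decoding can drop CJK characters
# when a token boundary falls in the middle of a 3-byte character.
import codecs

text = "你好"
data = text.encode("utf-8")          # 6 bytes: 3 per character
chunks = [data[:4], data[4:]]        # hypothetical token byte chunks;
                                     # the second character straddles them

# Naive per-chunk decoding discards the incomplete bytes at the boundary.
naive = "".join(c.decode("utf-8", errors="ignore") for c in chunks)
print(naive)     # "你"  (the second character is lost)

# An incremental decoder buffers the partial sequence until it is complete.
dec = codecs.getincrementaldecoder("utf-8")()
streamed = "".join(dec.decode(c) for c in chunks)
streamed += dec.decode(b"", final=True)
print(streamed)  # "你好"
```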

Is there an existing issue for this?

Reproduction

Load a Chinese ggml model with the llama.cpp loader and chat with it in Chinese so that it responds in Chinese.

Screenshot

(Screenshots attached: IMG_20230830_152408, IMG_20230830_152417)

Logs

2023-08-30 15:16:29 INFO:llama.cpp weights detected: models\Chinese-Llama-2-7b.ggmlv3.q8_0.bin
2023-08-30 15:16:29 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\Chinese-Llama-2-7b.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 7196.46 MB (+ 1024.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 384 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
2023-08-30 15:16:29 INFO:Loaded the model in 2.10 seconds.

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =     5.36 ms /    27 runs   (    0.20 ms per token,  5036.37 tokens per second)
llama_print_timings: prompt eval time =  9809.39 ms /    38 tokens (  258.14 ms per token,     3.87 tokens per second)
llama_print_timings:        eval time =  4746.40 ms /    26 runs   (  182.55 ms per token,     5.48 tokens per second)
llama_print_timings:       total time = 14609.50 ms
Output generated in 14.94 seconds (0.94 tokens/s, 14 tokens, context 39, seed 1097171487)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =     2.39 ms /    11 runs   (    0.22 ms per token,  4610.23 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  1951.44 ms /    11 runs   (  177.40 ms per token,     5.64 tokens per second)
llama_print_timings:       total time =  1973.68 ms
Output generated in 2.32 seconds (4.32 tokens/s, 10 tokens, context 39, seed 1491365094)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =    11.96 ms /    60 runs   (    0.20 ms per token,  5018.40 tokens per second)
llama_print_timings: prompt eval time =  1835.95 ms /    27 tokens (   68.00 ms per token,    14.71 tokens per second)
llama_print_timings:        eval time = 10506.84 ms /    59 runs   (  178.08 ms per token,     5.62 tokens per second)
llama_print_timings:       total time = 12464.93 ms
Output generated in 12.81 seconds (3.20 tokens/s, 41 tokens, context 57, seed 483740782)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =    37.72 ms /   200 runs   (    0.19 ms per token,  5301.95 tokens per second)
llama_print_timings: prompt eval time =  2490.72 ms /    48 tokens (   51.89 ms per token,    19.27 tokens per second)
llama_print_timings:        eval time = 36205.38 ms /   199 runs   (  181.94 ms per token,     5.50 tokens per second)
llama_print_timings:       total time = 39114.64 ms
Output generated in 39.44 seconds (4.39 tokens/s, 173 tokens, context 119, seed 1236825292)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =     8.01 ms /    43 runs   (    0.19 ms per token,  5367.62 tokens per second)
llama_print_timings: prompt eval time =  3150.72 ms /   168 tokens (   18.75 ms per token,    53.32 tokens per second)
llama_print_timings:        eval time =  7716.85 ms /    42 runs   (  183.73 ms per token,     5.44 tokens per second)
llama_print_timings:       total time = 10952.67 ms
Output generated in 11.28 seconds (2.93 tokens/s, 33 tokens, context 292, seed 829804082)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9809.59 ms
llama_print_timings:      sample time =     0.39 ms /     2 runs   (    0.19 ms per token,  5154.64 tokens per second)
llama_print_timings: prompt eval time =  1570.48 ms /    24 tokens (   65.44 ms per token,    15.28 tokens per second)
llama_print_timings:        eval time =   185.22 ms /     1 runs   (  185.22 ms per token,     5.40 tokens per second)
llama_print_timings:       total time =  1759.40 ms
Output generated in 2.08 seconds (0.48 tokens/s, 1 tokens, context 324, seed 1203091822)

System Info

System: Windows 10 22H2 19045.3393
GPU: RTX 3050 Ti Laptop (Modified to 8GB VRAM)
Richard7656 commented 1 year ago

Use the new "GGUF" format instead of the old "GGML" format. I have tested a "GGUF" language model, and those specific Chinese characters are displayed correctly in the response, with no missing characters.

"GGUF" is a new format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for "GGML", which is no longer supported by llama.cpp.

Another option is to choose "llamacpp_HF" or "ctransformers" instead of "llama.cpp" as the model loader for the "GGML" file. This is useful because there are still far fewer "GGUF" files on Hugging Face than "GGML" files.
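
If it helps, here is a minimal sketch for checking Chinese output outside the webui with llama-cpp-python, assuming it is installed and a GGUF model file is available (the file name below is hypothetical):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical GGUF file path; substitute the actual converted model.
llm = Llama(model_path="models/Chinese-Llama-2-7b.Q8_0.gguf", n_ctx=2048)

# Print the full completion in one piece, so any missing characters would
# come from the model/loader rather than from streaming in the UI.
out = llm("请用中文简单介绍一下你自己。", max_tokens=64)
print(out["choices"][0]["text"])
```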

Touch-Night commented 1 year ago

Thank you for the information