oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Gibberish on Mistral when n_ctx is >=32768, gguf #4933

Closed · DutchEllie closed this 6 months ago

DutchEllie commented 9 months ago

Describe the bug

I am on the dev branch right now! Very important to note.

I loaded mistral-7b-instruct-v0.1.Q5_K_M.gguf and mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf using llama.cpp, offloading some layers onto my RX 7900 XTX. When n_ctx is set to 32768 (or presumably higher), the chat output is gibberish. The same thing happens with ChatML and Alpaca templates in the default tab, and it keeps happening everywhere until I reload the model. Strangely enough, in the default tab it works when no system prompt is provided, but breaks as soon as a system prompt is added. Also, even though I set n_ctx to this number, I am not actually providing anywhere near that many tokens; I am working with just a couple hundred tokens at most.

Anyway, setting the n_ctx to anything lower than 32768 will restore functionality.

This does not seem to happen with a different model, though: I loaded up TinyLlama, set its n_ctx to 32768, sent similar prompts, and it works fine.

Is there an existing issue for this?

Reproduction

  1. Be using AMD RX 7900 XTX on Arch Linux
  2. Install dev branch like you normally would (using requirements_amd.txt)
  3. Load mistral-7b-instruct-v0.1.Q5_K_M.gguf and set the n_ctx to 32768.
  4. Open the chat and talk to the AI. You can also go to the default tab, load a ChatML template and let it generate. Same results.
  5. After this has been done once, every future attempt at generating, even with prompts that normally don't break the AI, will result in gibberish. That is, until the model is reloaded.
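
For what it's worth, the same load can be exercised outside the web UI to rule the UI out. This is only a minimal sketch, assuming llama-cpp-python (the library the web UI's llama.cpp loader wraps) is installed with ROCm support; the model path, prompt, and n_gpu_layers value are placeholders, not values taken from this report:

```python
# Standalone reproduction sketch (assumes llama-cpp-python with ROCm/HIP
# support; the path, prompt, and layer count below are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.1.Q5_K_M.gguf",
    n_ctx=32768,      # the context size that reportedly triggers the gibberish
    n_gpu_layers=20,  # partial GPU offload, as in the report
)

out = llm(
    "[INST] Write one sentence about llamas. [/INST]",  # Mistral-instruct prompt format
    max_tokens=64,
)
print(out["choices"][0]["text"])
```

If the output is already garbled here at n_ctx=32768 but clean at a lower value such as 8192, the problem sits below the web UI.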

Screenshot

[screenshot attachment not reproduced here]

Logs

llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2167.35 MiB
llama_new_context_with_model: VRAM scratch buffer: 2164.04 MiB
llama_new_context_with_model: total VRAM used: 19346.72 MiB (model: 17182.69 MiB, context: 2164.04 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-12-15 09:17:42 INFO:LOADER: llama.cpp
2023-12-15 09:17:42 INFO:TRUNCATION LENGTH: 32768
2023-12-15 09:17:42 INFO:INSTRUCTION TEMPLATE: Alpaca
2023-12-15 09:17:42 INFO:Loaded the model in 7.21 seconds.

llama_print_timings:        load time =     911.83 ms
llama_print_timings:      sample time =       3.30 ms /    13 runs   (    0.25 ms per token,  3934.62 tokens per second)
llama_print_timings: prompt eval time =     911.61 ms /    46 tokens (   19.82 ms per token,    50.46 tokens per second)
llama_print_timings:        eval time =     897.69 ms /    12 runs   (   74.81 ms per token,    13.37 tokens per second)
llama_print_timings:       total time =    1849.40 ms
Output generated in 2.22 seconds (4.95 tokens/s, 11 tokens, context 46, seed 1479347654)

Nothing special, really.
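
As a side note on the numbers above: the 4096.00 MiB KV-cache figure is what an f16 cache at n_ctx = 32768 should cost for Mistral 7B's shape. A quick back-of-the-envelope check, assuming the published config values (32 layers, 8 KV heads, head dim 128), which are not printed in this log:

```python
# Rough KV-cache size check for the log line above, assuming Mistral 7B's
# published grouped-query-attention config and an f16 cache.
n_ctx      = 32768
n_layer    = 32
n_kv_heads = 8
head_dim   = 128
bytes_f16  = 2

k_bytes = n_ctx * n_layer * n_kv_heads * head_dim * bytes_f16
v_bytes = k_bytes  # V cache has the same shape as K

print(f"K:  {k_bytes / 2**20:.2f} MiB")                    # 2048.00 MiB
print(f"V:  {v_bytes / 2**20:.2f} MiB")                    # 2048.00 MiB
print(f"KV: {(k_bytes + v_bytes) / 2**20:.2f} MiB")        # 4096.00 MiB
```

The numbers line up with the log, so the cache itself is sized as expected for that context length.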

System Info

OS: Arch Linux 6.6.6-arch1-1  
GPU: AMD RX 7900 XTX  
Ooba branch: `dev`

RandomLegend commented 9 months ago

Maybe because you use the Alpaca instruction template?

DutchEllie commented 9 months ago

It happens using ChatML as well.

complexinteractive commented 9 months ago

Have you ever successfully used your Mixtral K-quants? [People have been having a lot of problems with them, including me.](https://cdn-uploads.huggingface.co/production/uploads/63ab1241ad514ca8d1430003/TvjEP14ps7ZUgJ-0-mhIX.png) I would suggest trying a Q5_0 or something similar, as those quants seem to be working fine. I would also suggest trying the 4x7b model, as I have not had nearly as many headaches with it.

DutchEllie commented 9 months ago

I have only used K-quants so far; even though I'm having these issues and lowering the context size fixes them, I have not yet tried the non-K versions. I guess I will do that. I never checked the quality of the generated text, so I don't know whether it's any good.

Will check when I have the time

nonetrix commented 8 months ago

Same here

Edit: fixed for me now

github-actions[bot] commented 6 months ago

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.