oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Slow response times and completely garbled outputs after updating to latest #2578

Closed: Gyramuur closed this issue 1 year ago

Gyramuur commented 1 year ago

Describe the bug

Hi! Originally I was having a bit of a problem trying to run local 13B models. I have 32 GB of RAM, 8 GB of VRAM, and an AMD Ryzen 7 3800 (8 cores at 3.9 GHz).

Since I do not have enough VRAM to run a 13B model, I decided to use GGML with GPU offloading via the --n-gpu-layers option. Recently I updated Oobabooga, and in doing so had to re-enable GPU acceleration by reinstalling llama-cpp-python, following this page: https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md#gpu-acceleration
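
(For reference, as far as I can tell the n-gpu-layers setting ends up as llama-cpp-python's n_gpu_layers argument. Here is a minimal sketch of loading the same model directly with that library, using the path and values from the load log below; this is only an illustration, not the webui's actual loader code.)

from llama_cpp import Llama

# Illustration only (not the webui's loader): load the GGML model with 28 of
# its 40 layers offloaded to the GPU. Path and context size are taken from
# the load log below.
llm = Llama(
    model_path="models/TheBloke_guanaco-13B-GGML-5_1/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=28,
)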

After that, I was also getting an error from bitsandbytes saying it was installed without GPU support; however, Oobabooga still said GPU offloading was working. I had set n-gpu-layers to 28, which used about 7 GB of VRAM. With those settings, my model load looked like this:

llama.cpp: loading model from models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 5005.45 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 28 layers to GPU
llama_model_load_internal: total VRAM used: 6866 MB
......................................................................
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

With this setup (GPU offloading working, bitsandbytes complaining it wasn't installed right), I was getting a slow but fairly consistent ~2 tokens per second. I also decided to do something about the bitsandbytes error, and I found this issue on GitHub where a solution (using pip to install torch 2.0) was posted: https://github.com/oobabooga/text-generation-webui/issues/1969
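
(A quick way to check, after following that, whether the torch build you end up with actually has CUDA support: run python from the prompt that cmd_windows.bat opens and try the lines below. This is just a generic sanity check, not something taken from that issue.)

import torch

# A CUDA-enabled build typically reports a "+cuXXX" version suffix and sees
# the GPU; a CPU-only build prints False for is_available().
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))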

Unfortunately, it seems like fixing the bitsandbytes error is where I eternally screwed up: after doing that, performance inside Oobabooga basically tanked. Outputs tended to be between 0.3 and 0.4 tokens per second, and on top of that, it took minutes before the replies even started generating.

Recently I decided to try to fix it again: a fresh install with the latest one-click installer for Windows, then a git pull inside the directory to make it extra up to date, then the steps for installing llama-cpp-python. Now it's somehow even worse. Not only are generation times slow, but the model is outputting total garbage even when using the same settings as before. Here's what that looks like: https://snipboard.io/3FGprt.jpg (The bit before I ask "Why not?" was from before I started messing with all this stuff.)

Also, dunno if it helps, but here's what was printed in the console:

llama_print_timings:        load time = 30940.60 ms
llama_print_timings:      sample time =    46.62 ms /   200 runs   (    0.23 ms per token)
llama_print_timings: prompt eval time = 118771.28 ms /  1280 tokens (   92.79 ms per token)
llama_print_timings:        eval time = 47860.89 ms /   199 runs   (  240.51 ms per token)
llama_print_timings:       total time = 167738.28 ms
Output generated in 168.18 seconds (1.28 tokens/s, 215 tokens, context 1281, seed 1646446198)
Llama.generate: prefix-match hit

Any help is appreciated. :)

Is there an existing issue for this?

Reproduction

Dunno, as this seems to be unique to me, but this is how I got here: the latest one-click installer, a git pull to the latest commit, then manually re-enabling GPU support by installing llama-cpp-python so I could use GPU offloading.

Screenshot

No response

Logs

No errors show up in the console.

System Info

Entire DxDiag: https://pastebin.com/2bippBX4
IkariDevGIT commented 1 year ago

Hey, I found a workaround: open cmd_windows.bat in oobabooga, then paste in these commands one by one:

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install git+https://github.com/abetlen/llama-cpp-python.git@4bcaa5293c8a7e4f00981516658fa3824c2f1633 --no-cache-dir

Gyramuur commented 1 year ago

> Hey, I found a workaround: open cmd_windows.bat in oobabooga, then paste in these commands one by one:

> pip uninstall -y llama-cpp-python
> set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
> set FORCE_CMAKE=1
> pip install git+https://github.com/abetlen/llama-cpp-python.git@4bcaa5293c8a7e4f00981516658fa3824c2f1633 --no-cache-dir

This has fixed the problem of the garbled outputs, and the generation speed is better, but the issue of it sitting on "is typing" for a while before it starts responding is still there. Response times also seem to vary pretty wildly. Do you know if there's any way around that? This is my console output:

INFO:Loading TheBloke_guanaco-13B-GGML-5_1...
INFO:llama.cpp weights detected: models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 5686.19 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 25 layers to GPU
llama_model_load_internal: total VRAM used: 5672 MB
..............................................................
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 13.69 seconds.

INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 27.54 seconds (1.71 tokens/s, 47 tokens, context 1290, seed 1815948497)
Llama.generate: prefix-match hit
Output generated in 28.93 seconds (1.38 tokens/s, 40 tokens, context 1816, seed 476650210)
Llama.generate: prefix-match hit
Output generated in 25.36 seconds (1.22 tokens/s, 31 tokens, context 1827, seed 1532482713)
Llama.generate: prefix-match hit
Output generated in 9.39 seconds (2.45 tokens/s, 23 tokens, context 1829, seed 433967337)
Llama.generate: prefix-match hit
Output generated in 8.41 seconds (2.74 tokens/s, 23 tokens, context 1829, seed 1981029695)
Llama.generate: prefix-match hit
Output generated in 34.43 seconds (1.02 tokens/s, 35 tokens, context 1831, seed 970550330)
Llama.generate: prefix-match hit
Output generated in 23.97 seconds (0.83 tokens/s, 20 tokens, context 1843, seed 1589169475)
Llama.generate: prefix-match hit
Output generated in 29.08 seconds (1.00 tokens/s, 29 tokens, context 1834, seed 432341947)
Llama.generate: prefix-match hit
Output generated in 32.17 seconds (0.78 tokens/s, 25 tokens, context 1841, seed 658542566)
Llama.generate: prefix-match hit
Output generated in 10.34 seconds (2.71 tokens/s, 28 tokens, context 1841, seed 860323678)
Llama.generate: prefix-match hit
Output generated in 25.09 seconds (1.24 tokens/s, 31 tokens, context 1823, seed 349124508)
jllllll commented 1 year ago

> Hey, I found a workaround: open cmd_windows.bat in oobabooga, then paste in these commands one by one:
>
> pip uninstall -y llama-cpp-python
> set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
> set FORCE_CMAKE=1
> pip install git+https://github.com/abetlen/llama-cpp-python.git@4bcaa5293c8a7e4f00981516658fa3824c2f1633 --no-cache-dir

> This has fixed the problem of the garbled outputs, and the generation speed is better, but the issue of it sitting on "is typing" for a while before it starts responding is still there. Response times also seem to vary pretty wildly. Do you know if there's any way around that?

The delay before generation is because it is loading the model. llama-cpp-python does not load the model until you try to generate with it. I don't know if this is how the webui has implemented it or simply how llama-cpp-python works.
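
If deferred loading really is the cause, one way to hide the cost would be a throwaway warm-up generation right after the model is loaded, so the first real request doesn't pay for it. A minimal llama-cpp-python sketch of the idea (just an illustration, not something the webui currently exposes; the path and values are the ones from the logs above):

from llama_cpp import Llama

llm = Llama(
    model_path="models/TheBloke_guanaco-13B-GGML-5_1/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=25,
)

# Throwaway one-token generation: forces any deferred initialization to run
# now rather than on the first real prompt.
llm("Hello", max_tokens=1)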

Gyramuur commented 1 year ago

> Hey, I found a workaround: open cmd_windows.bat in oobabooga, then paste in these commands one by one:
>
> pip uninstall -y llama-cpp-python
> set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
> set FORCE_CMAKE=1
> pip install git+https://github.com/abetlen/llama-cpp-python.git@4bcaa5293c8a7e4f00981516658fa3824c2f1633 --no-cache-dir

> This has fixed the problem of the garbled outputs, and the generation speed is better, but the issue of it sitting on "is typing" for a while before it starts responding is still there. Response times also seem to vary pretty wildly. Do you know if there's any way around that?

> The delay before generation is because it is loading the model. llama-cpp-python does not load the model until you try to generate with it. I don't know if this is how the webui has implemented it or simply how llama-cpp-python works.

Ah, okay. I think it'd be good if there was an option to keep a model loaded in memory so the user didn't have to wait every time, but at least the other issues are fixed. I'll mark this as closed; thanks, everyone. :D