Closed: Gyramuur closed this issue 1 year ago.
Hey, I found a workaround: open cmd_windows.bat in oobabooga, then paste in these commands one by one:

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install git+https://github.com/abetlen/llama-cpp-python.git@4bcaa5293c8a7e4f00981516658fa3824c2f1633 --no-cache-dir
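If you want to sanity-check the rebuilt wheel before going back into the webui, here is a minimal sketch that loads the model directly with llama-cpp-python. The model path and layer count are just the values from my log below, so adjust them for your setup; with the cuBLAS build active, the load output should include "using CUDA for GPU acceleration" and "BLAS = 1".

```python
# Minimal sketch: load the GGML model directly with llama-cpp-python and
# confirm the startup log reports CUDA/cuBLAS. model_path and n_gpu_layers
# are assumptions taken from the console output below.
from llama_cpp import Llama

llm = Llama(
    model_path=r"models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,       # context size used in the webui
    n_gpu_layers=25,  # layers offloaded to the GPU, as in the log below
)

out = llm("Hello, my name is", max_tokens=16)
print(out["choices"][0]["text"])
```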
This has fixed the garbled outputs and the generation speed is better, but it still sits on "is typing" for a while before it starts responding, and the response times vary somewhat wildly. Do you know if there's any way around that? This is my console output:
INFO:Loading TheBloke_guanaco-13B-GGML-5_1...
INFO:llama.cpp weights detected: models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5686.19 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 25 layers to GPU
llama_model_load_internal: total VRAM used: 5672 MB
..............................................................
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 13.69 seconds.
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 27.54 seconds (1.71 tokens/s, 47 tokens, context 1290, seed 1815948497)
Llama.generate: prefix-match hit
Output generated in 28.93 seconds (1.38 tokens/s, 40 tokens, context 1816, seed 476650210)
Llama.generate: prefix-match hit
Output generated in 25.36 seconds (1.22 tokens/s, 31 tokens, context 1827, seed 1532482713)
Llama.generate: prefix-match hit
Output generated in 9.39 seconds (2.45 tokens/s, 23 tokens, context 1829, seed 433967337)
Llama.generate: prefix-match hit
Output generated in 8.41 seconds (2.74 tokens/s, 23 tokens, context 1829, seed 1981029695)
Llama.generate: prefix-match hit
Output generated in 34.43 seconds (1.02 tokens/s, 35 tokens, context 1831, seed 970550330)
Llama.generate: prefix-match hit
Output generated in 23.97 seconds (0.83 tokens/s, 20 tokens, context 1843, seed 1589169475)
Llama.generate: prefix-match hit
Output generated in 29.08 seconds (1.00 tokens/s, 29 tokens, context 1834, seed 432341947)
Llama.generate: prefix-match hit
Output generated in 32.17 seconds (0.78 tokens/s, 25 tokens, context 1841, seed 658542566)
Llama.generate: prefix-match hit
Output generated in 10.34 seconds (2.71 tokens/s, 28 tokens, context 1841, seed 860323678)
Llama.generate: prefix-match hit
Output generated in 25.09 seconds (1.24 tokens/s, 31 tokens, context 1823, seed 349124508)
The delay before generation is because it is loading the model. llama-cpp-python does not load the model until you try to generate with it. I don't know if this is how the webui has implemented it or simply how llama-cpp-python works.
Ah, okay. I think it'd be good if there was an option to keep a model loaded in memory so the user didn't have to wait every time, but at least the other issues are fixed. I'll mark this as closed; thanks, everyone. :D
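In case it's useful to anyone later: when driving llama-cpp-python directly (outside the webui), the Llama object keeps the model in memory for as long as it's alive, so repeated generations reuse the already-loaded weights. A rough sketch, with the path and settings simply copied from the log above:

```python
# Sketch of keeping one model resident and reusing it across prompts.
# model_path, n_ctx, and n_gpu_layers are assumptions from the log above.
from llama_cpp import Llama

# Load once; only this constructor pays the multi-second load cost.
llm = Llama(
    model_path=r"models\TheBloke_guanaco-13B-GGML-5_1\guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=25,
)

# Each call below reuses the weights that are already in RAM/VRAM.
for prompt in ["Hello!", "Why not?", "Tell me a joke."]:
    result = llm(prompt, max_tokens=32)
    print(result["choices"][0]["text"].strip())
```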
Describe the bug
Hi! So originally I was having a bit of a problem with trying to run local 13B models. I have 32 GB of RAM, 8 GB of VRAM, and an AMD Ryzen 7 3800 (8 cores at 3.9 GHz).
Since I don't have enough VRAM to run a 13B model, I decided to use GGML with GPU offloading via the --n-gpu-layers option. Recently I updated Oobabooga, and in doing so had to re-enable GPU acceleration by reinstalling llama-cpp-python, following this page: https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md#gpu-acceleration
After that, I was also getting an error from bitsandbytes saying it was installed without GPU support; however, Oobabooga still said GPU offloading was working. I had set n-gpu-layers to 28, which used about 7 GB of VRAM. With those settings, my model load looked like this:
With this setup (GPU offloading working, bitsandbytes complaining it wasn't installed right), I was getting a slow but fairly consistent ~2 tokens per second. I then decided to do something about the bitsandbytes error, and I found this issue on GitHub where a solution (using pip to install torch 2.0) was posted: https://github.com/oobabooga/text-generation-webui/issues/1969
Unfortunately, it seems like fixing the bitsandbytes error is where I eternally screwed up: after doing that, performance inside Oobabooga basically tanked. Outputs dropped to between 0.3 and 0.4 tokens per second, and on top of that, it took minutes before replies even started generating.
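In hindsight, a quick check like the one below (run from the cmd_windows.bat environment) would at least show whether the reinstalled torch is a CUDA build or a CPU-only wheel. I can't say for sure that's what went wrong here, but it's a cheap thing to rule out:

```python
# Quick diagnostic sketch: report whether the torch installed in this
# environment can see the GPU at all.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```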
Recently I decided to try to fix it again: a fresh install with the latest one-click installer for Windows, then a git pull inside the directory to make it extra up to date, then going through the steps for installing llama-cpp-python. Now it's somehow even worse: not only are generation times slow, but the model is outputting total garbage even with the same settings as before. Here's what that looks like: https://snipboard.io/3FGprt.jpg (The bit before I ask "Why not?" was from before I started messing with all this stuff.)
Also dunno if it helps, but here's what was generated in console:
Any help is appreciated. :)
Is there an existing issue for this?
Reproduction
Dunno, as this seems to be unique to me, but using the latest one-click installer, git pulling to latest, then manually re-enabling GPU support by installing llama-cpp-python so I could use GPU offloading is how I got here.
Screenshot
No response
Logs
System Info