oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

OSError: [WinError -1073741795] Windows Error 0xc000001d Help fixing? #3475

Closed: fedelrick closed this issue 10 months ago

fedelrick commented 1 year ago

So I've been getting back into IT after years away and have been dabbling with AI models. I successfully ran llama.cpp in w64devkit the other day, although very slowly. I'm now trying to run the same model bin in oobabooga, but I'm getting the error below. Does anyone know how to fix this? I'm hoping it's a simple setting I haven't configured. Any tips would be greatly appreciated. Also, I'm new to GitHub, so be kind haha.

Traceback (most recent call last):
  File "C:\Users\ijasp\Desktop\oobabooga_windows\text-generation-webui\server.py", line 68, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "C:\Users\ijasp\Desktop\oobabooga_windows\text-generation-webui\modules\models.py", line 78, in load_model
    output = load_func_map[loader](model_name)
  File "C:\Users\ijasp\Desktop\oobabooga_windows\text-generation-webui\modules\models.py", line 232, in llamacpp_loader
    from modules.llamacpp_model import LlamaCppModel
  File "C:\Users\ijasp\Desktop\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 11, in <module>
    import llama_cpp
  File "C:\Users\ijasp\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "C:\Users\ijasp\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 1292, in <module>
    llama_backend_init(c_bool(False))
  File "C:\Users\ijasp\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 403, in llama_backend_init
    return _lib.llama_backend_init(numa)
OSError: [WinError -1073741795] Windows Error 0xc000001d

Cregrant commented 1 year ago

This solution works for me: https://github.com/oobabooga/text-generation-webui/issues/3276#issuecomment-1648532571 (of course, you should run it from the /installer_files/env folder).

jllllll commented 1 year ago

Use cmd_windows.bat to run the command from that link. Should work fine if you are using CUDA, which doesn't seem to be the case from your error log. If using CPU-only, run these commands instead:

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_AVX2=off"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps

This is all assuming that the error is caused by a lack of AVX2 support in your CPU.
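
If you want to confirm whether the CPU actually lacks AVX2 before rebuilding, one quick check is the third-party py-cpuinfo package (an extra install, not something the webui ships); a minimal sketch, run from cmd_windows.bat so it uses the webui's Python environment:

rem Optional check with the third-party py-cpuinfo package (extra install, not part of the webui)
python -m pip install py-cpuinfo
python -c "import cpuinfo; flags = cpuinfo.get_cpu_info()['flags']; print('AVX:', 'avx' in flags, '| AVX2:', 'avx2' in flags)"

If AVX2 comes back False, a build or wheel without AVX2 is the way to go.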

TFWol commented 1 year ago

@jllllll Do you know if turning off AVX2 will still allow CUDA to work?

You can see what I mean where I was talking to the koboldcpp dev here.

Edit: I see a new version was released recently, so I'll give it a shot and verify again.

jllllll commented 1 year ago

@TFWol Should work fine. Others have reported this issue before and used non-AVX2 builds to correct it.

The commands I showed above are for a non-CUDA build. I have pre-built wheels for AVX cuBLAS llama-cpp-python: https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels

text-generation-webui uses a separate, renamed package for cuBLAS builds to allow easy out-of-the-box switching between CPU-only and CUDA. This will install an AVX version of that package:

python -m pip install llama-cpp-python-cuda --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX/cu117

You may need to add --force-reinstall --no-deps to that command to replace the existing installation.
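
For example, the full command with those flags added would look like this (same renamed package and wheel index as above):

python -m pip install llama-cpp-python-cuda --prefer-binary --force-reinstall --no-deps --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/textgen/AVX/cu117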

TFWol commented 1 year ago

Thanks for the quick reply and for pointing out the wheels. Your workflow commands gave me a bit more insight into a few things as well.

I'll give that a shot.

fedelrick commented 1 year ago

So many good replies 😅 I'm at work ATM, so this afternoon I'll go through and try to suss it out. So far I've tried it with the quick-install folder and the manual installation. Is CUDA not part of these? :) Sorry if that's a silly question.

TFWol commented 1 year ago

@jllllll Yep! That allowed me to load GGML models in text-gen. I'll revisit the kobold issue at some point. Thanks a bunch!

fedelrick commented 1 year ago

I tried running the above commands, and I also updated the files and ran them again. The full error has changed, but it still ends in the same termination. When running llama.cpp directly I had success using OpenBLAS. Could this be related? Alternatively, can I run llama.cpp directly and then link the webui to it? Or, alternatively again, what if I ran the webui with w64devkit?

jllllll commented 1 year ago

What CPU do you have? It may be missing more than just AVX2. When running llama.cpp directly, what commands did you use to build it? If you build it with -DBUILD_SHARED_LIBS=ON then you can copy the resulting llama.dll file to \installer_files\env\lib\site-packages\llama_cpp to use it with the webui.
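
For reference, a rough sketch of that shared-library route using CMake (this assumes CMake is available rather than the w64devkit Makefile build; the exact output folder for llama.dll depends on the generator, so adjust the paths to your setup):

rem Sketch only: build llama.cpp as a shared library, with AVX2 disabled as above
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_AVX2=off
cmake --build build --config Release
rem The DLL may land in build\bin or build\bin\Release depending on the generator
copy build\bin\Release\llama.dll C:\Users\ijasp\Desktop\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\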

fedelrick commented 1 year ago

I'm currently running it on a basic laptop test bench with a Pentium Silver N6000. No AVX2 support. For llama.cpp I used the following section to build it:

On Windows:

1. Download the latest fortran version of w64devkit.
2. Download the latest version of OpenBLAS for Windows.
3. Extract w64devkit on your PC.
4. From the OpenBLAS zip that you just downloaded, copy libopenblas.a, located inside the lib folder, into w64devkit\x86_64-w64-mingw32\lib.
5. From the same OpenBLAS zip, copy the content of the include folder into w64devkit\x86_64-w64-mingw32\include.
6. Run w64devkit.exe.
7. Use the cd command to reach the llama.cpp folder.
8. From here you can run:

make LLAMA_OPENBLAS=1

I then ran the following:

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# [Optional] for models using BPE tokenizers
ls ./models
65B 30B 13B 7B vocab.json

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

# [Optional] for models using BPE tokenizers
python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128

From there I was able to run the examples and similar, very slowly. Although I think, from memory, to get it to work I had to remove the 3 from all the python3 commands. There was also a step I skipped; I can't remember if it was the quantizing or the converting, but one of them had already been done on the model I had, I THINK. I'll try to do a fresh run this afternoon on both llama.cpp and the webui and report back the exact commands and process I used :D Thanks again so much for your fast replies.

jllllll commented 1 year ago

Looking up the Pentium Silver N6000, I see that it also doesn't support AVX, FMA, or F16C.

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps

If you want OpenBLAS as well, use:

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_BLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps

You may need to add the OpenBLAS include folder to PATH or use: set "BLAS_INCLUDE_DIRS=D:\path\to\OpenBLAS\include"
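
Putting those together, the complete OpenBLAS sequence would look something like this (the OpenBLAS path is a placeholder for wherever you extracted it):

rem Run from cmd_windows.bat; replace the OpenBLAS path with your own
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_BLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS"
set "BLAS_INCLUDE_DIRS=D:\path\to\OpenBLAS\include"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps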

jllllll commented 1 year ago

I now have pre-built CPU-only packages for various CPU instruction sets. This one is built without any of the instructions your CPU lacks, though it does not use OpenBLAS:

python -m pip install llama-cpp-python --force-reinstall --no-deps --index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/basic/cpu
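
Whichever build you end up with, one way to confirm which instruction sets it was actually compiled with is to print llama.cpp's system info from the webui's environment (a sketch using the low-level llama_cpp bindings; run it from cmd_windows.bat, and note the exact output format can vary between versions):

rem Prints the AVX / AVX2 / FMA support line for the installed build
python -c "import llama_cpp; print(llama_cpp.llama_print_system_info().decode())"
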
fedelrick commented 1 year ago

To create a public link, set share=True in launch().
2023-08-08 21:14:25 INFO:Loading ggml.v3.q4_K_S.bin...
2023-08-08 21:14:25 INFO:llama.cpp weights detected: models\ggml.v3.q4_K_S.bin
2023-08-08 21:14:25 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\ggml.v3.q4_K_S.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 14 (mostly Q4_K - Small)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 4045.96 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
2023-08-08 21:14:26 INFO:Loaded the model in 1.68 seconds.

YOU ARE A LEGEND! I did your initial steps with OpenBLAS and it's working. I will try the other steps this weekend and determine what runs faster. Now to optimize :D

Joeweav commented 1 year ago

Thank you jllllll,

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_AVX2=off"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps

Worked like a charm on my old i5 notebook. The only caveat is that it seems you need Visual Studio installed. One of my machines had it and, poof, it was running; the other one lacked it and I got a message saying I needed it to build the pieces.

I can now open up TheBloke_orca_mini_3B-GGML. It is slow, but it runs.

fedelrick commented 1 year ago

Would it be at all possible to get it to run on the integrated graphics at the same time? I'm getting a whopping 0.01 tokens per second 😂

jllllll commented 1 year ago

CLBlast might do that. A bit more complicated to set up if it isn't pre-tuned for that CPU, which it probably isn't.
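
For reference, a CLBlast build of llama-cpp-python follows the same pattern as the earlier commands, with the CLBlast flag added. This is only a sketch: it assumes CLBlast and an OpenCL runtime for the integrated GPU are already installed where CMake can find them, which is the complicated part:

rem Sketch only: requires CLBlast and an OpenCL runtime to already be installed and discoverable by CMake
set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CLBLAST=on -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off"
python -m pip install git+https://github.com/abetlen/llama-cpp-python@v0.1.77 --force-reinstall --no-deps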

TFWol commented 1 year ago

@jllllll You wouldn't happen to have the issue of RAM not being released when loading a GGML model with CUDA, would you?

Even if n-gpu-layers is maxed out.

jllllll commented 1 year ago

@TFWol Yes. For some reason, GPU offloading isn't truly offloading in the current version. Not sure what the cause is. Probably a bug in llama.cpp. This will likely be fixed in the next version of llama-cpp-python.

TFWol commented 1 year ago

Thanks for confirming. It's been driving me nuts.

TFWol commented 11 months ago

@jllllll Sorry for pinging you. Before I bother compiling all the stuff again, I was wondering if GPU offloading is still an issue.

jllllll commented 11 months ago

It seemed to have been fixed at one point, but is currently still keeping memory in RAM on my system.

TFWol commented 11 months ago

Drat. Oh well.

Thank you very much for the reply.

github-actions[bot] commented 10 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.