funkytaco opened this issue 1 year ago
Get a 4090 rig
Similar experience with the 13B model on a MacBook Pro M1. But the 7B model works fine at ~4 tokens/sec with this fork: https://github.com/krychu/llama. Would be curious to hear if it works equally well on a MacBook Air.
Worse. Way worse with your repo. I'm not even sure what your repo adds. It took 80 seconds or so to load, and then it generated 1 token at 0.02 tokens/sec before I gave up on it.
It even broke when trying Apple's Metal backend for PyTorch (https://developer.apple.com/metal/pytorch/), even though the tests showed MPS was working.
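For reference, the availability check from that Apple page looks roughly like this; a minimal sketch that only confirms the MPS backend is present, not that a given app actually uses it:

```python
import torch

# Verify the Metal Performance Shaders (MPS) backend is built and available
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)  # allocate a tensor on the GPU
    print(x)
else:
    print("MPS device not found.")
```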
First time trying this in text-generation-webui, on a 2020 MacBook M1. Any insights on why it might be slow?
```
python3 server.py --listen --trust-remote-code --cpu-memory 8 --gpu-memory 8 --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML --notebook
```
```
2023-07-21 06:44:08 WARNING:trust_remote_code is enabled. This is dangerous.
/opt/homebrew/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
2023-07-21 06:44:09 INFO:Loading TheBloke_Llama-2-13B-chat-GGML...
2023-07-21 06:44:09 INFO:llama.cpp weights detected: models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin
2023-07-21 06:44:09 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8953.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
2023-07-21 06:44:09 INFO:Loaded the model in 0.17 seconds.
2023-07-21 06:44:09 INFO:Loading the extension "openai"...
Starting OpenAI compatible api: OPENAI_API_BASE=http://0.0.0.0:5001/v1
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
```
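From that log, llama.cpp is running entirely on the CPU: the BLAS = 1 flag comes from Apple's Accelerate framework, no Metal offload is reported, and the bitsandbytes warning is irrelevant to the llamacpp loader. Also note the model needs ~8.9 GB plus a 1.6 GB KV cache, so if yours is the 8 GB M1 configuration it will be swapping, which by itself can explain the very low tokens/sec. One thing that may help is rebuilding with Metal and offloading to the GPU; a sketch, assuming the 2023-era llama-cpp-python Metal build flag and text-generation-webui's --n-gpu-layers option, not verified on your machine:

```
# Rebuild llama-cpp-python with the Metal backend compiled in
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python

# Relaunch with layer offload enabled; any n-gpu-layers > 0 turns Metal on
python3 server.py --listen --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML --n-gpu-layers 1 --notebook
```

If memory is the bottleneck, a 7B q4_0 GGML would also fit far more comfortably than the 13B, which matches the earlier report that 7B runs fine on an M1.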