meta-llama / llama

Inference code for Llama models

Extremely slow text generation on Macbook Air 2020 M1 #472

Open funkytaco opened 1 year ago

funkytaco commented 1 year ago

First time trying this in text-generation-webui on a MacBook Air M1 (2020). Any insights on why generation might be this slow? Launch command:

python3 server.py --listen --trust-remote-code --cpu-memory 8 --gpu-memory 8 --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML --notebook

2023-07-21 06:44:08 WARNING:trust_remote_code is enabled. This is dangerous.
/opt/homebrew/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
2023-07-21 06:44:09 INFO:Loading TheBloke_Llama-2-13B-chat-GGML...
2023-07-21 06:44:09 INFO:llama.cpp weights detected: models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin

2023-07-21 06:44:09 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8953.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
2023-07-21 06:44:09 INFO:Loaded the model in 0.17 seconds.
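Note that the log reports mem required = 8953.71 MB plus a 1600 MB KV cache for the 13B Q4_0 model, which is more than the 8 GB of unified memory on a base M1 Air, so the machine is likely swapping. One way to separate webui overhead from llama.cpp itself is to load the same GGML file directly with llama-cpp-python and time generation. A minimal sketch, with the thread count and offload value as assumptions (n_gpu_layers only has an effect if llama-cpp-python was built with Metal support):

# Minimal sketch: load the GGML file directly with llama-cpp-python,
# bypassing the webui, and measure tokens/sec.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_threads=4,     # assumption: number of performance cores on the M1
    n_gpu_layers=1,  # assumption: enables Metal offload if built with Metal
)

start = time.time()
out = llm("Explain what a llama is in one sentence.", max_tokens=64)
elapsed = time.time() - start

completion_tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"({completion_tokens / elapsed:.2f} tokens/s)")

If this direct path is also slow, the bottleneck is memory/compute rather than the webui configuration.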

2023-07-21 06:44:09 INFO:Loading the extension "openai"...
Starting OpenAI compatible api: OPENAI_API_BASE=http://0.0.0.0:5001/v1
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().
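Since the openai extension is loaded, generation speed can also be timed through the API base it prints above. A rough sketch, assuming the extension exposes the standard /v1/completions route (prompt and max_tokens are arbitrary):

# Minimal sketch: time one completion through the OpenAI-compatible endpoint.
import time
import requests

url = "http://0.0.0.0:5001/v1/completions"
payload = {"prompt": "Hello, my name is", "max_tokens": 32}

start = time.time()
resp = requests.post(url, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

print(resp.json()["choices"][0]["text"])
print(f"completed in {elapsed:.1f}s")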

253153 commented 1 year ago

Get a 4090 rig

krychu commented 1 year ago

Similar experience with the 13B model on a MacBook Pro M1. But the 7B model works fine at ~4 tokens/sec with this fork: https://github.com/krychu/llama. Would be curious to hear if it works equally well on a MacBook Air.

funkytaco commented 1 year ago

Worse. Way worse with your repo. I'm not even sure what your repo adds. It took 80 seconds or so to load, and it generated 1 token at 0.02 tokens/s before I gave up on it.

It even broke when trying the Metal backend setup from Apple (https://developer.apple.com/metal/pytorch/), even though the tests showed MPS was working.
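For reference, the Apple page linked above comes down to confirming that the PyTorch MPS backend is both built and available before moving work onto it. A minimal check along those lines (the matmul is just a smoke test); note that the llamacpp loader does not go through PyTorch, so MPS only matters for the transformers-based loaders:

# Minimal sketch: verify the PyTorch MPS (Metal) backend is usable.
import torch

print("MPS built:    ", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.rand(1024, 1024, device=device)
    y = x @ x  # small matmul on the GPU as a smoke test
    print("OK:", y.shape, y.device)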