nixified-ai / flake

A Nix flake for many AI projects

Add CUDA hardware acceleration for textgen #70

Open · 7omb opened 7 months ago

7omb commented 7 months ago

Add cuBLAS hardware acceleration to llama-cpp-python. This allows layers of GGUF models such as Llama-2-13B-chat-GGUF to be offloaded to the GPU with the n-gpu-layers setting:

```
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 5363.06 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/43 layers to GPU
llm_load_tensors: VRAM used: 3439 MB
...................................................................................................
llama_new_context_with_model: kv self size  = 3200.00 MB
llama_new_context_with_model: compute buffer total size =  351.47 MB
llama_new_context_with_model: VRAM scratch buffer: 350.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-25 19:13:45 INFO:Loaded the model in 1.31 seconds.
```
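
For reference, here is a minimal sketch of what this could look like as a nixpkgs overlay. It assumes the package set exposes llama-cpp-python in the Python package set and ships a cudaPackages set; the attribute names are assumptions for illustration, not the actual change in this repo. llama-cpp-python forwards the CMAKE_ARGS environment variable to the vendored llama.cpp CMake build, and LLAMA_CUBLAS was the upstream flag name at the time of this issue:

```nix
# Hypothetical overlay sketch: rebuild llama-cpp-python so its vendored
# llama.cpp is compiled with cuBLAS support.
final: prev: {
  python3 = prev.python3.override {
    packageOverrides = pyFinal: pyPrev: {
      llama-cpp-python = pyPrev.llama-cpp-python.overridePythonAttrs (old: {
        # Passed through to the vendored llama.cpp CMake build.
        CMAKE_ARGS = "-DLLAMA_CUBLAS=on";
        nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [
          prev.cudaPackages.cuda_nvcc
        ];
        buildInputs = (old.buildInputs or [ ]) ++ [
          prev.cudaPackages.libcublas
          prev.cudaPackages.cuda_cudart
        ];
      });
    };
  };
}
```

With a cuBLAS-enabled build like this, the n-gpu-layers setting in textgen controls how many layers are offloaded, as in the log above.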