7omb opened 7 months ago
Add cuBLAS hardware acceleration to llama-cpp-python. This allows layers of GGUF models like Llama-2-13B-chat-GGUF to be offloaded to the GPU with the `n-gpu-layers` setting:
```
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 5363.06 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/43 layers to GPU
llm_load_tensors: VRAM used: 3439 MB
...................................................................................................
llama_new_context_with_model: kv self size = 3200.00 MB
llama_new_context_with_model: compute buffer total size = 351.47 MB
llama_new_context_with_model: VRAM scratch buffer: 350.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-25 19:13:45 INFO:Loaded the model in 1.31 seconds.
```
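For reference, a minimal sketch of what this looks like from llama-cpp-python once a cuBLAS-enabled build is installed. The model path is hypothetical, and `n_gpu_layers=16` mirrors the 16/43 layers offloaded in the log above:

```python
# Build/install llama-cpp-python with cuBLAS support first, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to VRAM.
# Here 16 of the model's 43 layers go to the GPU, as in the log above.
# The model path below is a placeholder; point it at your own GGUF file.
llm = Llama(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",
    n_gpu_layers=16,
)

out = llm("Q: What does cuBLAS accelerate? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising `n_gpu_layers` trades VRAM for speed; set it to a large value (or -1) to offload all layers if they fit in GPU memory.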