chrisbward opened 1 year ago
I suspect this may have to do with the on-board Intel graphics. With koboldcpp I had to set the Platform ID, so I wonder whether something similar is needed here.
I'm able to offload when I use the binaries built from ggllm.cpp:
➜ build git:(master) ✗ pwd
/home/user/Tools/06_MachineLearning/LLM/go-ggllm.cpp/build
➜ build git:(master) ✗ ./bin/falcon_main -t 8 -ngl 100 -b 1 -m /media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin -p "What is a falcon?\n### Response:"
main: build = 859 (c12b2d6)
falcon.cpp: loading model from /media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| Info | format | n_vocab | n_bpe | n_ctx | n_embd | n_head ; kv | n_layer | falcon | ftype | n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| | ggcc v1 | 65024 | 64784 | 512 | 8192 | 128 ; 8 | 60 | 40;40B | 2 | 32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 20804.00 MB of 22718.00 MB (in use: 1913.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 53 (missing 2277 MB)
falcon_model_load_internal: INFO: 52 layers will be offloaded to GPU (layers 1 to 53)
falcon_model_load_internal: mem required = 2615.08 MB (+ 120.00 MB per state)
falcon_model_load_internal: offloading 52 of 60 layers to GPU, weights offloaded 20280.40 MB
falcon_model_load_internal: estimated VRAM usage: 20313 MB
[>-------------------------------------------------] 1% Loading weights
➜ go-ggllm.cpp git:(master) ✗ nvidia-smi
Sun Jul 9 07:32:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Ti Off | 00000000:01:00.0 On | 0 |
| 33% 59C P3 65W / 450W | 2860MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
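On the on-board graphics suspicion: CUDA only enumerates NVIDIA GPUs, so the Intel iGPU shouldn't appear to the CUDA runtime at all (the Platform ID setting in koboldcpp is, as far as I know, an OpenCL concept). A quick way to check what a cgo-built Go binary actually sees is a standalone probe along these lines; this is only a sketch, and the file name `cudaprobe.go` and the /usr/local/cuda paths are assumptions, not anything from go-ggllm.cpp:

```go
// cudaprobe.go — minimal device-visibility check for a cgo binary (hypothetical helper).
package main

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <string.h>
#include <cuda_runtime.h>

// Small C helper so only plain types cross the cgo boundary.
static int device_name(int idx, char *buf, int len) {
	struct cudaDeviceProp prop;
	if (cudaGetDeviceProperties(&prop, idx) != cudaSuccess) return -1;
	strncpy(buf, prop.name, len - 1);
	buf[len - 1] = 0;
	return 0;
}
*/
import "C"
import "fmt"

func main() {
	var count C.int
	if rc := C.cudaGetDeviceCount(&count); rc != C.cudaSuccess {
		fmt.Printf("cudaGetDeviceCount failed: %s\n", C.GoString(C.cudaGetErrorString(rc)))
		return
	}
	fmt.Printf("CUDA devices visible to this process: %d\n", int(count))
	buf := make([]C.char, 256)
	for i := C.int(0); i < count; i++ {
		if C.device_name(i, &buf[0], C.int(len(buf))) == 0 {
			fmt.Printf("  device %d: %s\n", int(i), C.GoString(&buf[0]))
		}
	}
}
```

Running it is just `go run cudaprobe.go` with the same LD_LIBRARY_PATH export as above. If it prints the 3090 Ti but the bindings still report no devices, the problem is more likely in how libggllm.a is built or linked than in device selection.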
I'm experiencing this same issue. Here are my steps to reproduce:
cd go-ggllm.cpp
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
BUILD_TYPE=cublas make clean
BUILD_TYPE=cublas make libggllm.a
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../LocalAI/models/falcon-40b-instruct.ggccv1.q4_0.bin" -t 14 -ngl 100
I get this output, which indicates `no CUDA devices found, falling back to CPU`:
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: WARNING: no CUDA devices found, falling back to CPU
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: mem required = 22895.23 MB (+ 120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB
falcon_model_load_internal: estimated VRAM usage: 32 MB
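One way to narrow this down (again a sketch, not anything from go-ggllm.cpp; file name and paths are assumptions): run a minimal cgo probe via `go run` with exactly the same exports and CGO_LDFLAGS as the failing command above. If the driver version comes back 0, the process can't reach the NVIDIA driver at all, which is the usual reason cudaGetDeviceCount reports no devices; if the versions look right but the count is still 0, it points back at the bindings or the link step.

```go
// cudaversions.go — hypothetical probe: driver vs. runtime visibility from a cgo process.
package main

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <cuda_runtime.h>
*/
import "C"
import "fmt"

func main() {
	var driverVer, runtimeVer, count C.int

	// A driver version of 0 means no usable NVIDIA driver is visible to this process.
	C.cudaDriverGetVersion(&driverVer)
	C.cudaRuntimeGetVersion(&runtimeVer)
	fmt.Printf("driver CUDA version: %d, runtime CUDA version: %d\n",
		int(driverVer), int(runtimeVer))

	if rc := C.cudaGetDeviceCount(&count); rc != C.cudaSuccess {
		fmt.Printf("cudaGetDeviceCount: %s\n", C.GoString(C.cudaGetErrorString(rc)))
		return
	}
	fmt.Printf("device count: %d\n", int(count))
}
```

For example: `CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go run cudaversions.go`.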
@mudler, I'd love your perspective to help get past this issue.