mudler / go-ggllm.cpp

Golang bindings for ggllm.cpp
MIT License

Compiled for cuBLAS, but not offloading to GPU? #2

Status: Open · chrisbward opened this issue 1 year ago

chrisbward commented 1 year ago
➜  go-ggllm.cpp git:(master) ✗ CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin" -t 1 -ngl 100 
falcon.cpp: loading model from /media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |   512 |   8192 |     128 ;   8 |      60 | 40;40B |     2 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: WARNING: no CUDA devices found, falling back to CPU
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: mem required  = 22895.23 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
falcon_model_load_internal: estimated VRAM usage: 32 MB
[==================================================] 100%  Tensors populated, CUDA ready 
falcon_init_from_file: kv self size  =  120.00 MB
Model loaded successfully.
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |   512 |     1 |     0 |     5 |    1688883271 |          UNSPECIFIED | #  2 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+

falcon_predict: prompt: 'Hello
### Response:'
falcon_predict: number of tokens in prompt = 5
  9856 -> 'Hello'
   193 -> '\n'
 19468 -> '###'
 16054 -> ' Response'
    37 -> ':'
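
The log above shows the library was compiled with CUDA support ("using CUDA for GPU acceleration") yet reports no devices at runtime. A minimal cgo program can isolate whether a Go-linked binary in this environment can see the GPU at all; this is only a diagnostic sketch, not part of go-ggllm.cpp, and it assumes the CUDA toolkit lives under /usr/local/cuda:

```go
package main

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <cuda_runtime.h>
*/
import "C"

import "fmt"

func main() {
	// Ask the CUDA runtime how many devices this process can see.
	var count C.int
	if rc := C.cudaGetDeviceCount(&count); rc != C.cudaSuccess {
		fmt.Printf("cudaGetDeviceCount failed: %s\n", C.GoString(C.cudaGetErrorString(rc)))
		return
	}
	fmt.Printf("CUDA devices visible to this cgo binary: %d\n", int(count))
}
```

If this prints 0 devices or an error while nvidia-smi lists the card, the problem is in the environment of the cgo process (library paths, CUDA_VISIBLE_DEVICES, driver mismatch) rather than in the bindings themselves.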
chrisbward commented 1 year ago

I suspect this may have to do with the on-board graphics on my Intel chip? I've seen that in koboldcpp I had to set the Platform ID, so I wonder...

chrisbward commented 1 year ago

I'm able to offload when I use the binaries built from ggllm.cpp:

➜  build git:(master) ✗ pwd
/home/user/Tools/06_MachineLearning/LLM/go-ggllm.cpp/build
➜  build git:(master) ✗ ./bin/falcon_main -t 8 -ngl 100 -b 1 -m /media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin -p "What is a falcon?\n### Response:"
main: build = 859 (c12b2d6)
falcon.cpp: loading model from /media/NAS/MLModels/02_LLMs/falcon-40b-instruct-GGML/falcon-40b-instruct.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |   512 |   8192 |     128 ;   8 |      60 | 40;40B |     2 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 20804.00 MB  of 22718.00 MB (in use: 1913.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 53 (missing 2277 MB)
falcon_model_load_internal: INFO: 52 layers will be offloaded to GPU (layers 1 to 53)
falcon_model_load_internal: mem required  = 2615.08 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 52 of 60 layers to GPU, weights offloaded 20280.40 MB
falcon_model_load_internal: estimated VRAM usage: 20313 MB
[>-------------------------------------------------]   1%  Loading weights               
chrisbward commented 1 year ago
➜  go-ggllm.cpp git:(master) ✗ nvidia-smi
Sun Jul  9 07:32:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off | 00000000:01:00.0  On |                    0 |
| 33%   59C    P3              65W / 450W |   2860MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
rioncarter commented 1 year ago

I'm experiencing this same issue. Here are my steps to reproduce:

cd go-ggllm.cpp
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

BUILD_TYPE=cublas make clean
BUILD_TYPE=cublas make libggllm.a

CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../LocalAI/models/falcon-40b-instruct.ggccv1.q4_0.bin" -t 14 -ngl 100

I get this output, which indicates `no CUDA devices found, falling back to CPU`:

falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: WARNING: no CUDA devices found, falling back to CPU
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: mem required  = 22895.23 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
falcon_model_load_internal: estimated VRAM usage: 32 MB
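
It may also be worth confirming that the archive being linked is really the cuBLAS-enabled one (and not, say, a stale CPU-only libggllm.a picked up via LIBRARY_PATH). A rough check, sketched here as a small Go helper that just shells out to GNU binutils' nm (assumed to be installed) and looks for CUDA-related symbols in the archive built above:

```go
// checkarchive.go: rough diagnostic, not part of go-ggllm.cpp.
// Runs `nm` over libggllm.a and counts CUDA-related symbols so a CPU-only
// build can be distinguished from a cuBLAS build.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("nm", "--defined-only", "libggllm.a").Output()
	if err != nil {
		fmt.Println("running nm failed:", err)
		return
	}
	hits := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(strings.ToLower(line), "cuda") {
			hits++
		}
	}
	if hits == 0 {
		fmt.Println("no CUDA symbols found: this looks like a CPU-only archive")
		return
	}
	fmt.Printf("found %d CUDA-related symbols: the archive was built with CUDA support\n", hits)
}
```

If CUDA symbols are present but the example still reports no devices, the failure is more likely in the runtime environment that `go run` hands the binary than in the build itself.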

@mudler, I'd love your perspective on how to get past this issue.