mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

GPU offloading does not appear to work for falcon-40b #829

Open rioncarter opened 1 year ago

rioncarter commented 1 year ago

LocalAI version: 1.22.0

Environment, CPU architecture, OS, and Version: Linux namehere 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

GPU: NVIDIA RTX 6000 ADA

Describe the bug

I am unable to achieve GPU offloading with Falcon-40b models. Zero layers are offloaded to the GPU even though there should be sufficient VRAM to load the whole quantized model.

To Reproduce

curl --request POST \
  --url http://localhost:8080/v1/completions \
  --header 'Content-Type: application/json' \
  --data '{
     "model": "falcon-40b",   
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
}'

Expected behavior

With NGPULayers set to 100 in the model config, falcon.cpp should report offloading its 60 layers to the GPU (offloading 60 of 60 layers to GPU) instead of 0 of 60.

Logs

localai-api-1 | @@@@@
localai-api-1 | Skipping rebuild
localai-api-1 | @@@@@
localai-api-1 | If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
localai-api-1 | If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
localai-api-1 | CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
localai-api-1 | see the documentation at: https://localai.io/basics/build/index.html
localai-api-1 | Note: See also https://github.com/go-skynet/LocalAI/issues/288
localai-api-1 | @@@@@
localai-api-1 | 4:25AM INF Starting LocalAI using 14 threads, with models path: /models
localai-api-1 | 4:25AM INF LocalAI version: v1.22.0 (bed9570e48581fef474580260227a102fe8a7ff4)
localai-api-1 | 4:25AM DBG Model: falcon-wizard-40b (config: {PredictionOptions:{Model:wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:falcon-wizard-40b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2048 F16:true NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:falcon TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:100 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
localai-api-1 | 4:25AM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
localai-api-1 | 4:25AM DBG Model: falcon-40b (config: {PredictionOptions:{Model:falcon-40b-instruct.ggccv1.q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:falcon-40b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2048 F16:true NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:falcon TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:100 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
localai-api-1 | 4:25AM DBG Model: falcon-7b (config: {PredictionOptions:{Model:falcon-7b-instruct.ggccv1.q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:falcon-7b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2048 F16:true NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:falcon TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:60 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
localai-api-1 | 4:25AM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1 |
localai-api-1 | ┌───────────────────────────────────────────────────┐
localai-api-1 | │ Fiber v2.48.0 │
localai-api-1 | │ http://127.0.0.1:8080 │
localai-api-1 | │ (bound on host 0.0.0.0 and port 8080) │
localai-api-1 | │ │
localai-api-1 | │ Handlers ............ 32 Processes ........... 1 │
localai-api-1 | │ Prefork ....... Disabled PID ................. 7 │
localai-api-1 | └───────────────────────────────────────────────────┘
localai-api-1 |
localai-api-1 |
localai-api-1 | 4:25AM DBG Request received: {"model":"falcon-40b","language":"","n":0,"top_p":0,"top_k":0,"temperature":0.7,"max_tokens":0,"echo":false,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"frequency_penalty":0,"tfz":0,"typical_p":0,"seed":0,"file":"","response_format":"","size":"","prompt":"A long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null}
localai-api-1 | 4:25AM DBG input: &{PredictionOptions:{Model:falcon-40b Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} File: ResponseFormat: Size: Prompt:A long time ago in a galaxy far, far away Instruction: Input: Stop: Messages:[] Functions:[] FunctionCall: Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:}
localai-api-1 | 4:25AM DBG Parameter Config: &{PredictionOptions:{Model:falcon-40b-instruct.ggccv1.q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:falcon-40b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2048 F16:true NUMA:false Threads:14 Debug:true Roles:map[] Embeddings:false Backend:falcon TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:100 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[A long time ago in a galaxy far, far away] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
localai-api-1 | 4:25AM DBG Template found, input modified to: Q: Complete the following text: A long time ago in a galaxy far, far away\nA:
localai-api-1 |
localai-api-1 |
localai-api-1 | 4:25AM DBG Loading model falcon from falcon-40b-instruct.ggccv1.q4_0.bin
localai-api-1 | 4:25AM DBG Loading model in memory from file: /models/falcon-40b-instruct.ggccv1.q4_0.bin
localai-api-1 | 4:25AM DBG Loading GRPC Model falcon: {backendString:falcon modelFile:falcon-40b-instruct.ggccv1.q4_0.bin threads:14 assetDir:/tmp/localai/backend_data context:0xc00003c098 gRPCOptions:0xc000116a20 externalBackends:map[huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
localai-api-1 | 4:25AM DBG Loading GRPC Process%!(EXTRA string=/tmp/localai/backend_data/backend-assets/grpc/falcon)
localai-api-1 | 4:25AM DBG GRPC Service for falcon-40b-instruct.ggccv1.q4_0.bin will be running at: '127.0.0.1:43347'
localai-api-1 | 4:25AM DBG GRPC Service state dir: /tmp/go-processmanager2808078644
localai-api-1 | 4:25AM DBG GRPC Service Started
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43347: connect: connection refused"
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 2023/07/28 04:25:07 gRPC Server listening at 127.0.0.1:43347
localai-api-1 | 4:25AM DBG GRPC Service Ready
localai-api-1 | 4:25AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:/models/falcon-40b-instruct.ggccv1.q4_0.bin ContextSize:2048 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:100 MainGPU: TensorSplit: Threads:14 LibrarySearchPath:}
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon.cpp: loading model from /models/falcon-40b-instruct.ggccv1.q4_0.bin
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon.cpp: file version 10
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | Info | format | n_vocab | n_bpe | n_ctx | n_embd | n_head ; kv | n_layer | falcon | ftype | n_ff |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | | ggcc v1 | 65024 | 64784 | 2048 | 8192 | 128 ; 8 | 60 | 40;40B | 2 | 32768 |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 22449.00 MB)
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: using CUDA for GPU acceleration
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: mem required = 23704.23 MB (+ 480.00 MB per state)
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_model_load_internal: estimated VRAM usage: 32 MB
[==================================================] 100% Tensors populated, CUDA ready derr
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_init_from_file: kv self size = 480.00 MB
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr Prompt: Q: Complete the following text: A long time ago in a galaxy far, far away\nA:
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | | 64 | 1.100 | 0.000 | 0.000 | 80 | 0.000 | 0.700 | 0.000 | 0.70 | 0 | 0.1000 | 5.00000 |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | Generation | Ctx | Batch | Keep | Prom. | Seed | Finetune | Stop |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +------------+-------+-------+-------+-------+---------------+----------------------+------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr | | 2048 | 512 | 0 | 23 | 1690518309 | UNSPECIFIED | # 2 |
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr +------------+-------+-------+-------+-------+---------------+----------------------+------+
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_predict: prompt: 'Q: Complete the following text: A long time ago in a galaxy far, far away\nA:
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr '
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr falcon_predict: number of tokens in prompt = 23
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 60 -> 'Q'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 37 -> ':'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 14437 -> ' Complete'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 248 -> ' the'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 1863 -> ' following'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 2288 -> ' text'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 37 -> ':'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 317 -> ' A'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 916 -> ' long'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 601 -> ' time'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 2323 -> ' ago'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 272 -> ' in'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 241 -> ' a'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 28608 -> ' galaxy'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 1825 -> ' far'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 23 -> ','
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 1825 -> ' far'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 1514 -> ' away'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 71 -> '\'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 89 -> 'n'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 44 -> 'A'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 37 -> ':'
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr 4610 -> '
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr '
localai-api-1 | 4:25AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:43347): stderr

Additional context
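
A quick sanity check worth adding here (a sketch, not part of the original report): confirm that the container running LocalAI can actually see the GPU before digging into backend internals. The api service name below is inferred from the localai-api-1 container prefix in the logs and may differ in your compose setup.

# Is the NVIDIA runtime exposed inside the LocalAI container? (service name "api" is an assumption)
docker compose exec api nvidia-smi

# Are the CUDA runtime libraries the falcon backend needs visible in the container?
docker compose exec api sh -c 'ldconfig -p | grep -Ei "cublas|cudart"'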

rioncarter commented 1 year ago

I re-tried with 1.23.0 and see the same message: offloading 0 of 60 layers to GPU (slightly longer snippet below)

localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_model_load_internal: using CUDA for GPU acceleration
localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_model_load_internal: mem required  = 23704.23 MB (+  480.00 MB per state)
localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_model_load_internal: estimated VRAM usage: 32 MB
[==================================================] 100%  Tensors populated, CUDA ready derr 
localai-api-1  | 3:32AM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:33827): stderr falcon_init_from_file: kv self size  =  480.00 MB

====

In case it's helpful: I was able to use cmp-nct/ggllm.cpp directly (at this point in time: https://github.com/cmp-nct/ggllm.cpp/tree/66aa59e790f097bd8f19e8749c0a7f29e84a0fe1) to offload all falcon-40b layers to the GPU with this CLI:

$ ./falcon_main -m ./models/falcon-40b-instruct.ggccv1.q4_0.bin -n 512 -ngl 100 -b 512 -p "I am the very model of a" -ts 2,1
main: build = 881 (66aa59e)
falcon.cpp: loading model from ./models/falcon-40b-instruct.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |     2 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 71559.00 MB  of 72864.00 MB (in use: 1305.00 MB)
falcon_model_load_internal: INFO: using n_batch larger than 1 requires additional VRAM per device: 1754.00 MB
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: mem required  =  783.05 MB (+  720.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 22921.20 MB
falcon_model_load_internal: estimated VRAM usage: 24676 MB
[==================================================] 100%  Tensors populated, CUDA ready 
falcon_context_prepare: Context falcon_main RAM buffers - key_val =  240.00 MB, Compute =  256.00 MB, Scratch 0 =  831.00 MB, Scratch 1 =  168.00 MB 
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     2 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA RTX 6000 Ada Generation     |   48647 MB |  32928 MB |  15718 MB |      0.0% |   Primary |
|  1 | NVIDIA GeForce RTX 4090            |   24217 MB |  16023 MB |   8194 MB |     66.7% | Secondary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|    | Device summary                     |   72864 MB |  48951 MB |  23913 MB |       N/A |       All |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

I'll try to explore further and would appreciate any tips on where to investigate so I can submit a PR to address the issue.

Thanks!

rioncarter commented 1 year ago

After a bit more investigation I think I'm starting to understand how this fits together. In looking at the mudler/go-ggllm.cpp code referenced in the LocalAI makefile (https://github.com/mudler/go-ggllm.cpp/tree/862477d16eefb0805261c19c9b0d053e3b2b684b) I was able to use GPU offloading for falcon-40b:

$ ./bin/falcon_main -t 8 -ngl 100 -b 512 -m ./models/falcon-40b-instruct.ggccv1.q4_0.bin -p "What is a falcon?\n### Response:"
main: build = 859 (c12b2d6)
falcon.cpp: loading model from ../../LocalAI/models/falcon-40b-instruct.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |   512 |   8192 |     128 ;   8 |      60 | 40;40B |     2 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 71986.00 MB  of 72864.00 MB (in use:  877.00 MB)
falcon_model_load_internal: INFO: using n_batch larger than 1 requires additional VRAM per device: 1754.00 MB
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
falcon_model_load_internal: mem required  =  272.03 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 22921.20 MB
falcon_model_load_internal: estimated VRAM usage: 24676 MB
[==================================================] 100%  Tensors populated, CUDA ready 
falcon_init_from_file: kv self size  =  120.00 MB

Now I'm puzzling through how the number of GPU layers gets passed to go-ggllm. I see this in falcon.go:

    if opts.NGPULayers != 0 {
        ggllmOpts = append(ggllmOpts, ggllm.SetGPULayers(int(opts.NGPULayers)))
    }

If I follow opts.NGPULayers back, it appears to be set from c.NGPULayers in options.go... however, I'm lost as to how that value gets set. (In GoLand I used 'Find Usages' and didn't see a clear place where the value is set.)

I would appreciate any comments or guidance on where to look further. It's my goal to use golang and LocalAI.
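
As a side note for anyone retracing this: a plain-text search over the repository is often quicker than IDE usage lookups for finding where that value originates. A minimal sketch (the gpu_layers YAML key is how the LocalAI docs spell the setting; treat the exact key name and file layout as assumptions for this version):

cd LocalAI
# Where is NGPULayers assigned or copied between structs?
grep -RIn "NGPULayers" --include="*.go" .
# Where does the model YAML field enter the config?
grep -RIn "gpu_layers" --include="*.go" --include="*.yaml" .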

rioncarter commented 1 year ago

While adding comments to go-ggllm/falcon.go I found that the expected count of GPU layers is being passed to go-ggllm. Here's the function I changed:

func New(model string, opts ...ModelOption) (*Falcon, error) {
    mo := NewModelOptions(opts...)
    log.Println("++++++++++++++++++++ mo.NGPULayers", mo.NGPULayers)
    modelPath := C.CString(model)
    result := C.falcon_load_model(modelPath, C.int(mo.ContextSize), C.int(mo.Seed), C.bool(mo.F16Memory), C.bool(mo.MLock), C.bool(mo.Embeddings), C.bool(mo.MMap), C.bool(mo.VocabOnly), C.int(mo.NGPULayers), C.int(mo.NBatch), C.CString(mo.MainGPU), C.CString(mo.TensorSplit))
    if result == nil {
        return nil, fmt.Errorf("failed loading model")
    }
    log.Println("++++++++++++++++++++ Model loaded, check above to see if offloading is working as expected")
    ll := &Falcon{state: result, contextSize: mo.ContextSize, embeddings: mo.Embeddings}

    return ll, nil
}

In this test I rebuilt with make clean && make BUILD_TYPE=cublas build, then attempted to use the /v1/completions API again, where I noticed this output:

6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr 2023/07/30 18:53:47 ++++++++++++++++++++ mo.NGPULayers 100
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon.cpp: loading model from models/falcon-40b-instruct.ggccv1.q4_0.bin
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon.cpp: file version 10
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr |          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr |               |    ggcc v1 |   65024 |   64784 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |     2 |  32768 |
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr +---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: using CUDA for GPU acceleration
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: mem required  = 23704.23 MB (+  480.00 MB per state)
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_model_load_internal: estimated VRAM usage: 32 MB
[==================================================] 100%  Tensors populated, CUDA ready 
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr falcon_init_from_file: kv self size  =  480.00 MB
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr 2023/07/30 18:53:47 ++++++++++++++++++++ Model loaded, check above to see if offloading is working as expected
6:53PM DBG GRPC(falcon-40b-instruct.ggccv1.q4_0.bin-127.0.0.1:37425): stderr Prompt: Q: Complete the following text: A long time ago in a galaxy far, far away\nA:

Inspecting the above output shows that mo.NGPULayers is set to 100 as expected... yet after the model loads we see offloading 0 of 60 layers to GPU.

I am not sure what to check next, since when I ran go-ggllm directly (see https://github.com/go-skynet/LocalAI/issues/829#issuecomment-1657186178) it successfully offloads to the GPU.

At this point I would greatly appreciate any assistance from anyone with more knowledge about how these pieces fit together. It is a bit baffling to me why GPU acceleration fails for Falcon-40b and I'm grasping for connective tissue...
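
One way to narrow this down (a sketch, not something from the original thread): since NGPULayers clearly reaches the backend, the next question is whether the falcon gRPC backend binary that LocalAI extracted was itself built against cuBLAS. The backend path below is taken from the DBG log lines above; the api service name is an assumption.

# Inside the running LocalAI container: was the extracted falcon backend linked against CUDA?
docker compose exec api sh -c 'ldd /tmp/localai/backend_data/backend-assets/grpc/falcon | grep -Ei "cublas|cudart"'
# Empty output here would point at a CPU-only backend build rather than a runtime problem.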

rioncarter commented 1 year ago

I re-checked my previous apparent success with go-ggllm and found that I was just executing the compiled C++ executables directly. When I try to use go-ggllm like this:

cd go-ggllm.cpp
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

BUILD_TYPE=cublas make clean
BUILD_TYPE=cublas make libggllm.a

CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../LocalAI/models/falcon-40b-instruct.ggccv1.q4_0.bin" -t 14 -ngl 100

I get this output, which indicates `no CUDA devices found, falling back to CPU`:

falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: WARNING: no CUDA devices found, falling back to CPU
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: mem required  = 22895.23 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
falcon_model_load_internal: estimated VRAM usage: 32 MB

Since the falcon_main executable from ggllm.cpp successfully offloads to the GPU, it seems like there's a linkage issue with go-ggllm. I've commented on an existing go-ggllm issue describing this problem: https://github.com/mudler/go-ggllm.cpp/issues/2#issuecomment-1659092691
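
A follow-up check in the same spirit (a sketch, assuming the BUILD_TYPE flag may not be reaching the libggllm.a target): build the Go example once and inspect the resulting binary for CUDA linkage and for the ggml CUDA kernel symbols. If both show up, the "no CUDA devices found" fallback is more likely a runtime device-visibility or driver/library mismatch than a build problem; if they don't, CPU-only objects were linked.

cd go-ggllm.cpp
# Same environment as the go run invocation above, but produce an inspectable binary
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD \
  go build -o falcon-example ./examples

# Dynamic CUDA dependencies pulled in through CGO_LDFLAGS
ldd ./falcon-example | grep -Ei "cublas|cudart"

# ggml CUDA kernel symbols compiled into libggllm.a (should be non-zero for a cublas build)
nm ./falcon-example | grep -ci ggml_cuda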

localai-bot commented 1 year ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smile_cat: and I can generally provide good tips or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support (yet) replying!

:warning::warning::warning::warning::warning:

Sources:

stefangweichinger commented 10 months ago

Sorry for somehow hijacking this: can't help with the GPU question, as my test server doesn't have one ...

Trying to run falcon-40b on CPU only with 32 threads and 256 GB RAM ... How do I get the model files from https://huggingface.co/tiiuae/falcon-40b into local.ai?

Maybe you already solved your issue; I hope so.