mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

CUDA does not work anymore with llama backend #840

Closed djmaze closed 11 months ago

djmaze commented 1 year ago

LocalAI version: quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11

Environment, CPU architecture, OS, and Version:

Linux glados 6.2.0-26-generic #26-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 23:39:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA-SMI 525.125.06, Driver Version: 525.125.06, CUDA Version: 12.0
RTX 3090, Ubuntu 23.04

Describe the bug

Previously, I had the v1.18.0 image with cuda11 running correctly. Now, after updating the image to v1.22.0, I get the following error in the debug log when trying to do a chat completion with a llama-based model:

stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:2478: CUDA driver version is insufficient for CUDA runtime version

To Reproduce

  1. Run the mentioned docker image on a system with an NVIDIA GPU. (Set PRELOAD_MODELS to e.g. '[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]'.)
  2. Try a chat completion:
    $ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
                            "model": "gpt-3.5-turbo",
                            "messages": [{"role": "user", "content": "How are you?"}],
                            "temperature": 0.9 
                          }' | jq
    {
      "error": {
        "code": 500,
        "message": "could not load model: rpc error: code = Unavailable desc = error reading from server: EOF",
        "type": ""
      }
    }

Expected behavior

The completion result is returned.

Logs

5:12PM DBG Request received: {"model":"gpt-3.5-turbo","language":"","n":0,"top_p":0,"top_k":0,"temperature":0.9,"max_tokens":0,"echo":false,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"frequency_penalty":0,"tfz":0,"typical_p":0,"seed":0,"file":"","response_format":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"How are you?"}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null}
5:12PM DBG Configuration read: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
5:12PM DBG Parameters: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
5:12PM DBG Prompt (before templating): How are you?
5:12PM DBG Template found, input modified to: Q: How are you?\nA: 
5:12PM DBG Prompt (after templating): Q: How are you?\nA: 
5:12PM DBG Loading model llama from open-llama-7b-q4_0.bin
5:12PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
5:12PM DBG Loading GRPC Model llama: {backendString:llama modelFile:open-llama-7b-q4_0.bin threads:4 assetDir:/tmp/localai/backend_data context:0xc00003c088 gRPCOptions:0xc000c1a2d0 externalBackends:map[huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
5:12PM DBG Loading GRPC Process%!(EXTRA string=/tmp/localai/backend_data/backend-assets/grpc/llama)
5:12PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:43913'
5:12PM DBG GRPC Service state dir: /tmp/go-processmanager1220113550
5:12PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43913: connect: connection refused"
5:12PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:43913): stderr 2023/07/30 17:12:57 gRPC Server listening at 127.0.0.1:43913
5:12PM DBG GRPC Service Ready
5:12PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:/models/open-llama-7b-q4_0.bin ContextSize:1024 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:35 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:}
5:12PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:43913): stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:2478: CUDA driver version is insufficient for CUDA runtime version
[10.0.1.8]:60326  500  -  POST     /v1/chat/completions

Additional context

djmaze commented 1 year ago

As an addition: I strace'd the local-ai process inside the container and found that it searches for libcuda.so in vain.
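Roughly, a check along these lines reproduces that observation (assuming strace and pidof are available in the image and that the server process is named local-ai; neither is guaranteed):

# Attach to the running process and watch which shared libraries it tries to open;
# repeated failed openat() calls for libcuda.so mean the driver library is missing.
strace -f -p "$(pidof local-ai)" -e trace=openat 2>&1 | grep -i libcuda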

gregoryca commented 1 year ago

Try bumping your CUDA version to 12.2. It should work with 12.0 without issue, but I had the same error and upgrading to 12.2 made it disappear. I'm running it on an AW 17R5 with a GTX 1080 and an i9, with driver version 535.
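A quick way to compare the two versions involved (standard nvidia-smi options; the grep is just a convenience):

# Driver version and the highest CUDA version that driver supports (on the host):
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"
# The reported CUDA Version must be at least as new as the CUDA runtime the image was
# built against, otherwise llama.cpp aborts with "CUDA driver version is insufficient".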

djmaze commented 1 year ago

Updated my cuda packages to 12.2 and switched the image to -cuda12. Still the same error.

(Update: I had to revert to CUDA 11.8 and the 525 driver from the original Ubuntu repositories because the drivers from Nvidia's repository do not seem to work correctly with Ubuntu 23.04.)

djmaze commented 1 year ago

I think there is something wrong with the image, as libcuda.so is missing from it. There is only the stub version at /usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so, which as I understand it is not enough.
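A quick way to confirm that only the stub is present inside the container (paths vary between image versions):

# What the dynamic linker can see, followed by a filesystem-wide search:
ldconfig -p | grep libcuda
find / -name 'libcuda.so*' -not -path '/proc/*' 2>/dev/null
# If the only hits are under .../lib/stubs/, the real driver library was never mounted in.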

Other open-source AI projects (like text-generation-webui) work around these problems by using nvidia/cuda:xxx as a base image.

Polkadoty commented 1 year ago

Bumping this issue. I wrote my comment a bit too late on issue #812, so I'm adding it here to hopefully get more support. I've now tested on two sets of hardware, my primary computer and my NAS, and both had the same issue.

PC 1: RTX 4090, AMD 7900X, 32 GB DDR5 RAM

PC 2: RTX 2080, AMD 3700X, 64 GB DDR4 RAM

My logs report the same "error when dialing" issue all the way back to 1.21.0; I haven't gone back further than that. I was using CUDA 12.2 and the -cuda12 image. I'll try doing what djmaze mentioned and reverting the CUDA version and the driver to see if that works.

djmaze commented 1 year ago

> I'll try doing what djmaze mentioned and reverting the CUDA version and the driver to see if that works.

@Polkadoty To be clear, reverting the CUDA version did not fix this problem. Updating CUDA from Nvidia's repositories just prevented anything CUDA-related from working on my system, so I had to revert in order to at least make everything else work again.

emakkus commented 1 year ago

Hmm, weird, the 1.23.2 cuda12 version works fine for me.

root@lxdocker:~# nvidia-smi
Mon Aug  7 14:28:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:2E:00.0 Off |                  N/A |
| 30%   37C    P0              89W / 350W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Maybe you need to update your nvidia driver?

amirhmoradi commented 1 year ago

Hi, I have tried every possible way (LocalAI's documentation, GitHub issues in the repo, hours of searching the internet, my own testing...) but I cannot get LocalAI running on the GPU. I have tested quay images from master back to v1.21, but none of them works for me.

=> Please help.

Here is my setup:

On my docker's host:

# nvidia-smi
Wed Aug 16 09:22:26 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070        On  | 00000000:03:00.0 Off |                  N/A |
|  0%   28C    P8               7W / 220W |     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:0A:00.0 Off |                  N/A |
|  0%   25C    P8               5W / 370W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:0B:00.0 Off |                  N/A |
|  0%   26C    P8               6W / 370W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3070        On  | 00000000:0C:00.0 Off |                  N/A |
|  0%   29C    P8               6W / 220W |     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1280      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1280      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      1280      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      1280      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

In the docker container:

root@96bdf1a0d925:/build# ls -alF
total 7399444
drwxr-xr-x 2 root root       4096 Aug  3 09:55 ./
drwxr-xr-x 1 root root       4096 Aug 16 07:35 ../
-rw-r--r-- 1 root root 3785248281 Apr 15 13:17 ggml-gpt4all-j
-rw-r--r-- 1 root root        257 Aug 16 07:35 gpt-3.5-turbo.yaml
-rw-r--r-- 1 root root 3791749248 Aug  3 08:53 open-llama-7b-q4_0.bin
-rw-r--r-- 1 root root         18 Aug 16 07:35 openllama-chat.tmpl
-rw-r--r-- 1 root root         48 Aug 16 07:35 openllama-completion.tmpl

root@96bdf1a0d925:/build# nvidia-smi
Tue Aug 15 16:34:32 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070        On  | 00000000:03:00.0 Off |                  N/A |
|  0%   30C    P8               7W / 220W |     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:0A:00.0 Off |                  N/A |
|  0%   28C    P8               5W / 370W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:0B:00.0 Off |                  N/A |
|  0%   28C    P8               6W / 370W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3070        On  | 00000000:0C:00.0 Off |                  N/A |
|  0%   31C    P8               6W / 220W |     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I have the NVIDIA Container Toolkit correctly installed and have rebooted the server multiple times.
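As a sanity check for the toolkit wiring, running a plain CUDA image with GPU access should print the same table as on the host (the image tag here is just an example):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi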

Here are the docker run commands I have tested, without success:

Master branch:

docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=false -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Rebuild: Off

docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=false -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:v1.24.1-cublas-cuda12-ffmpeg

Rebuild: On

docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=true -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:v1.24.1-cublas-cuda12-ffmpeg

Example output of an execution:

NOTE: It gives errors:

@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name  : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
4:50PM INF Starting LocalAI using 8 threads, with models path: /models
4:50PM INF LocalAI version: v1.24.1 (9cc8d9086580bd2a96f5c96a6b873242879c70bc)
4:50PM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Name:gpt-3.5-turbo F16:true Threads:0 Debug:false Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 NUMA:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false} Step:0})
4:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
4:50PM DBG Config overrides map[batch:512 f16:true gpu_layers:35 mmap:true]
4:50PM DBG Checking "open-llama-7b-q4_0.bin" exists and matches SHA
4:50PM DBG File "open-llama-7b-q4_0.bin" already exists and matches the SHA. Skipping download
4:50PM DBG Prompt template "openllama-completion" written
4:50PM DBG Prompt template "openllama-chat" written
4:50PM DBG Written config file /models/gpt-3.5-turbo.yaml

 ┌───────────────────────────────────────────────────┐
 │                   Fiber v2.48.0                   │
 │               http://127.0.0.1:8080               │
 │       (bound on host 0.0.0.0 and port 8080)       │
 │                                                   │
 │ Handlers ............ 56  Processes ........... 1 │
 │ Prefork ....... Disabled  PID ................ 14 │
 └───────────────────────────────────────────────────┘

[127.0.0.1]:43530  200  -  GET      /readyz
[127.0.0.1]:41682  200  -  GET      /readyz
4:52PM DBG Request received:
4:52PM DBG `input`: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Context:context.Background.WithCancel Cancel:0x4b9060 File: ResponseFormat: Size: Prompt:A long time ago in a galaxy far, far away Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil> Backend: ModelBaseName:}
4:52PM DBG Parameter Config: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.7 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Name: F16:false Threads:8 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[A long time ago in a galaxy far, far away] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false} Step:0}
4:52PM DBG Loading model 'open-llama-7b-q4_0.bin' greedly from all the available backends: llama, gpt4all, falcon, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, bloomz, rwkv, whisper, stablediffusion, piper, /build/extra/grpc/exllama/exllama.py, /build/extra/grpc/huggingface/huggingface.py, /build/extra/grpc/autogptq/autogptq.py, /build/extra/grpc/bark/ttsbark.py, /build/extra/grpc/diffusers/backend_diffusers.py
4:52PM DBG [llama] Attempting to load
4:52PM DBG Loading model llama from open-llama-7b-q4_0.bin
4:52PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
4:52PM DBG Loading GRPC Model llama: {backendString:llama model:open-llama-7b-q4_0.bin threads:8 assetDir:/tmp/localai/backend_data context:0xc00003e0b0 gRPCOptions:0xc0001c2000 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
4:52PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
4:52PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:40559'
4:52PM DBG GRPC Service state dir: /tmp/go-processmanager1136322216
4:52PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40559: connect: connection refused"
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr 2023/08/15 16:52:31 gRPC Server listening at 127.0.0.1:40559
4:52PM DBG GRPC Service Ready
4:52PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:open-llama-7b-q4_0.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/open-llama-7b-q4_0.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false}
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr create_gpt_params: loading model /models/open-llama-7b-q4_0.bin
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr CUDA error 999 at /build/go-llama/llama.cpp/ggml-cuda.cu:4235: unknown error
4:52PM DBG [llama] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
4:52PM DBG [gpt4all] Attempting to load
4:52PM DBG Loading model gpt4all from open-llama-7b-q4_0.bin
4:52PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
4:52PM DBG Loading GRPC Model gpt4all: {backendString:gpt4all model:open-llama-7b-q4_0.bin threads:8 assetDir:/tmp/localai/backend_data context:0xc00003e0b0 gRPCOptions:0xc0001c2000 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
4:52PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
4:52PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:33361'
4:52PM DBG GRPC Service state dir: /tmp/go-processmanager940124080
4:52PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33361: connect: connection refused"
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr 2023/08/15 16:52:32 gRPC Server listening at 127.0.0.1:33361
4:52PM DBG GRPC Service Ready
4:52PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:open-llama-7b-q4_0.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/open-llama-7b-q4_0.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false}
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: format     = ggjt v3 (latest)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_vocab    = 32000
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_ctx      = 2048
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_embd     = 4096
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_mult     = 256
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_head     = 32
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_layer    = 32
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_rot      = 128
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: ftype      = 2 (mostly Q4_0)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_ff       = 11008
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_parts    = 1
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: model size = 7B
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: ggml ctx size =    0.07 MB
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_new_context_with_model: kv self size  = 1024.00 MB
4:52PM DBG [gpt4all] Loads OK
[127.0.0.1]:33344  200  -  GET      /readyz
4:54PM DBG Response: {"object":"text_completion","model":"open-llama-7b-q4_0.bin","choices":[{"index":0,"finish_reason":"stop","text":"… a film called Star Wars: The Force Awakens was released in theaters.\nThe film has become the biggest box office hit of all time, but that wasn’t always the case.\nThe movie was originally slated to be released in April of 2015, but Disney decided to push it back for a few months.\nThe film was originally supposed to be released in December of 2015, but Disney decided to move it back to March of 2016.\nThe film was finally released on March 17, 2016 and has since grossed over $2 billion worldwide.\nThe Force Awakens was originally slated for release in December 2015, but Disney delayed it to April 2016.\nThe movie was released in theaters on March 16, 2017.\nThe film was originally slated for release on March 24, 2018.\nThe film was originally scheduled for release on April 2, 2019.\nThe film was released on April 1, 2020.\nThe film was released on December 21, 2020.\nThe film was released on December 23, 2021.\nThe film was released on December 24, 2022.\nThe film was released on December 31, 2024.\nThe film was released on January 1, 2025.\nThe film was released on February 1, 2026.\nThe film was released on March 1, 2027.\nThe film was released on April 1, 2028.\nThe film was released on May 1, 2029.\nThe film was released on June 1, 2030.\nThe film was released on July 1, 2031.\nThe film was released on August 1, 2032.\nThe film was released on September 1, 2033.\nThe film was released on October 1, 2034.\nThe film was released on November 1, 2035.\nThe film was released on December 1, 2036.\nThe film was released on January 1, 2037"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[192.168.215.248]:62738  200  -  POST     /v1/completions
[127.0.0.1]:36808  200  -  GET      /readyz
gregoryca commented 1 year ago

Could you try and see if this helps you out: https://cloud.apex-migrations.net/s/8sTpCjG44jqxcyw

I've created a folder with some tmpl files and a yml file which is a config for the model binary. Follow the readme and report back; it should help you offload the model to your GPU (just make sure to start with fewer layers than I have, because mine is for 8 GB of VRAM).
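For reference, a rough sketch of such a model config, written here as a heredoc (the field names follow the overrides and debug output earlier in this thread; the exact schema can differ between LocalAI versions, so treat this as an approximation):

# Write a minimal llama model config into the models directory (values are examples):
cat > /models/gpt-3.5-turbo.yaml <<'EOF'
name: gpt-3.5-turbo
backend: llama
parameters:
  model: open-llama-7b-q4_0.bin
context_size: 1024
f16: true
gpu_layers: 35        # lower this if you have less VRAM
mmap: true
template:
  chat: openllama-chat
  completion: openllama-completion
EOF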

gitwittidbit commented 1 year ago

I'm afraid I'm having the same problem.

After getting it to work just with the CPU, I am now trying to get it to work also with an NVIDIA GPU. So I set up a new VM with Debian 12, installed docker, nvidia drivers, container toolkit etc from scratch.

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06    Driver Version: 525.125.06    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8    23W / 370W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Set up local.ai as per instructions. Loaded model as per instructions.

But when I run a query, it returns:

{"error":{"code":500,"message":"could not load model: rpc error: code = Unavailable desc = error reading from server: EOF","type":""}}

And in the logs I see:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35319: connect: connection refused"

The only thing out of the ordinary I can think of is that I am trying to run this in a rootless docker container (and to get that working I needed to toggle off cgroups).

Any solution in sight?

Thanks.

gitwittidbit commented 1 year ago

So I recreated everything with rootful docker, but it just does not work. Same errors as before.

Anybody figured it out yet?

What does it mean that the gRPC service is refusing the connection?

gregoryca commented 1 year ago

On my install I also have the same error, but my model does offload to the GPU just fine. Make sure all yml files use the .yaml extension; I had problems getting my YAML config files detected.

gitwittidbit commented 1 year ago

> On my install I also have the same error, but my model does offload to the GPU just fine. Make sure all yml files use the .yaml extension; I had problems getting my YAML config files detected.

Thanks for the feedback.

May I ask (because I am a noob and really wouldn't know how to tell) how you know that your model offloads to the GPU? Do you get a log entry like "model successfully loaded"?

And does it work for you overall (despite the error), or does it not work?

If it does work, can I please ask what your system environment looks like (bare metal or VM, operating system (including version), rootful or rootless docker (including version) or direct build, gpu driver version, etc.)? I would like to try and replicate it here.

I actually did rename my docker-compose.yaml to .yml (as this is what I am used to). But it didn't work before that either, and that should not have an impact, I'm guessing. My lunademo.yaml is actually a .yaml (and I will remember not to change that).

Thanks!

gregoryca commented 1 year ago

Yes, the logs should indeed indicate that the model has been successfully loaded (depending on the number of layers, it offloads some data to your VRAM). You could also run "watch nvidia-smi" on the host to monitor what happens at the driver level of your GPU. It should indicate when a model was loaded and offloaded.
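Concretely, something like this on the host shows whether VRAM usage jumps when the model loads (standard nvidia-smi options):

# Refresh the full table every second:
watch -n 1 nvidia-smi
# Or log just the memory numbers once per second:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1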

djmaze commented 11 months ago

Okay, I found the problem in my case. I am using swarm mode, and it turns out I needed to explicitly set the env variable NVIDIA_VISIBLE_DEVICES on the container. This variable is set by default in the official CUDA images, which explains why I don't have this problem in other OSS AI projects that use those official images as a base.
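For anyone hitting the same thing in swarm mode, the fix boils down to something like this (the service name is an example; NVIDIA_DRIVER_CAPABILITIES is the companion variable the official images also set):

# Add the GPU-related env vars that the official nvidia/cuda base images export by default:
docker service update \
  --env-add NVIDIA_VISIBLE_DEVICES=all \
  --env-add NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  local-ai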

It seems to me all the other problems reported here have different causes, so I will close this issue. Feel free to open new issues as necessary.