mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. It can generate text, audio, video, and images, and also offers voice cloning.
https://localai.io
MIT License

docker container with CUDA12 #1178

Open stefangweichinger opened 8 months ago

stefangweichinger commented 8 months ago

LocalAI version:

Environment, CPU architecture, OS, and Version: Linux fedora 6.5.6-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux

Describe the bug

Trying to follow https://localai.io/howtos/easy-model-import-gallery/

I'd like to use CUDA. I installed the toolkit and rebooted.

 nvidia-smi 
Mon Oct 16 19:05:10 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    Off | 00000000:01:00.0 Off |                  N/A |
|  0%   50C    P8              15W / 175W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Followed https://localai.io/howtos/easy-setup-docker-gpu/

Recompiled / rebuilt the container, etc.

I get:

 stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version

Why is that? I compiled everything on this fresh Fedora box. Where is the mismatch?
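
A quick way to see where the mismatch sits (a sketch; it assumes the compose service is called api and that the NVIDIA runtime injects nvidia-smi into the container, which only happens when the GPU is actually passed through):

# on the host: driver and CUDA version according to the kernel driver
nvidia-smi | head -n 4

# inside the running LocalAI container
docker compose exec api nvidia-smi

If the second command fails or reports a lower CUDA version than the host, the container is not seeing the host driver libraries, which is exactly what "CUDA driver version is insufficient for CUDA runtime version" points at.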

To Reproduce

Expected behavior

A working container using CUDA

Logs

localai-api-1  | I local-ai build info:
localai-api-1  | I BUILD_TYPE: cublas
localai-api-1  | I GO_TAGS: 
localai-api-1  | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1  | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1  | 5:02PM INF Starting LocalAI using 4 threads, with models path: /models
localai-api-1  | 5:02PM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1  | 5:02PM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1  | 5:02PM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1  | 
localai-api-1  |  ┌───────────────────────────────────────────────────┐ 
localai-api-1  |  │                   Fiber v2.49.2                   │ 
localai-api-1  |  │               http://127.0.0.1:8080               │ 
localai-api-1  |  │       (bound on host 0.0.0.0 and port 8080)       │ 
localai-api-1  |  │                                                   │ 
localai-api-1  |  │ Handlers ............ 71  Processes ........... 1 │ 
localai-api-1  |  │ Prefork ....... Disabled  PID ............. 10497 │ 
localai-api-1  |  └───────────────────────────────────────────────────┘ 
localai-api-1  | 
localai-api-1  | [172.22.0.1]:34580 405 - GET /v1/chat/completions
localai-api-1  | 5:02PM DBG Request received: 
localai-api-1  | 5:02PM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 5:02PM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 5:02PM DBG Prompt (before templating): USER: How are you?
localai-api-1  | 5:02PM DBG Template found, input modified to: USER: How are you?
localai-api-1  | 
localai-api-1  | ASSISTANT:
localai-api-1  | 
localai-api-1  | 5:02PM DBG Prompt (after templating): USER: How are you?
localai-api-1  | 
localai-api-1  | ASSISTANT:
localai-api-1  | 
localai-api-1  | 5:02PM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 5:02PM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 5:02PM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000102820 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1  | 5:02PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1  | 5:02PM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:33301'
localai-api-1  | 5:02PM DBG GRPC Service state dir: /tmp/go-processmanager3078149800
localai-api-1  | 5:02PM DBG GRPC Service Started
localai-api-1  | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33301: connect: connection refused"
localai-api-1  | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr 2023/10/16 17:02:42 gRPC Server listening at 127.0.0.1:33301
localai-api-1  | 5:02PM DBG GRPC Service Ready
localai-api-1  | 5:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1  | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr 
localai-api-1  | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: CUDA driver version is insufficient for CUDA runtime version
localai-api-1  | 5:02PM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:33301): stderr current device: 19566432
localai-api-1  | [172.22.0.1]:48336 500 - POST /v1/chat/completions
localai-api-1  | [127.0.0.1]:46348 200 - GET /readyz

Additional context

djmaze commented 8 months ago

@stefangweichinger I had the same error, but in a different context. As the LocalAI docker images are not based on the official CUDA images by NVIDIA, you might need to explicitly set the NVIDIA_VISIBLE_DEVICES env variable when running the container.

(You could just add NVIDIA_VISIBLE_DEVICES=all to the .env file.)
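
A minimal sketch of the relevant .env lines (values are just examples, adjust to your setup):

# .env read by docker-compose
NVIDIA_VISIBLE_DEVICES=all
# optionally restrict to a single GPU instead:
# NVIDIA_VISIBLE_DEVICES=0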

Aisuko commented 8 months ago

We already set the NVIDIA env variables at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.

stefangweichinger commented 8 months ago

Thanks to @djmaze and @Aisuko .. yes, that variable is in the Dockerfile. So how do I proceed? I assume there may be mismatches between the Fedora and NVIDIA packages? I installed the NVIDIA stuff from here.

The link is for Fedora 37 ... nothing is available for F38 or my F39 beta. So maybe the problem comes from that.

I am a newbie with LocalAI and CUDA. So I am only guessing.

stefangweichinger commented 8 months ago

Corrected my docker-compose.yml to match the one from here.

Added the mentioned variable to .env as well (yes, redundant).

Toggled "REBUILD" (by the way: how do I keep my rebuilt image once it's OK? Just toggle the variable back to no/false?), and restarted. The rebuild ran through, and I get this:

localai-api-1  | I local-ai build info:
localai-api-1  | I BUILD_TYPE: cublas
localai-api-1  | I GO_TAGS: 
localai-api-1  | I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"
localai-api-1  | CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=8034ed3" -X "github.com/go-skynet/LocalAI/internal.Commit=8034ed3473fb1c8c6f5e3864933c442b377be52e"" -tags "" -o local-ai ./
localai-api-1  | 6:35AM INF Starting LocalAI using 8 threads, with models path: /models
localai-api-1  | 6:35AM INF LocalAI version: 8034ed3 (8034ed3473fb1c8c6f5e3864933c442b377be52e)
localai-api-1  | 6:35AM DBG Model: lunademo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
localai-api-1  | 6:35AM DBG Extracting backend assets files to /tmp/localai/backend_data
localai-api-1  | 
localai-api-1  |  ┌───────────────────────────────────────────────────┐ 
localai-api-1  |  │                   Fiber v2.49.2                   │ 
localai-api-1  |  │               http://127.0.0.1:8080               │ 
localai-api-1  |  │       (bound on host 0.0.0.0 and port 8080)       │ 
localai-api-1  |  │                                                   │ 
localai-api-1  |  │ Handlers ............ 71  Processes ........... 1 │ 
localai-api-1  |  │ Prefork ....... Disabled  PID ............. 10493 │ 
localai-api-1  |  └───────────────────────────────────────────────────┘ 
localai-api-1  | 
localai-api-1  | [127.0.0.1]:50752 200 - GET /readyz
localai-api-1  | 6:36AM DBG Request received: 
localai-api-1  | 6:36AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 6:36AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_0.gguf Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:true Threads:8 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:4 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
localai-api-1  | 6:36AM DBG Prompt (before templating): USER: How are you?
localai-api-1  | 6:36AM DBG Template found, input modified to: USER: How are you?
localai-api-1  | 
localai-api-1  | ASSISTANT:
localai-api-1  | 
localai-api-1  | 6:36AM DBG Prompt (after templating): USER: How are you?
localai-api-1  | 
localai-api-1  | ASSISTANT:
localai-api-1  | 
localai-api-1  | 6:36AM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 6:36AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 6:36AM DBG Loading GRPC Model llama: {backendString:llama model:luna-ai-llama2-uncensored.Q4_0.gguf threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0005d4680 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
localai-api-1  | 6:36AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
localai-api-1  | 6:36AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_0.gguf will be running at: '127.0.0.1:37101'
localai-api-1  | 6:36AM DBG GRPC Service state dir: /tmp/go-processmanager834974386
localai-api-1  | 6:36AM DBG GRPC Service Started
localai-api-1  | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37101: connect: connection refused"
localai-api-1  | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr 2023/10/17 06:36:28 gRPC Server listening at 127.0.0.1:37101
localai-api-1  | 6:36AM DBG GRPC Service Ready
localai-api-1  | 6:36AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_0.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
localai-api-1  | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_0.gguf
localai-api-1  | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr 
localai-api-1  | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr CUDA error 100 at /build/go-llama/llama.cpp/ggml-cuda.cu:5522: no CUDA-capable device is detected
localai-api-1  | 6:36AM DBG GRPC(luna-ai-llama2-uncensored.Q4_0.gguf-127.0.0.1:37101): stderr current device: 19566432
localai-api-1  | [172.22.0.1]:37050 500 - POST /v1/chat/completions

So the build is with cuBLAS, but the logs show "CUDA:false" and "no CUDA-capable device".

While:

$ nvidia-smi 
Tue Oct 17 08:35:08 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060 ...    Off | 00000000:01:00.0 Off |                  N/A |
|  0%   48C    P8              15W / 175W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

What am I missing? Thanks for any help here.

stefangweichinger commented 8 months ago

I wonder if the docker-compose syntax is OK in my case, especially the "deploy:" section:

version: '3.6'

services:
  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
    command: ["/usr/bin/local-ai" ]

EDIT: Or does the lunademo model not work with CUDA? As I said: I am only guessing ;-)
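
The deploy: block above looks like the documented Compose GPU syntax, so another thing worth checking (a sketch, independent of LocalAI) is whether Docker has the NVIDIA runtime registered at all:

docker info | grep -iA1 runtimes
# the list should include "nvidia" once the NVIDIA Container Toolkit has been
# configured, e.g. via: sudo nvidia-ctk runtime configure --runtime=docker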

stefangweichinger commented 8 months ago

I now went on to play with examples/chatbot-ui and tried to get that to work with CUDA. That one uses another model, etc.; it works, but it runs on the CPU only.

My edited config:

$ cat docker-compose.yaml 
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu] 
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai" ]
  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'

djmaze commented 8 months ago

We already set the NVIDIA env variables at

https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97

in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)

One could argue that this is a bug.
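
As a workaround until that is changed, the variable can simply be provided at runtime; a sketch (image tag and paths taken from earlier in this thread):

# docker run variant
docker run --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -p 8080:8080 -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:master-cublas-cuda12

# docker-compose variant: add it under the api service
#   environment:
#     - NVIDIA_VISIBLE_DEVICES=all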

stefangweichinger commented 8 months ago

Thanks @djmaze. I also noticed that I had set the env variables for the chatbot-ui container and not for the api container. Switched that, tested ... hmm, no luck.

I have now:

$ cat docker-compose.yaml 
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    # As initially LocalAI will download the models defined in PRELOAD_MODELS
    # you might need to tweak the healthcheck values here according to your network connection.
    # Here we give a timespan of 20m to download all the required files.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 20
    build:
      context: ../../
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu] 
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - 'NVIDIA_VISIBLE_DEVICES=all'
      - 'CUDA_VISIBLE_DEVICES=all'
      - 'CUDA_DEVICE_POOL_GPU_OVERRIDE=1'
      # You can preload different models here as well.
      # See: https://github.com/go-skynet/model-gallery
      - 'PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"},
                         {"url": "github:go-skynet/model-gallery/llama2-chat.yaml", "name": "llama2-chat"},
                         {"url": "github:go-skynet/model-gallery/stablediffusion.yaml", "name": "stablediffusion"}]'
    volumes:
      - ./models:/models:cached,Z
    command: ["/usr/bin/local-ai" ]
  chatgpt:
    depends_on:
      api:
        condition: service_healthy
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - 3000:3000
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://api:8080'

What do you think?

stefangweichinger commented 8 months ago

I also edited /etc/nvidia-container-runtime/config.toml to fix the permissions. I run the api container in privileged mode now ... CUDA still isn't being used, as far as I can tell.

Used different docker images:

Rebuilt images.

I still don't see any processes in nvidia-smi when I run my LocalAI stack.
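
As a sanity check that is independent of LocalAI (a sketch, assuming the NVIDIA Container Toolkit is installed on the host):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# if this shows the RTX 2060, the Docker/driver plumbing works and the problem
# is specific to the LocalAI container or its configuration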

stefangweichinger commented 8 months ago

I think it is SELinux:

Okt 17 16:10:39 fedora audit[3085]: AVC avc:  denied  { getattr } for  pid=3085 comm="nvidia-smi" path="/dev/nvidiactl" dev="devtmpfs" ino=796 scontext=system_u:system_r:container_t:s0:c566,c905 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=1
Okt 17 16:10:39 fedora 5840bb46c0e3[1199]: Failed to initialize NVML: Unknown Error

I have it set to permissive already. I will see how to fix that.
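
If SELinux turns out to be the culprit, two commonly suggested workarounds (sketches, not LocalAI-specific; use with care) are disabling label separation for this one service, or allowing containers to use host devices system-wide:

# docker-compose.yaml, under the api service:
#   security_opt:
#     - label=disable

# or, host-wide on Fedora:
sudo setsebool -P container_use_devices on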

stefangweichinger commented 8 months ago

In other projects I can use CUDA within Docker just fine. Just mentioning it.

larkinwc commented 7 months ago

Just coming back to this project, I'm surprised that CUDA acceleration is listed as a feature but fails to work with the latest containers. I have reproduced the issue on both the master branch and v1.30...

Someguitarist commented 7 months ago

I'm on Ubuntu 22 LTS and I get the same error:

2:46PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:44099'
2:46PM DBG GRPC Service state dir: /tmp/go-processmanager1917593952
2:46PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44099: connect: connection refused"

No matter what I do. If I use any model on the CPU it works; however, using any model with GPU support gives me that error. I've tried disabling UFW and restarting, and I still get the same.

Has anyone made any progress with this error? Just as an FYI, I manage to run Frigate, Plex, and text-webui-AI and they each reach the GPU fine, so I don't think it's an issue with my setup.

vrijsinghani commented 7 months ago

Using Ubuntu 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):

docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true

Replace the --user argument with your user ID (this helps if you use model/apply to download models from the gallery), the -v argument with your path to the models dir, the --context-size argument with your desired value, and --threads with your desired value.

Someguitarist commented 7 months ago

I figured my error out. It still says that it can't connect, but you can ignore that. My issue was that not all four files were available for the model I was using: the template, the chat, the completion, and so on.
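
For reference, the lunademo howto ends up with roughly this layout in the models directory (file names inferred from the config dump in the logs above, so treat it as a sketch):

models/
  luna-ai-llama2-uncensored.Q4_0.gguf    # the model weights
  lunademo.yaml                          # model config (backend, context size, gpu_layers, ...)
  lunademo-chat.tmpl                     # chat prompt template
  lunademo-completion.tmpl               # completion prompt template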

Someguitarist commented 7 months ago

There was another integration that did that; it has the model output the request in JSON, and it used its own integration to take the JSON request and send it to the Home Assistant API.

I wasn't able to edit it to work, however, as it was looking for the OpenAI integration to modify instead of Extended_OpenAI or Custom_OpenAI, but the blog and integration are linked here if anyone wants to try that as well:

https://blog.teagantotally.rocks/2023/06/05/openai-home-assistant/

I haven't had much of a chance to mess around this week, what with Thanksgiving and all, but I'm glad to see more people getting interested in it!

thiner commented 7 months ago

Using Ubuntu 22.04, this command runs and uses the GPU for me (NVIDIA RTX 3090):

docker run --gpus all --user 1000:1000 -p 5000:8080 -v /mnt/a/ml/LocalAI/models:/models -ti --rm quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg --models-path=/models --context-size=16384 --threads=16 --f16=true --debug=true --single-active-backend=true

Replace the --user argument with your user ID (this helps if you use model/apply to download models from the gallery), the -v argument with your path to the models dir, the --context-size argument with your desired value, and --threads with your desired value.

I am getting the below issue if I add --gpus all: could not select device driver "" with capabilities: [[gpu]].

Taronyuu commented 6 months ago

@thiner Did you install the NVIDIA Container Toolkit? It is required to run Docker containers with CUDA support.
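
On Ubuntu 22.04 the installation roughly looks like this (a sketch based on NVIDIA's documentation; it assumes the NVIDIA apt repository is already configured, so check the current install guide first):

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# register the runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker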

thiner commented 6 months ago

@thiner Did you install the NVIDIA Container Toolkit? It is required to run Docker containers with CUDA support.

You are right. The issue was caused by a missing GPU driver. I deployed the LocalAI image to a k8s cluster, but didn't realize that the cluster nodes need the driver installed first. The problem has been solved. Thanks for your reply.

domi-bue commented 5 months ago

Hello,

I'm having the same issue. The container detects the GPU, but it uses the CPU all the time. These are the logs that show it detects the GPU:

11:37AM DBG GRPC Service Ready
11:37AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:em_german_mistral_v01.Q4_0.gguf ContextSize:16384 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/em_german_mistral_v01.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr ggml_init_cublas: found 1 CUDA devices:
11:37AM DBG GRPC(em_german_mistral_v01.Q4_0.gguf-127.0.0.1:43937): stderr   Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes

Both the NVIDIA drivers and nvidia-container-toolkit are installed.
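
For what it's worth, the config dump above shows NGPULayers:0, i.e. no layers are offloaded to the GPU even though CUDA is initialized. In LocalAI this is usually controlled per model via gpu_layers in the model's YAML config; a minimal sketch (file name and values are only examples):

# em_german_mistral_v01.yaml
name: em_german_mistral_v01
backend: llama
parameters:
  model: em_german_mistral_v01.Q4_0.gguf
f16: true
gpu_layers: 35   # number of layers to offload to the GPU; 0 keeps everything on the CPU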

jreusser commented 2 weeks ago

We already set the NVIDIA env variables at https://github.com/go-skynet/LocalAI/blob/3f3162e57c35605ce520a75df0bfe7ace2f73cad/Dockerfile#L95-L97 in the Dockerfile.

That is correct, but it is only set in the intermediate builder image, not in the final image. (You can also see the final image contents here.)

One could argue that this is a bug.

I agree. After setting NVIDIA_VISIBLE_DEVICES=all in my .env, I am again utilizing my NVIDIA card. Prior to this, using the CUDA All-In-One (AIO) image was failing with:

localai-api-1 | 1:41PM INF GPU device found but no CUDA backend present

I am on a fresh Ubuntu 22.04 install, and after updating nvidia-smi and various NVIDIA drivers I was able to get side-by-side parity with Windows 10 performance.