mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

GPU layers param breaks the model i.e. I am not able to utilise my GPU for llama 2 #1570

Closed · KramPiotr closed this issue 6 months ago

KramPiotr commented 8 months ago

LocalAI version:

I am on commit: 574fa67bdcafd618859fcda4d239f10f326182a6

Environment, CPU architecture, OS, and Version:

I am on Windows and using WSL2 with Ubuntu 22.04: Linux LAPTOP-FN76EJ99 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

Running llama-2 with the gpu_layers param commented out (no GPU offloading) works, but with one or more GPU layers offloaded the request either errors out or runs much slower (I haven't seen it succeed in this scenario yet). Querying the model with no gpu_layers param completes the basic request in a few minutes.

To Reproduce

The command I use to run LocalAI: docker run --rm -ti --env-file .env --gpus all -p 8080:8080 -v $(pwd)/models:/models quay.io/go-skynet/local-ai:master-cublas-cuda12
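
As a quick sanity check (assuming the NVIDIA Container Toolkit is set up under WSL2), the GPU can be confirmed as visible to containers before starting LocalAI:

# the toolkit mounts nvidia-smi into the container when --gpus is used
docker run --rm --gpus all ubuntu nvidia-smi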

.env:

THREADS=6
MODELS_PATH=/models
DEBUG=true
BUILD_TYPE=cublas

Configuration for the model I am using (llama-2-13b.yaml):

name: llama-2-13b
parameters:
  model: llama-2-13b.Q5_K_M.gguf
  temperature: 0.7
threads: 6
gpu_layers: 1 # <-- when I comment this out, the model works
low_vram: true
cuda: true
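
For reference, a sketch of the same config offloading more layers (the layer count below is only an assumption; how many layers of a Q5_K_M 13B model fit into 8 GB of VRAM has to be found by trial):

name: llama-2-13b-gpu
parameters:
  model: llama-2-13b.Q5_K_M.gguf
  temperature: 0.7
threads: 6
f16: true
gpu_layers: 20 # hypothetical value, raise or lower until it fits in VRAM
low_vram: true
cuda: true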

The command I use for testing:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'

The output of docker run with no GPU layers set (it also contains my GPU information):

@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name      : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
10:45PM DBG no galleries to load
10:45PM INF Starting LocalAI using 6 threads, with models path: /models
10:45PM INF LocalAI version: 574fa67 (574fa67bdcafd618859fcda4d239f10f326182a6)
10:45PM INF Preloading models from /models
10:45PM INF Model name: llama-2-13b
10:45PM INF Model name: llama-2-7b-sm
10:45PM INF Model name: llama-2-7b
10:45PM DBG Model: llama-2-7b-sm (config: {PredictionOptions:{Model:llama-2-7b-chat.Q2_K.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama-2-7b-sm F16:true Threads:6 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:1 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:22 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:true DownloadFiles:[] Description: Usage:})
10:45PM DBG Model: llama-2-7b (config: {PredictionOptions:{Model:llama-2-7b-chat.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama-2-7b F16:true Threads:6 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:1 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:22 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:true DownloadFiles:[] Description: Usage:})
10:45PM DBG Model: llama-2-13b (config: {PredictionOptions:{Model:llama-2-13b.Q5_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama-2-13b F16:false Threads:6 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:true DownloadFiles:[] Description: Usage:})
10:45PM DBG Extracting backend assets files to /tmp/localai/backend_data

 ┌───────────────────────────────────────────────────┐
 │                   Fiber v2.50.0                   │
 │               http://127.0.0.1:8080               │
 │       (bound on host 0.0.0.0 and port 8080)       │
 │                                                   │
 │ Handlers ............ 74  Processes ........... 1 │
 │ Prefork ....... Disabled  PID ................ 14 │
 └───────────────────────────────────────────────────┘

10:45PM DBG Request received:
10:45PM DBG `input`: &{PredictionOptions:{Model:llama-2-13b Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Context:context.Background.WithCancel Cancel:0x4a9a40 File: ResponseFormat:{Type:} Size: Prompt:A long time ago in a galaxy far, far away Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil> Backend: ModelBaseName:}
10:45PM DBG Parameter Config: &{PredictionOptions:{Model:llama-2-13b.Q5_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama-2-13b F16:false Threads:6 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[A long time ago in a galaxy far, far away] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:true Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:true DownloadFiles:[] Description: Usage:}
10:45PM DBG Stopping all backends except 'llama-2-13b.Q5_K_M.gguf'
10:45PM INF Trying to load the model 'llama-2-13b.Q5_K_M.gguf' with all the available backends: llama-cpp, llama-ggml, llama, gpt4all, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers/run.sh
10:45PM INF [llama-cpp] Attempting to load
10:45PM INF Loading model 'llama-2-13b.Q5_K_M.gguf' with backend llama-cpp
10:45PM DBG Loading model in memory from file: /models/llama-2-13b.Q5_K_M.gguf
10:45PM DBG Loading Model llama-2-13b.Q5_K_M.gguf with gRPC (file: /models/llama-2-13b.Q5_K_M.gguf) (backend: llama-cpp): {backendString:llama-cpp model:llama-2-13b.Q5_K_M.gguf threads:6 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0002bc5a0 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
10:45PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
10:45PM DBG GRPC Service for llama-2-13b.Q5_K_M.gguf will be running at: '127.0.0.1:41245'
10:45PM DBG GRPC Service state dir: /tmp/go-processmanager930949986
10:45PM DBG GRPC Service Started
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stdout Server listening on 127.0.0.1:41245
10:45PM DBG GRPC Service Ready
10:45PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:llama-2-13b.Q5_K_M.gguf ContextSize:0 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:true Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:6 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llama-2-13b.Q5_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr ggml_init_cublas: found 1 CUDA devices:
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr   Device 0: NVIDIA GeForce RTX 2070 with Max-Q Design, compute capability 7.5, VMM: yes
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /models/llama-2-13b.Q5_K_M.gguf (version GGUF V2)
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   0:                       general.architecture str              = llama
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   4:                          llama.block_count u32              = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  10:                          general.file_type u32              = 17
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - kv  18:               general.quantization_version u32              = 2
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - type  f32:   81 tensors
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - type q5_K:  241 tensors
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_model_loader: - type q6_K:   41 tensors
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ).
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: format           = GGUF V2
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: arch             = llama
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: vocab type       = SPM
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_vocab          = 32000
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_merges         = 0
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_ctx_train      = 4096
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_embd           = 5120
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_head           = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_head_kv        = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_layer          = 40
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_rot            = 128
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_embd_head_k    = 128
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_embd_head_v    = 128
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_gqa            = 1
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_embd_k_gqa     = 5120
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_embd_v_gqa     = 5120
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: f_norm_eps       = 0.0e+00
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_ff             = 13824
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_expert         = 0
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_expert_used    = 0
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: rope scaling     = linear
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: freq_base_train  = 10000.0
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: freq_scale_train = 1
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: n_yarn_orig_ctx  = 4096
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: rope_finetuned   = unknown
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: model type       = 13B
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: model ftype      = Q5_K - Medium
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: model params     = 13.02 B
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW)
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: general.name     = LLaMA v2
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: BOS token        = 1 '<s>'
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: EOS token        = 2 '</s>'
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: UNK token        = 0 '<unk>'
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_print_meta: LF token         = 13 '<0x0A>'
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_tensors: ggml ctx size       =    0.14 MiB
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_tensors: using CUDA for GPU acceleration
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_tensors: system memory used  = 8801.77 MiB
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_tensors: offloading 0 repeating layers to GPU
10:45PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llm_load_tensors: offloaded 0/41 layers to GPU
[127.0.0.1]:59144 200 - GET /readyz
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr ...................................................................................................
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_new_context_with_model: n_ctx      = 4096
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_new_context_with_model: freq_base  = 10000.0
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_new_context_with_model: freq_scale = 1
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_new_context_with_model: KV self size  = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_build_graph: non-view tensors processed: 844/844
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr llama_new_context_with_model: compute buffer total size = 361.19 MiB
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr Available slots:
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr  -> Slot 0 - max context: 4096
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr all slots are idle and system prompt is empty, clear the KV cache
10:46PM INF [llama-cpp] Loads OK
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr slot 0 is processing [task id: 0]
10:46PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr slot 0 : kv cache rm - [0, end)
[127.0.0.1]:38308 200 - GET /readyz
[127.0.0.1]:59984 200 - GET /readyz
[127.0.0.1]:36880 200 - GET /readyz
[127.0.0.1]:45508 200 - GET /readyz
10:50PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr
10:50PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr print_timings: prompt eval time =    5121.41 ms /    13 tokens (  393.95 ms per token,     2.54 tokens per second)
10:50PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr print_timings:        eval time =  229925.06 ms /   433 runs   (  531.00 ms per token,     1.88 tokens per second)
10:50PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr print_timings:       total time =  235046.47 ms
10:50PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:41245): stderr slot 0 released (447 tokens in cache)
10:50PM DBG Response: {"created":1704840303,"object":"text_completion","id":"5ec9286c-1d69-4823-bc8f-8c9b36645381","model":"llama-2-13b","choices":[{"index":0,"finish_reason":"stop","text":", a young man named Luke Skywalker was thrust into a world of adventure when he met a mysterious old man named Obi-Wan Kenobi. Obi-Wan told Luke about the Force, an energy field created by all living things that binds the galaxy together. Obi-Wan also told Luke about the dark side of the Force, a side that could be used for evil.\nLuke was determined to learn more about the Force and become a Jedi Knight, a member of an ancient order of warriors who used the Force for good. He trained with Obi-Wan and other Jedi Masters, learning how to use the Force to control his emotions and to fight with a lightsaber, a weapon that was powered by the Force.\nAs Luke’s training progressed, he learned that the dark side of the Force was strong in his father, Darth Vader, who had turned to the dark side and become a powerful Sith Lord. Luke was determined to save his father from the dark side, but he knew that it would be a difficult task.\nLuke’s journey took him to many different planets, where he faced many challenges and met many new friends and allies. He fought against the Empire, a tyrannical government that was led by the Emperor, a powerful Sith Lord who was determined to rule the galaxy.\nLuke’s journey was not easy, but he was determined to use the Force for good and to save his father from the dark side. He faced many challenges and made many sacrifices, but in the end, he was able to defeat the Emperor and restore peace to the galaxy.\nLuke’s story is one of courage, determination, and the power of the Force. It is a story that has inspired generations of fans and continues to be a beloved part of pop culture.\nPrevious Post: The Power of the Force: The Story of Luke Skywalker\nNext Post: The Force Awakens: The Story of Luke Skywalker"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.17.0.1]:42966 200 - POST /v1/completions

The lines that change when I set the GPU layers:

10:59PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:40071): stderr llm_load_tensors: offloading 1 repeating layers to GPU
10:59PM DBG GRPC(llama-2-13b.Q5_K_M.gguf-127.0.0.1:40071): stderr llm_load_tensors: offloaded 1/41 layers to GPU

nvidia-smi output when running the inference with 1 GPU layer offloaded:

Wed Jan 10 00:32:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  | 00000000:01:00.0  On |                  N/A |
| N/A   79C    P5              25W /  85W |   3052MiB /  8192MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        22      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        28      C   /llama-cpp                                N/A      |
+---------------------------------------------------------------------------------------+
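
To watch VRAM usage while a request is in flight (standard nvidia-smi options, given here only as a suggestion, not output from this run):

# print memory usage and GPU utilisation once per second
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1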

Expected behavior

Offloading layers to the GPU should make the model run faster, not slower, and not produce errors.

dionysius commented 8 months ago

run faster not slower

(I don't have much experience with runtime performance or llama.cpp internals.) Usually you want to send more than just 1 layer to the GPU, ideally as many as possible, for faster times. But in the end it's a valid configuration. Be aware that the first call also boots up the model, so some time is lost there as well. Can you provide any kind of time benchmark between those options? E.g. prepend the time command to your curl call (as shown below) and/or share the log output regarding stderr print_timings:?
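
For example, the same completion request from above wrapped in time, plus a grep to pull the timing lines out of the container log (the container name is a placeholder; with the interactive docker run above the print_timings lines simply appear in the terminal output):

time curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'

docker logs <localai-container> 2>&1 | grep print_timings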

or erroneous

Once this case happens, what is the output of curl and the local-ai log? Does it show anything other than 200 - POST /v1/completions? Did you make sure that you don't run into a client-side timeout in curl? (See the example below.)
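
One way to rule out a client-side timeout and make the HTTP status explicit (standard curl flags, given only as a suggestion):

# generous explicit timeout, plus status code and total time printed at the end
curl -sS --max-time 600 -w '\nHTTP %{http_code} after %{time_total}s\n' \
  http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2-13b", "prompt": "A long time ago in a galaxy far, far away", "temperature": 0.7}'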

dionysius commented 8 months ago

TLDR:

I was interested in how it looked on my machine: offloading only 1 layer is roughly as slow as CPU-only (about 5.8 s vs 5.4 s per request here), while offloading all 33 layers is several times faster (about 1.4 s).


Details:

docker:

docker run -d --name localai -p 8080:8080 -v models:/models --gpus all -e MODELS_PATH=/models -e DEBUG=true -e THREADS=6 quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg-cor

script:

#!/bin/bash

echo
echo "restart localai container"
docker restart localai
sleep 5

echo
echo "warmup cpu"
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-cpu",
  "messages": [{"role": "user", "content": "Hello! Give only a short answer."}]
}' >/dev/null 2>&1

echo
echo "cpu"
time curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-cpu",
  "messages": [{"role": "assistant", "content": "How old is Mickey Mouse?"}]
}'

echo
echo "restart localai container"
docker restart localai
sleep 5

echo
echo "warmup 1layer"
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-1layer",
  "messages": [{"role": "user", "content": "Hello! Give only a short answer."}]
}' >/dev/null 2>&1

echo
echo "1layer"
time curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-1layer",
  "messages": [{"role": "assistant", "content": "How old is Mickey Mouse?"}]
}'

echo
echo "restart localai container"
docker restart localai
sleep 5

echo
echo "warmup gpu"
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-gpu",
  "messages": [{"role": "user", "content": "Hello! Give only a short answer."}]
}' >/dev/null 2>&1

echo
echo "gpu"
time curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "mistral-test-gpu",
  "messages": [{"role": "assistant", "content": "How old is Mickey Mouse?"}]
}'

output:

bash test.sh

restart localai container
localai

warmup cpu

cpu
{"created":1704887210,"object":"chat.completion","id":"09166843-6cb6-4e6e-874a-525f0aaa47a8","model":"mistral-test-cpu","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\n Mickey Mouse, a fictional character created in 1928, does not have an age in the traditional sense. However, if we were to consider Mickey's age in terms of 
the character's existence, he would be 93 years old as of 2021."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
real    0m5.365s
user    0m0.003s
sys     0m0.000s

restart localai container
localai

warmup 1layer

1layer
{"created":1704887237,"object":"chat.completion","id":"2d4ac21a-c2ba-4e4f-a5bb-ef3e5c724487","model":"mistral-test-1layer","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\n Mickey Mouse, a fictional character created in 1928, does not have an age in the traditional sense. However, if we were to consider Mickey's age in terms 
of the character's existence, he would be 93 years old as of 2021."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
real    0m5.847s
user    0m0.003s
sys     0m0.000s

restart localai container
localai

warmup gpu

gpu
{"created":1704887264,"object":"chat.completion","id":"2969ee7a-5ecb-45c1-a015-2b7af618391e","model":"mistral-test-gpu","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\n Mickey Mouse, a fictional character created in 1928, does not have an age in the traditional sense. However, if we were to consider Mickey's age in terms of 
the character's existence, he would be 93 years old as of 2021."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
real    0m1.428s
user    0m0.003s
sys     0m0.000s

configs (separate in their respective files):

context_size: 1024
name: mistral-test-cpu
parameters:
  model: mistral-7b-openorca.Q3_K_S.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 6
gpu_layers: 0
---
context_size: 1024
name: mistral-test-1layer
parameters:
  model: mistral-7b-openorca.Q3_K_S.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 6
gpu_layers: 1
---
context_size: 1024
name: mistral-test-gpu
parameters:
  model: mistral-7b-openorca.Q3_K_S.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 6
gpu_layers: 100

cpu:

11:30AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr llm_load_tensors: using CUDA for GPU acceleration
11:30AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr WARNING: failed to allocate 3017.28 MB of pinned memory: out of memory
11:30AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr llm_load_tensors: system memory used  = 3017.39 MiB
11:30AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr llm_load_tensors: offloading 0 repeating layers to GPU
11:30AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr llm_load_tensors: offloaded 0/33 layers to GPU
...
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:34563): stderr print_timings:       total time =   5358.12 ms

1layer:

11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr llm_load_tensors: using CUDA for GPU acceleration
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr WARNING: failed to allocate 2927.87 MB of pinned memory: out of memory
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr llm_load_tensors: system memory used  = 2927.98 MiB
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr llm_load_tensors: VRAM used           =   89.41 MiB
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr llm_load_tensors: offloading 1 repeating layers to GPU
11:31AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr llm_load_tensors: offloaded 1/33 layers to GPU
...
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:39319): stderr print_timings:       total time =   5840.45 ms

gpu:

11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: using CUDA for GPU acceleration
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: system memory used  =   53.83 MiB
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: VRAM used           = 2963.56 MiB
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: offloading 32 repeating layers to GPU
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: offloading non-repeating layers to GPU
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr llm_load_tensors: offloaded 33/33 layers to GPU
...
11:32AM DBG GRPC(mistral-7b-openorca.Q3_K_S.gguf-127.0.0.1:35869): stderr print_timings:       total time =    1421.91 ms