Can't switch backend on GPU after diffusers backend is used once

LocalAI version: local-ai 2.1.0 (TrueCharts chart Version: 6.6.1) Cublas Cuda 11 + FFmpeg image

Environment, CPU architecture, OS, and Version: uname -a

Linux truenas 6.1.63-production+truenas #2 SMP PREEMPT_DYNAMIC Mon Dec 18 19:34:42 UTC 2023 x86_64 GNU/Linux

OS: TrueNAS-SCALE-23.10.1 Cobia CPU: Intel Xeon W1290 MB: ASUS Pro WS W480-ACE RAM: 4x32GB Kingston ECC 2933MHz Boot pool: 2x 256GB Samsung 870 Evo Apps pool: 2x 2TB Samsung 970 Evo Plus HBA: Broadcom 9405W-16i Tri-Mode Storage Adapter SAS3616 GPU: ASUS ROG Strix GeForce GTX 1070 OC

Describe the bug After using the diffusers backend once over GPU, every query requesting a different backend fails.

To Reproduce

I am using a fresh install of local-ai 2.1.0 on my SCALE server using CUDA 11, a NVIDIA GTX 1070 gpu, 4 threads and 8GiB max for the container (started with --gpus all).
I configured the following models for GPU inference:

image

name: image
parameters:
  model: SG161222/Realistic_Vision_V4.0_noVAE
backend: diffusers

# Force CPU usage - set to true for GPU
f16: true
gpu_layers: 33
step: 21
diffusers:
  cuda: true # Enable for GPU usage (CUDA)
  scheduler_type: k_dpmpp_sde
  cfg_scale: 2.5
  clip_skip: 1

text

backend: llama
context_size: 2000
f16: true
gpu_layers: 33
mmap: true
##Put settings right here for tunning!! Before name but after Backend!
name: text
parameters:
  model: luna-ai-llama2-uncensored.Q4_K_M.gguf
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: luna-chat
  completion: luna-completion

I made sure to enable SINGLE_ACTIVE_BACKEND after reading https://github.com/mudler/LocalAI/pull/925. I also setup the watchdog envvars accordingly after reading https://github.com/mudler/LocalAI/issues/1202 and https://github.com/mudler/LocalAI/issues/892. Relevant envvars in the container are as following:

BUILD_TYPE=cublas
CORS=true
CORS_ALLOW_ORIGINS=*
DEBUG=true
EXTERNAL_GRPC_BACKENDS=huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh,petals:/build/backend/python/petals/run.sh,transformers:/build/backend/python/transformers/run.sh,sentencetransformers:/build/backend/python/sentencetransformers/run.sh,autogptq:/build/backend/python/autogptq/run.sh,bark:/build/backend/python/bark/run.sh,diffusers:/build/backend/python/diffusers/run.sh,exllama:/build/backend/python/exllama/run.sh,vall-e-x:/build/backend/python/vall-e-x/run.sh,vllm:/build/backend/python/vllm/run.sh,exllama2:/build/backend/python/exllama2/run.sh,transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh
GALLERIES=[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"},{"name":"huggingface","url":"github:go-skynet/model-gallery/huggingface.yaml"}]
HEALTHCHECK_ENDPOINT=http://localhost:8080/readyz
IMAGE_PATH=/images
MODELS_PATH=/models
NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_REQUIRE_CUDA=cuda>=11.0
NVIDIA_VISIBLE_DEVICES=GPU-3ab9c8fb-1166-f464-0b2c-e2ac905f005f
PRELOAD_MODELS=[{"url":"github:go-skynet/model-gallery/diffusers.yaml"}]
REBUILD=false
SINGLE_ACTIVE_BACKEND=true
WATCHDOG_BUSY=true
WATCHDOG_BUSY_TIMEOUT=5m
WATCHDOG_IDLE=true
WATCHDOG_IDLE_TIMEOUT=5m

I then start the container, and send a first query using the text model --> this step loads lama.cpp backend
I get a reply, wait a few minutes and then try again with a query using the image model. --> this step loads diffusers backend. It then starts downloading all the necessary files, loads the model (it takes a bit the first time) and then spits out the image as intended.

So far everything works. I can send new image queries, and they all work.

Now I wait some more and send a new query using the text model again. This is what I get instead this time, inference fails and I get nothing back (full step by step logs available in the Logs section below)

6:49AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:36897): stderr CUDA error 2 at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: out of memory
6:49AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:36897): stderr current device: 0
6:49AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:36897): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: !"CUDA error"

After this point no query works: text queries because of the Out of memory problem, image queries I don't know. If I then wait for the WATCHDOG_IDLE_TIMEOUT to kick in, I get:

6:52AM WRN [WatchDog] Address 127.0.0.1:36897 is idle for too long, killing it
6:52AM ERR [watchdog] Error shutting down model luna-ai-llama2-uncensored.Q4_K_M.gguf: model luna-ai-llama2-uncensored.Q4_K_M.gguf not found

Also, if I let the idle watchdog expire after the first image query, further imagequeries fail with an HTTP error 500 after generation:

Loading pipeline components...:  33%|███▎      | 2/6 [00:00<00:00, 19.41it/s]/opt/conda/envs/diffusers/lib/python3.11/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
7:49AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:33677): stderr   warnings.warn(
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 20.73it/s]
[172.16.2.70]:51988 500 - POST /v1/images/generations

Expected behavior A query requesting a new backend should cause the currently loaded model, if any, in the GPU VRAM to be unloaded and replaced with the one requested, so that queries keep working. While this works when switching from a llama.cpp backend to a diffusers backend, it doesn't work viceversa. Alternatively, the watchdog should be able to kill the existing model after the specified WATCHDOG_IDLE_TIMEOUT value so that further queries with different backends work.

Logs

startup logs for my local-ai instance

Details

``` @@@@@ Skipping rebuild @@@@@ If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed: CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" see the documentation at: https://localai.io/basics/build/index.html Note: See also https://github.com/go-skynet/LocalAI/issues/288 @@@@@ CPU info: model name : Intel(R) Xeon(R) W-1290 CPU @ 3.20GHz flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities CPU: AVX found OK CPU: AVX2 found OK CPU: no AVX512 found @@@@@ 6:25AM INF Starting LocalAI using 4 threads, with models path: /models 6:25AM INF LocalAI version: v2.1.0 (3d83128f169de3676b341245b985af2e50da9c0f) 6:25AM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: text (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: image (config: {PredictionOptions:{Model:SG161222/Realistic_Vision_V4.0_noVAE Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:image F16:true Threads:0 Debug:false Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType: SchedulerType:k_dpmpp_sde EnableParameters: CFGScale:2.5 IMG2IMG:false ClipSkip:1 ClipModel: ClipSubFolder: ControlNet:} Step:21 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: diffusers (config: {PredictionOptions:{Model:diffusers_assets Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:diffusers F16:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Extracting backend assets files to /tmp/localai/backend_data 6:25AM DBG Checking "diffusers_assets/AutoencoderKL-256-256-fp16-opt.param" exists and matches SHA 6:25AM DBG File "diffusers_assets/AutoencoderKL-256-256-fp16-opt.param" already exists and matches the SHA. Skipping download 6:25AM DBG Checking "diffusers_assets/AutoencoderKL-512-512-fp16-opt.param" exists and matches SHA ..... 6:25AM DBG File "diffusers_assets/UNetModel-MHA-fp16.bin" already exists and matches the SHA. Skipping download 6:25AM DBG Checking "diffusers_assets/vocab.txt" exists and matches SHA 6:25AM DBG File "diffusers_assets/vocab.txt" already exists and matches the SHA. Skipping download 6:25AM DBG Written config file /models/diffusers.yaml 6:25AM INF [WatchDog] starting watchdog ┌───────────────────────────────────────────────────┐ │ Fiber v2.50.0 │ │ http://127.0.0.1:8080 │ │ (bound on host 0.0.0.0 and port 8080) │ │ │ │ Handlers ............ 75 Processes ........... 1 │ │ Prefork ....... Disabled PID ................ 20 │ └───────────────────────────────────────────────────┘ [172.16.0.1]:37444 200 - GET /readyz [172.16.0.1]:37458 200 - GET /readyz [172.16.0.1]:37474 200 - GET /readyz [172.16.0.1]:37460 200 - GET /readyz [172.16.0.1]:38788 200 - GET /readyz [172.16.0.1]:38790 200 - GET /readyz 6:26AM DBG [WatchDog] Watchdog checks for busy connections 6:26AM DBG [WatchDog] Watchdog checks for idle connections ```

successful inference with text model (lama.cpp backend)

Details

``` 6:30AM DBG Request received: 6:30AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:30AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:30AM DBG Prompt (before templating): USER:how are you? 6:30AM DBG Template found, input modified to: Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:30AM DBG Prompt (after templating): Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:30AM INF Loading model 'luna-ai-llama2-uncensored.Q4_K_M.gguf' with backend llama 6:30AM DBG llama-cpp is an alias of llama-cpp 6:30AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf' 6:30AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf 6:30AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama-cpp): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00027c960 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:30AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp 6:30AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: '127.0.0.1:46101' 6:30AM DBG GRPC Service state dir: /tmp/go-processmanager3514322108 6:30AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46101: connect: connection refused" 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stdout Server listening on 127.0.0.1:46101 6:30AM DBG GRPC Service Ready 6:30AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:33 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: found 1 CUDA devices: 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/luna-ai-llama2-uncensored.Q4_K_M.gguf (version GGUF V2) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ] 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] ......... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ] 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 0: general.architecture str = llama 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 1: general.name str = tap-m_luna-ai-llama2-uncensored 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 2: llama.context_length u32 = 2048 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 10: general.file_type u32 = 15 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str = llama 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "~~", "~~", "<0x00>", "<... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 18: general.quantization_version u32 = 2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type f32: 65 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type q4_K: 193 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type q6_K: 33 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ). 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: format = GGUF V2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: arch = llama ....... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: ggml ctx size = 0.11 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: using CUDA for GPU acceleration 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: mem required = 70.42 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading 32 repeating layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading non-repeating layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloaded 33/33 layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: VRAM used: 3820.93 MiB [172.16.0.1]:41296 200 - GET /readyz [172.16.0.1]:41294 200 - GET /readyz 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr .................................................................................................. 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: n_ctx = 2000 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: freq_base = 10000.0 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: freq_scale = 1 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_kv_cache_init: VRAM kv self = 1000.00 MB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: KV self size = 1000.00 MiB, K (f16): 500.00 MiB, V (f16): 500.00 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_build_graph: non-view tensors processed: 676/676 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: compute buffer total size = 156.10 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: VRAM scratch buffer: 152.91 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: total VRAM used: 4973.84 MiB (model: 3820.93 MiB, context: 1152.91 MiB) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr Available slots: 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr -> Slot 0 - max context: 2000 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 is processing [task id: 0] 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 : kv cache rm - [0, end) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: prompt eval time = 220.71 ms / 49 tokens ( 4.50 ms per token, 222.01 tokens per second) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: eval time = 658.47 ms / 15 runs ( 43.90 ms per token, 22.78 tokens per second) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: total time = 879.18 ms 6:30AM DBG Response: {"created":1703654745,"object":"chat.completion","id":"e99ce4f3-7475-46ad-b98b-682f557a67ea","model":"text","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I'm doing well, thank you for asking. How about you?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}} [172.16.2.70]:58482 200 - POST /v1/chat/completions 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 released (65 tokens in cache) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr all slots are idle and system prompt is empty, clear the KV cache ```

successful inference using image model (diffusers backend)

Details

``` 6:34AM DBG Request received: 6:34AM DBG Loading model: image 6:34AM DBG Parameter Config: &{PredictionOptions:{Model:SG161222/Realistic_Vision_V4.0_noVAE Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:image F16:true Threads:4 Debug:true Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[pink horse|bad art, ugly face, messed up face, poorly drawn hands, bad hands, professional photo shoot, makeup, photoshop, doll, plastic_doll, silicone, anime, cartoon, fake, filter, airbrush, 3d max, infant, featureless, colourless, impassive, shaders, Watermark, Text, censored, deformed, bad anatomy, disfigured, poorly drawn face, mutated, extra limb, ugly, poorly drawn hands, missing limb, floating limbs, disconnected limbs, disconnected head, malformed hands, long neck, mutated hands and fingers, bad hands, missing fingers, cropped, worst quality, low quality, mutation, poorly drawn, huge calf, bad hands, fused hand, missing hand, disappearing arms, disappearing thigh, disappearing calf, disappearing legs, missing fingers, fused fingers, abnormal eye proportion, Abnormal hands, abnormal legs, abnormal feet, abnormal fingers] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType: SchedulerType:k_dpmpp_sde EnableParameters: CFGScale:2.5 IMG2IMG:false ClipSkip:1 ClipModel: ClipSubFolder: ControlNet:} Step:21 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:34AM INF Loading model 'SG161222/Realistic_Vision_V4.0_noVAE' with backend diffusers 6:34AM DBG Stopping all backends except 'SG161222/Realistic_Vision_V4.0_noVAE' 6:34AM DBG Loading model in memory from file: /models/SG161222/Realistic_Vision_V4.0_noVAE 6:34AM DBG Loading Model SG161222/Realistic_Vision_V4.0_noVAE with gRPC (file: /models/SG161222/Realistic_Vision_V4.0_noVAE) (backend: diffusers): {backendString:diffusers model:SG161222/Realistic_Vision_V4.0_noVAE threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00027c960 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:34AM DBG Loading external backend: /build/backend/python/diffusers/run.sh 6:34AM DBG Loading GRPC Process: /build/backend/python/diffusers/run.sh 6:34AM DBG GRPC Service for SG161222/Realistic_Vision_V4.0_noVAE will be running at: '127.0.0.1:45251' 6:34AM DBG GRPC Service state dir: /tmp/go-processmanager3795930822 6:34AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45251: connect: connection refused" 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`. 0it [00:00, ?it/s]SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Server started. Listening on: 127.0.0.1:45251 6:34AM DBG GRPC Service Ready 6:34AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:SG161222/Realistic_Vision_V4.0_noVAE ContextSize:0 Seed:0 NBatch:0 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:0 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/SG161222/Realistic_Vision_V4.0_noVAE Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType:k_dpmpp_sde CUDA:true CFGScale:2.5 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:1 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Loading model SG161222/Realistic_Vision_V4.0_noVAE... 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Request Model: "SG161222/Realistic_Vision_V4.0_noVAE" 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr ModelFile: "/models/SG161222/Realistic_Vision_V4.0_noVAE" 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr SchedulerType: "k_dpmpp_sde" 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CUDA: true 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CFGScale: 2.5 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CLIPSkip: 1 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr model_index.json: 100%|██████████| 513/513 [00:00<00:00, 4.12MB/s] 6:34AM DBG [WatchDog] Watchdog checks for busy connections 6:34AM DBG [WatchDog] 127.0.0.1:45251: active connection 6:34AM DBG [WatchDog] Watchdog checks for idle connections (…)ature_extractor/preprocessor_config.json: 100%|██████████| 520/520 [00:00<00:00, 4.90MB/s] tokenizer/special_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 3.98MB/s] text_encoder/config.json: 100%|██████████| 612/612 [00:00<00:00, 5.47MB/s] scheduler/scheduler_config.json: 100%|██████████| 725/725 [00:00<00:00, 6.99MB/s]?B/s] unet/config.json: 100%|██████████| 1.61k/1.61k [00:00<00:00, 16.5MB/s]]?B/s] tokenizer/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 1.06MB/s] tokenizer/tokenizer_config.json: 100%|██████████| 737/737 [00:00<00:00, 7.54MB/s] vae/config.json: 100%|██████████| 582/582 [00:00<00:00, 4.82MB/s]8MB/s]]s] tokenizer/vocab.json: 100%|██████████| 1.06M/1.06M [00:00<00:00, 2.09MB/s] ?B/s] 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr .6MB/s] [172.16.0.1]:51272 200 - GET /readyzus-127.0.0.1:45251): stderr 7MB/s] 0<02:46, 20.6MB/s] model.safetensors: 100%|██████████| 492M/492M [00:11<00:00, 41.3MB/s]:01<01:14, 45.0MB/s] 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr 8MB/s]B/s] [172.16.0.1]:50286 200 - GET /readyzus-127.0.0.1:45251): stderr 7MB/s]1<01:26, 34.8MB/s] diffusion_pytorch_model.safetensors: 100%|██████████| 335M/335M [00:11<00:00, 28.9MB/s]] model.safetensors: 100%|██████████| 1.22G/1.22G [00:20<00:00, 59.1MB/s] Fetching 14 files: 21%|██▏ | 3/14 [00:21<01:24, 7.70s/it]4MB/s]<00:46, 63.8MB/s] 6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr G [00:19<00:23, 96.1MB/s] [172.16.0.1]:33770 200 - GET /readyz 46%|████▌ | 1.57G/3.44G [00:21<00:08, 212MB/s] 6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr [00:11<00:00, 41.6MB/s] 6:35AM DBG [WatchDog] Watchdog checks for busy connectionsG/3.44G [00:28<00:01, 227MB/s] diffusion_pytorch_model.safetensors: 100%|██████████| 3.44G/3.44G [00:30<00:00, 112MB/s] Fetching 14 files: 100%|██████████| 14/14 [00:31<00:00, 2.28s/it] 6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Keyword arguments {'guidance_scale': 2.5} are not expected by diffusersPipeline and will be ignored. Loading pipeline components...: 0%| | 0/6 [00:00

local-ai logs when sending a new query with a different backend, after the diffusers backend has been loaded once.

Details

``` 6:16AM DBG Request received: 6:16AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:16AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:16AM DBG Prompt (before templating): USER:how are you? 6:16AM DBG Template found, input modified to: Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:16AM DBG Prompt (after templating): Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:16AM INF Loading model 'luna-ai-llama2-uncensored.Q4_K_M.gguf' with backend llama 6:16AM DBG llama-cpp is an alias of llama-cpp 6:16AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf' 6:16AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf 6:16AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama-cpp): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000392b40 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:16AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp 6:16AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: '127.0.0.1:34401' 6:16AM DBG GRPC Service state dir: /tmp/go-processmanager2101771857 6:16AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34401: connect: connection refused" 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stdout Server listening on 127.0.0.1:34401 [172.16.0.1]:55300 200 - GET /readyz [172.16.0.1]:55298 200 - GET /readyz 6:16AM DBG GRPC Service Ready 6:16AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:33 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: found 1 CUDA devices: 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/luna-ai-llama2-uncensored.Q4_K_M.gguf (version GGUF V2) 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ] 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] ....... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ] 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ] 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 0: general.architecture str = llama 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 1: general.name str = tap-m_luna-ai-llama2-uncensored 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 2: llama.context_length u32 = 2048 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 10: general.file_type u32 = 15 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str = llama 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "~~", "~~", "<0x00>", "<... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 18: general.quantization_version u32 = 2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type f32: 65 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type q4_K: 193 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type q6_K: 33 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ). 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: format = GGUF V2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: arch = llama ...... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: ggml ctx size = 0.11 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: using CUDA for GPU acceleration 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: mem required = 70.42 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloading 32 repeating layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloading non-repeating layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloaded 33/33 layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: VRAM used: 3820.93 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr .............................. 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr CUDA error 2 at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: out of memory 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr current device: 0 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: !"CUDA error" [172.16.2.70]:60382 500 - POST /v1/chat/completions ```

Additional context I waited some time before opening the ticket: local-ai has been working perfectly so far and I only discovered this since I added some diffusers models and started playing with them. I had high hopes after discovering SINGLE_ACTIVE_BACKEND in the issues of the repo: while it now works when switching from text to image (it didn't before), after invoking diffusers just once I get stuck as described.

I also discovered the various watchdogs settings, and had high hopes as well, but as I mentioned it doesn't seem to make it right. It goes without saying that I started with no SINGLE_ACTIVE_BACKEND nor watchdog, and I started debugging from there, so I already tried the various combinations to no avail.

I can share more details if needed, thanks again for this amazing app.

mudler / LocalAI

Can't switch backend on GPU after diffusers backend is used once #1498