Hi, I'm experiencing the same thing. The only way I can use a text query after an image query is if the image query does not use the GPU and only the CPU. Yes, this makes image generation slower, but at least I don't have to `docker compose restart` every time I create an image. It would be great to be able to use the GPU, given how much faster it is.
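For reference, the CPU-only workaround amounts to something like this in the image model config (a sketch based on the `image.yaml` discussed in this thread; exact field layout may vary between LocalAI versions):

```yaml
# image.yaml -- sketch: keep the diffusers backend off the GPU
name: image
backend: diffusers
parameters:
  model: SG161222/Realistic_Vision_V4.0_noVAE
diffusers:
  cuda: false   # CPU only: slower generation, but the GPU stays free for the next llama.cpp query
```

With `cuda: true` the generation is much faster, but then the stuck-VRAM behaviour described in this issue kicks in.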
BTW, I noticed @mariopaolo that you also have `gpu_layers: 33` in your `image.yaml`. I don't think this is actually doing anything (it's just being ignored), but I could be mistaken.
@Anto79-ops both sad and happy to hear you are experiencing the same issue 😅
Regarding the `gpu_layers` parameter, I have successfully set it for text generation; I previously had 10 and I could see it in the logs. In the enclosed log for a successful `llama.cpp` inference you can now see:
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: using CUDA for GPU acceleration
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: mem required = 70.42 MiB
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading 32 repeating layers to GPU
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading non-repeating layers to GPU
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloaded 33/33 layers to GPU
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: VRAM used: 3820.93 MiB
But you are right about `diffusers`, since I can't see it mentioned anywhere in the logs (and I haven't played with it extensively yet). I probably just copy-pasted the basic structure from the `luna` model config, thanks for pointing it out.
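For context, the `text` model config that produces the log above looks roughly like this (a sketch assembled from the settings visible in the startup log further down; exact YAML layout may differ between LocalAI versions):

```yaml
# text.yaml -- sketch: gpu_layers is honoured by the llama backend
name: text
backend: llama
f16: true
gpu_layers: 33        # the log reports "offloaded 33/33 layers to GPU"
context_size: 2000
parameters:
  model: luna-ai-llama2-uncensored.Q4_K_M.gguf
```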
I was wondering if there is any update on this. The problem still exists on v2.8.2. It seems to clear the GPU memory at the start of the image generation query, but subsequent text queries crash because the GPU memory from the previous image query has not been cleared.
I'm also having this issue. Some visualization: the first model loaded (from the AIO image) is "stablediffusion", then text, then multimodal text, then another stable diffusion run. Both the first and the last stable diffusion do NOT get cleared.
EDIT: Also, I suspect this is the problematic function call: https://github.com/mudler/LocalAI/blob/master/pkg/model/initializers.go#L187
EDIT2: OK, it tries to stop the "stablediffusion" backend, but it doesn't actually stop it (https://github.com/mudler/LocalAI/blob/master/pkg/model/process.go#L24):
This should have been fixed by #2720. Closing.
LocalAI version:
`local-ai` 2.1.0 (TrueCharts chart version: 6.6.1), CUBLAS CUDA 11 + FFmpeg image

Environment, CPU architecture, OS, and Version (`uname -a`):
- OS: TrueNAS-SCALE-23.10.1 Cobia
- CPU: Intel Xeon W-1290
- MB: ASUS Pro WS W480-ACE
- RAM: 4x 32GB Kingston ECC 2933MHz
- Boot pool: 2x 256GB Samsung 870 Evo
- Apps pool: 2x 2TB Samsung 970 Evo Plus
- HBA: Broadcom 9405W-16i Tri-Mode Storage Adapter SAS3616
- GPU: ASUS ROG Strix GeForce GTX 1070 OC
Describe the bug
After using the `diffusers` backend once on the GPU, every query requesting a different backend fails.

To Reproduce
I am using a fresh install of `local-ai` 2.1.0 on my SCALE server using CUDA 11, an NVIDIA GTX 1070 GPU, 4 threads and 8 GiB max for the container (started with `--gpus all`). I configured the following models for GPU inference: `image` and `text`. I enabled `SINGLE_ACTIVE_BACKEND` after reading https://github.com/mudler/LocalAI/pull/925, and I also set up the watchdog envvars accordingly after reading https://github.com/mudler/LocalAI/issues/1202 and https://github.com/mudler/LocalAI/issues/892. The relevant envvars are set on the container.
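Roughly, that setup looks like the following (an illustrative sketch only, not my exact values; `SINGLE_ACTIVE_BACKEND` and `WATCHDOG_IDLE_TIMEOUT` are the variables referenced above, while `WATCHDOG_IDLE` and the timeout value shown here are assumptions):

```yaml
# docker-compose.yaml fragment -- illustrative only
services:
  local-ai:
    environment:
      - SINGLE_ACTIVE_BACKEND=true    # keep a single backend loaded at a time (see PR 925)
      - WATCHDOG_IDLE=true            # assumption: enables the idle watchdog
      - WATCHDOG_IDLE_TIMEOUT=15m     # assumption: example value only
```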
1. Query the `text` model --> this step loads the `llama.cpp` backend.
2. Query the `image` model --> this step loads the `diffusers` backend. It then starts downloading all the necessary files, loads the model (it takes a bit the first time) and then spits out the image as intended. So far everything works; I can send new `image` queries, and they all work.
3. Query the `text` model again. This time inference fails and I get nothing back (full step-by-step logs are available in the Logs section below).

From then on, `text` queries fail because of the `Out of memory` problem; for `image` queries I don't know. If I then wait for the `WATCHDOG_IDLE_TIMEOUT` to kick in, I get: (...). After an `image` query, further `image` queries fail with an HTTP error 500 after generation.

Expected behavior
A query requesting a new backend should cause the currently loaded model, if any, to be unloaded from GPU VRAM and replaced with the requested one, so that queries keep working. While this works when switching from a `llama.cpp` backend to a `diffusers` backend, it doesn't work vice versa. Alternatively, the watchdog should be able to kill the existing model after the specified `WATCHDOG_IDLE_TIMEOUT` value, so that further queries with different backends work.

Logs
Startup logs for my `local-ai` instance
Details
``` @@@@@ Skipping rebuild @@@@@ If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed: CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" see the documentation at: https://localai.io/basics/build/index.html Note: See also https://github.com/go-skynet/LocalAI/issues/288 @@@@@ CPU info: model name : Intel(R) Xeon(R) W-1290 CPU @ 3.20GHz flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities CPU: AVX found OK CPU: AVX2 found OK CPU: no AVX512 found @@@@@ 6:25AM INF Starting LocalAI using 4 threads, with models path: /models 6:25AM INF LocalAI version: v2.1.0 (3d83128f169de3676b341245b985af2e50da9c0f) 6:25AM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo F16:true Threads:0 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: text (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:0 Debug:false 
Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: image (config: {PredictionOptions:{Model:SG161222/Realistic_Vision_V4.0_noVAE Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:image F16:true Threads:0 Debug:false Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType: SchedulerType:k_dpmpp_sde EnableParameters: CFGScale:2.5 IMG2IMG:false ClipSkip:1 ClipModel: ClipSubFolder: ControlNet:} Step:21 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Model: diffusers (config: {PredictionOptions:{Model:diffusers_assets Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:diffusers F16:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 
MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false}) 6:25AM DBG Extracting backend assets files to /tmp/localai/backend_data 6:25AM DBG Checking "diffusers_assets/AutoencoderKL-256-256-fp16-opt.param" exists and matches SHA 6:25AM DBG File "diffusers_assets/AutoencoderKL-256-256-fp16-opt.param" already exists and matches the SHA. Skipping download 6:25AM DBG Checking "diffusers_assets/AutoencoderKL-512-512-fp16-opt.param" exists and matches SHA ..... 6:25AM DBG File "diffusers_assets/UNetModel-MHA-fp16.bin" already exists and matches the SHA. Skipping download 6:25AM DBG Checking "diffusers_assets/vocab.txt" exists and matches SHA 6:25AM DBG File "diffusers_assets/vocab.txt" already exists and matches the SHA. Skipping download 6:25AM DBG Written config file /models/diffusers.yaml 6:25AM INF [WatchDog] starting watchdog ┌───────────────────────────────────────────────────┐ │ Fiber v2.50.0 │ │ http://127.0.0.1:8080 │ │ (bound on host 0.0.0.0 and port 8080) │ │ │ │ Handlers ............ 75 Processes ........... 1 │ │ Prefork ....... Disabled PID ................ 20 │ └───────────────────────────────────────────────────┘ [172.16.0.1]:37444 200 - GET /readyz [172.16.0.1]:37458 200 - GET /readyz [172.16.0.1]:37474 200 - GET /readyz [172.16.0.1]:37460 200 - GET /readyz [172.16.0.1]:38788 200 - GET /readyz [172.16.0.1]:38790 200 - GET /readyz 6:26AM DBG [WatchDog] Watchdog checks for busy connections 6:26AM DBG [WatchDog] Watchdog checks for idle connections ```
Successful inference with the `text` model (`llama.cpp` backend)
Details
``` 6:30AM DBG Request received: 6:30AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:30AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:30AM DBG Prompt (before templating): USER:how are you? 6:30AM DBG Template found, input modified to: Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:30AM DBG Prompt (after templating): Below is an instruction that describes a task. 
Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:30AM INF Loading model 'luna-ai-llama2-uncensored.Q4_K_M.gguf' with backend llama 6:30AM DBG llama-cpp is an alias of llama-cpp 6:30AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf' 6:30AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf 6:30AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama-cpp): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00027c960 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:30AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp 6:30AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: '127.0.0.1:46101' 6:30AM DBG GRPC Service state dir: /tmp/go-processmanager3514322108 6:30AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46101: connect: connection refused" 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stdout Server listening on 127.0.0.1:46101 6:30AM DBG GRPC Service Ready 6:30AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:33 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr ggml_init_cublas: found 1 CUDA devices:
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/luna-ai-llama2-uncensored.Q4_K_M.gguf (version GGUF V2)
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
.........
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 0: general.architecture str = llama
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 1: general.name str = tap-m_luna-ai-llama2-uncensored
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 2: llama.context_length u32 = 2048
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 10: general.file_type u32 = 15
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
", "", "<0x00>", "<... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - kv 18: general.quantization_version u32 = 2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type f32: 65 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type q4_K: 193 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_model_loader: - type q6_K: 33 tensors 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ). 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: format = GGUF V2 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: arch = llama ....... 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: ggml ctx size = 0.11 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: using CUDA for GPU acceleration 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: mem required = 70.42 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading 32 repeating layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloading non-repeating layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: offloaded 33/33 layers to GPU 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llm_load_tensors: VRAM used: 3820.93 MiB [172.16.0.1]:41296 200 - GET /readyz [172.16.0.1]:41294 200 - GET /readyz 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr .................................................................................................. 
6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: n_ctx = 2000 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: freq_base = 10000.0 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: freq_scale = 1 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_kv_cache_init: VRAM kv self = 1000.00 MB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: KV self size = 1000.00 MiB, K (f16): 500.00 MiB, V (f16): 500.00 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_build_graph: non-view tensors processed: 676/676 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: compute buffer total size = 156.10 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: VRAM scratch buffer: 152.91 MiB 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr llama_new_context_with_model: total VRAM used: 4973.84 MiB (model: 3820.93 MiB, context: 1152.91 MiB) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr Available slots: 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr -> Slot 0 - max context: 2000 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 is processing [task id: 0] 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 : kv cache rm - [0, end) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: prompt eval time = 220.71 ms / 49 tokens ( 4.50 ms per token, 222.01 tokens per second) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: eval time = 658.47 ms / 15 runs ( 43.90 ms per token, 22.78 tokens per second) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr print_timings: total time = 879.18 ms 6:30AM DBG Response: {"created":1703654745,"object":"chat.completion","id":"e99ce4f3-7475-46ad-b98b-682f557a67ea","model":"text","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I'm doing well, thank you for asking. How about you?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}} [172.16.2.70]:58482 200 - POST /v1/chat/completions 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr slot 0 released (65 tokens in cache) 6:30AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:46101): stderr all slots are idle and system prompt is empty, clear the KV cache ```successful inference using
the `image` model (`diffusers` backend)
Details
``` 6:34AM DBG Request received: 6:34AM DBG Loading model: image 6:34AM DBG Parameter Config: &{PredictionOptions:{Model:SG161222/Realistic_Vision_V4.0_noVAE Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:image F16:true Threads:4 Debug:true Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[pink horse|bad art, ugly face, messed up face, poorly drawn hands, bad hands, professional photo shoot, makeup, photoshop, doll, plastic_doll, silicone, anime, cartoon, fake, filter, airbrush, 3d max, infant, featureless, colourless, impassive, shaders, Watermark, Text, censored, deformed, bad anatomy, disfigured, poorly drawn face, mutated, extra limb, ugly, poorly drawn hands, missing limb, floating limbs, disconnected limbs, disconnected head, malformed hands, long neck, mutated hands and fingers, bad hands, missing fingers, cropped, worst quality, low quality, mutation, poorly drawn, huge calf, bad hands, fused hand, missing hand, disappearing arms, disappearing thigh, disappearing calf, disappearing legs, missing fingers, fused fingers, abnormal eye proportion, Abnormal hands, abnormal legs, abnormal feet, abnormal fingers] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType: SchedulerType:k_dpmpp_sde EnableParameters: CFGScale:2.5 IMG2IMG:false ClipSkip:1 ClipModel: ClipSubFolder: ControlNet:} Step:21 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:34AM INF Loading model 'SG161222/Realistic_Vision_V4.0_noVAE' with backend diffusers 6:34AM DBG Stopping all backends except 'SG161222/Realistic_Vision_V4.0_noVAE' 6:34AM DBG Loading model in memory from file: /models/SG161222/Realistic_Vision_V4.0_noVAE 6:34AM DBG Loading Model SG161222/Realistic_Vision_V4.0_noVAE with gRPC (file: /models/SG161222/Realistic_Vision_V4.0_noVAE) (backend: diffusers): {backendString:diffusers model:SG161222/Realistic_Vision_V4.0_noVAE threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00027c960 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh 
transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:34AM DBG Loading external backend: /build/backend/python/diffusers/run.sh 6:34AM DBG Loading GRPC Process: /build/backend/python/diffusers/run.sh 6:34AM DBG GRPC Service for SG161222/Realistic_Vision_V4.0_noVAE will be running at: '127.0.0.1:45251' 6:34AM DBG GRPC Service state dir: /tmp/go-processmanager3795930822 6:34AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45251: connect: connection refused" 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`. 0it [00:00, ?it/s]SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr 6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Server started. Listening on: 127.0.0.1:45251 6:34AM DBG GRPC Service Ready 6:34AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:SG161222/Realistic_Vision_V4.0_noVAE ContextSize:0 Seed:0 NBatch:0 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:0 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/SG161222/Realistic_Vision_V4.0_noVAE Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType:k_dpmpp_sde CUDA:true CFGScale:2.5 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:1 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Loading model SG161222/Realistic_Vision_V4.0_noVAE...
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Request Model: "SG161222/Realistic_Vision_V4.0_noVAE"
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr ModelFile: "/models/SG161222/Realistic_Vision_V4.0_noVAE"
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr SchedulerType: "k_dpmpp_sde"
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CUDA: true
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CFGScale: 2.5
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr CLIPSkip: 1
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr
model_index.json: 100%|██████████| 513/513 [00:00<00:00, 4.12MB/s]
6:34AM DBG [WatchDog] Watchdog checks for busy connections
6:34AM DBG [WatchDog] 127.0.0.1:45251: active connection
6:34AM DBG [WatchDog] Watchdog checks for idle connections
(…)ature_extractor/preprocessor_config.json: 100%|██████████| 520/520 [00:00<00:00, 4.90MB/s]
tokenizer/special_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 3.98MB/s]
text_encoder/config.json: 100%|██████████| 612/612 [00:00<00:00, 5.47MB/s]
scheduler/scheduler_config.json: 100%|██████████| 725/725 [00:00<00:00, 6.99MB/s]?B/s]
unet/config.json: 100%|██████████| 1.61k/1.61k [00:00<00:00, 16.5MB/s]]?B/s]
tokenizer/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 1.06MB/s]
tokenizer/tokenizer_config.json: 100%|██████████| 737/737 [00:00<00:00, 7.54MB/s]
vae/config.json: 100%|██████████| 582/582 [00:00<00:00, 4.82MB/s]8MB/s]]s]
tokenizer/vocab.json: 100%|██████████| 1.06M/1.06M [00:00<00:00, 2.09MB/s] ?B/s]
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr .6MB/s]
[172.16.0.1]:51272 200 - GET /readyzus-127.0.0.1:45251): stderr 7MB/s] 0<02:46, 20.6MB/s]
model.safetensors: 100%|██████████| 492M/492M [00:11<00:00, 41.3MB/s]:01<01:14, 45.0MB/s]
6:34AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr 8MB/s]B/s]
[172.16.0.1]:50286 200 - GET /readyzus-127.0.0.1:45251): stderr 7MB/s]1<01:26, 34.8MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████| 335M/335M [00:11<00:00, 28.9MB/s]]
model.safetensors: 100%|██████████| 1.22G/1.22G [00:20<00:00, 59.1MB/s]
Fetching 14 files: 21%|██▏ | 3/14 [00:21<01:24, 7.70s/it]4MB/s]<00:46, 63.8MB/s]
6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr G [00:19<00:23, 96.1MB/s]
[172.16.0.1]:33770 200 - GET /readyz 46%|████▌ | 1.57G/3.44G [00:21<00:08, 212MB/s]
6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr [00:11<00:00, 41.6MB/s]
6:35AM DBG [WatchDog] Watchdog checks for busy connectionsG/3.44G [00:28<00:01, 227MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████| 3.44G/3.44G [00:30<00:00, 112MB/s]
Fetching 14 files: 100%|██████████| 14/14 [00:31<00:00, 2.28s/it]
6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr Keyword arguments {'guidance_scale': 2.5} are not expected by diffusersPipeline and will be ignored.
Loading pipeline components...: 0%| | 0/6 [00:00, ?it/s]/opt/conda/envs/diffusers/lib/python3.11/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
6:35AM DBG GRPC(SG161222/Realistic_Vision_V4.0_noVAE-127.0.0.1:45251): stderr warnings.warn(
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 18.79it/s]
[172.16.0.1]:35648 200 - GET /readyz
[172.16.0.1]:35646 200 - GET /readyz
[172.16.0.1]:33930 200 - GET /readyz
[172.16.0.1]:33932 200 - GET /readyz
100%|██████████| 21/21 [00:14<00:00, 1.46it/s]1:45251): stderr
6:35AM DBG Response: {"created":1703655337,"id":"f7327e36-d728-4045-aa82-6a7ae7a71445","data":[{"embedding":null,"index":0,"url":"https://ai.mydomain.com/generated-images/b644253236973.png"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.16.2.2]:49278 200 - POST /v1/images/generations
[172.16.2.70]:57350 200 - GET /generated-images/b644253236973.png
```
`local-ai` logs when sending a new query with a different backend, after the `diffusers` backend has been loaded once.
Details
``` 6:16AM DBG Request received: 6:16AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:16AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:text F16:true Threads:4 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat ChatMessage: Completion:luna-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:33 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false} 6:16AM DBG Prompt (before templating): USER:how are you? 6:16AM DBG Template found, input modified to: Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:16AM DBG Prompt (after templating): Below is an instruction that describes a task. 
Write a response that appropriately completes the request. ### Instruction: USER:how are you? ASSISTANT: ### Response: 6:16AM INF Loading model 'luna-ai-llama2-uncensored.Q4_K_M.gguf' with backend llama 6:16AM DBG llama-cpp is an alias of llama-cpp 6:16AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf' 6:16AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf 6:16AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama-cpp): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000392b40 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false} 6:16AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp 6:16AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: '127.0.0.1:34401' 6:16AM DBG GRPC Service state dir: /tmp/go-processmanager2101771857 6:16AM DBG GRPC Service Started rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34401: connect: connection refused" 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stdout Server listening on 127.0.0.1:34401 [172.16.0.1]:55300 200 - GET /readyz [172.16.0.1]:55298 200 - GET /readyz 6:16AM DBG GRPC Service Ready 6:16AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:2000 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:33 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr ggml_init_cublas: found 1 CUDA devices:
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/luna-ai-llama2-uncensored.Q4_K_M.gguf (version GGUF V2)
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
.......
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 0: general.architecture str = llama
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 1: general.name str = tap-m_luna-ai-llama2-uncensored
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 2: llama.context_length u32 = 2048
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 10: general.file_type u32 = 15
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "
", "", "<0x00>", "<... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 0 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - kv 18: general.quantization_version u32 = 2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type f32: 65 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type q4_K: 193 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llama_model_loader: - type q6_K: 33 tensors 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_vocab: special tokens definition check successful ( 259/32000 ). 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: format = GGUF V2 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: arch = llama ...... 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: ggml ctx size = 0.11 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: using CUDA for GPU acceleration 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: mem required = 70.42 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloading 32 repeating layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloading non-repeating layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: offloaded 33/33 layers to GPU 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr llm_load_tensors: VRAM used: 3820.93 MiB 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr .............................. 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr CUDA error 2 at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: out of memory 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr current device: 0 6:16AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:34401): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8960: !"CUDA error" [172.16.2.70]:60382 500 - POST /v1/chat/completions ```Additional context I waited some time before opening the ticket:
`local-ai` has been working perfectly so far, and I only discovered this after I added some `diffusers` models and started playing with them. I had high hopes after discovering `SINGLE_ACTIVE_BACKEND` in the repo's issues: while switching from `text` to `image` now works (it didn't before), after invoking `diffusers` just once I get stuck as described.

I also discovered the various watchdog settings, and had high hopes there as well, but as mentioned they don't seem to fix it. It goes without saying that I started with neither `SINGLE_ACTIVE_BACKEND` nor the watchdog and debugged from there, so I have already tried the various combinations, to no avail.

I can share more details if needed. Thanks again for this amazing app.