
diffuser backend processes stack up and hog GPU memory #2866

Open · greygoo opened 2 months ago

greygoo commented 2 months ago

LocalAI version:

quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Environment, CPU architecture, OS, and Version:

Linux laboratory-vmhost 6.9.7-1-default #1 SMP PREEMPT_DYNAMIC Fri Jun 28 05:50:47 UTC 2024 (a5efffa) x86_64 x86_64 x86_64 GNU/Linux

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       48 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              16

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti    Off  |   00000000:01:00.0  On |                  N/A |
|  0%   44C    P8             11W / 165W  |     982MiB / 16380MiB  |      34%     Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

docker-compose.yaml:

services:
  api:
    #image: quay.io/go-skynet/local-ai:2.18-cpu
    #image: quay.io/go-skynet/local-ai:v2.17.0-cublas-cuda12
    #image: quay.io/go-skynet/local-ai:aio-cublas-cuda12-ffmpeg
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
    #image: quay.io/go-skynet/local-ai:latest-cpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 10m
      retries: 20
    ports:
      - 8080:8080
    env_file:
      - .env
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - SINGLE_ACTIVE_BACKEND=true
      - PARALLEL_REQUESTS=false
      - WATCHDOG_IDLE=true
      - WATCHDOG_BUSY=true
      - WATCHDOG_IDLE_TIMEOUT=5m
      - WATCHDOG_BUSY_TIMEOUT=5m
    volumes:
      - ./models:/models:cached
    command: ["/usr/bin/local-ai" ]
  chatgpt_telegram_bot:
    container_name: chatgpt_telegram_bot
    command: python3 bot/bot.py
    restart: always
    environment:
      - OPENAI_API_KEY=sk---1234567890
      - OPENAI_API_BASE=http://api:8080/v1
    build:
      context: "."
      dockerfile: Dockerfile
    depends_on:
      api:
        condition: service_healthy       

.env file:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=8

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Define galleries.
## Models to install will be visible in `/models/available`
#GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]'

## Default path for models
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (lets Stable Diffusion work)
COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=false

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
GO_TAGS=stablediffusion,tts

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

# HUGGINGFACEHUB_API_TOKEN=Token here

Describe the bug

When using models that run on the diffusers backend, the backend processes claim memory on the GPU but keep running after the image is generated, and thus never free the VRAM they use.

Here is a screenshot of nvtop after generating 4 images with 2 models that use the diffusers backend: Screenshot_20240715_005523

To Reproduce

Start LocalAI and use an image model to generate a picture with the diffusers backend on the GPU. Watch the processes and GPU memory consumption with nvtop. Switch to another image model and generate another image, then switch back to the first model and generate one more image. You will see the diffusers backend processes stack up and eventually eat up all GPU memory. A minimal script for this is sketched below.
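A minimal reproduction sketch in Python (assumptions: LocalAI reachable at localhost:8080, and two diffusers-backed models configured under the names dreamshaper and animagine, matching the logs further down; the endpoint and request shape are the ones visible in the debug logs):

import json
import urllib.request

def generate(model: str, prompt: str = "a tree") -> str:
    """POST to /v1/images/generations and return the generated image URL."""
    req = urllib.request.Request(
        "http://localhost:8080/v1/images/generations",
        data=json.dumps({"model": model, "prompt": prompt, "size": "512x512"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["url"]

# Alternate between the two models; with SINGLE_ACTIVE_BACKEND=true each
# switch should stop the previous diffusers process, but nvtop instead
# shows a new backend.py process piling up on every switch.
for model in ("dreamshaper", "animagine", "dreamshaper"):
    print(model, "->", generate(model))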

Expected behavior

After an image is generated, or before another model is used, the unused diffusers backend process gets stopped, freeing the memory.

Logs

Additional context

lunamidori5 commented 1 month ago

Updated tags per @greygoo's request

mudler commented 1 month ago

mm, this looks like something more on the diffusers side of things - but I see SINGLE_ACTIVE_BACKEND=true, so local-ai should have killed the backend between the calls. Maybe related to #2720? It's not in a tagged release yet; did you try a (recent enough) image with the above fix?

lunamidori5 commented 1 month ago

happens on my AIO image freshly updated from the latest tag @mudler

mudler commented 1 month ago

happens on my AIO image freshly updated from the latest tag @mudler

Latest tag doesn't have the fix yet, only master images

greygoo commented 1 month ago

Will try with master and report

greygoo commented 1 month ago

Tried with master. It still fills up VRAM and keeps processes running. What I noticed was this error that appeared after the second generation:

api-1                 | 10:07PM DBG [WatchDog] Watchdog checks for busy connections
api-1                 | 10:07PM DBG [WatchDog] Watchdog checks for idle connections
api-1                 | 10:07PM DBG [WatchDog] 127.0.0.1:35047: idle connection
api-1                 | 10:07PM WRN [WatchDog] Address 127.0.0.1:35047 is idle for too long, killing it
api-1                 | 10:07PM ERR [watchdog] error shutting down model error="model DreamShaper_8_pruned.safetensors not found" model=DreamShaper_8_pruned.safetensors
api-1                 | 10:07PM DBG [WatchDog] model shut down: 127.0.0.1:35047
api-1                 | 10:07PM DBG [WatchDog] 127.0.0.1:40201: idle connection

log:

api-1                 | 10:02PM DBG Request received: {"model":"dreamshaper","language":"","translate":false,"n":1,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"512x512","prompt":"a tree","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1                 | 10:02PM DBG Loading model: dreamshaper
api-1                 | 10:02PM DBG guessDefaultsFromFile: not a GGUF file
api-1                 | 10:02PM DBG Parameter Config: &{PredictionOptions:{Model:DreamShaper_8_pruned.safetensors Language: Translate:false N:0 TopP:0xc0015ce958 TopK:0xc0015ce960 Temperature:0xc0015ce968 Maxtokens:0xc0015ce998 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0015ce990 TypicalP:0xc0015ce988 Seed:0xc0015ce9b0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:dreamshaper F16:0xc0015ce93a Threads:0xc0015ce948 Debug:0xc000658590 Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[a tree] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0015ce980 MirostatTAU:0xc0015ce978 Mirostat:0xc0015ce970 NGPULayers:0xc0015ce9a0 MMap:0xc0015ce9a8 MMlock:0xc0015ce9a9 LowVRAM:0xc0015ce9a9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0015ce940 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType:StableDiffusionPipeline SchedulerType:k_dpmpp_2m EnableParameters:negative_prompt,num_inference_steps CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:40 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
api-1                 | 10:02PM INF Loading model 'DreamShaper_8_pruned.safetensors' with backend diffusers
api-1                 | 10:02PM DBG Stopping all backends except 'DreamShaper_8_pruned.safetensors'
api-1                 | 10:02PM DBG Loading model in memory from file: /models/DreamShaper_8_pruned.safetensors
api-1                 | 10:02PM DBG Loading Model DreamShaper_8_pruned.safetensors with gRPC (file: /models/DreamShaper_8_pruned.safetensors) (backend: diffusers): {backendString:diffusers model:DreamShaper_8_pruned.safetensors threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000335d48 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
api-1                 | 10:02PM DBG Loading external backend: /build/backend/python/diffusers/run.sh
api-1                 | 10:02PM DBG Loading GRPC Process: /build/backend/python/diffusers/run.sh
api-1                 | 10:02PM DBG GRPC Service for DreamShaper_8_pruned.safetensors will be running at: '127.0.0.1:35047'
api-1                 | 10:02PM DBG GRPC Service state dir: /tmp/go-processmanager1805089604
api-1                 | 10:02PM DBG GRPC Service Started
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stdout Initializing libbackend for build
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stdout virtualenv activated
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stdout activated virtualenv has been ensured
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr /build/backend/python/diffusers/backend_pb2_grpc.py:21: RuntimeWarning: The grpc package installed is at version 1.64.0, but the generated code in backend_pb2_grpc.py depends on grpcio>=1.64.1. Please upgrade your grpc module to grpcio>=1.64.1 or downgrade your generated code using grpcio-tools<=1.64.0. This warning will become an error in 1.65.0, scheduled for release on June 25, 2024.
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr   warnings.warn(
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr   warnings.warn(
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr   deprecate("VQEncoderOutput", "0.31", deprecation_message)
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr   deprecate("VQModel", "0.31", deprecation_message)
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Server started. Listening on: 127.0.0.1:35047
api-1                 | 10:02PM DBG GRPC Service Ready
api-1                 | 10:02PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:DreamShaper_8_pruned.safetensors ContextSize:512 Seed:542789979 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/DreamShaper_8_pruned.safetensors Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType:StableDiffusionPipeline SchedulerType:k_dpmpp_2m CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Loading model DreamShaper_8_pruned.safetensors...
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Request Model: "DreamShaper_8_pruned.safetensors"
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr ContextSize: 512
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Seed: 542789979
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr NBatch: 512
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr F16Memory: true
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr MMap: true
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr NGPULayers: 99999999
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Threads: 8
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr ModelFile: "/models/DreamShaper_8_pruned.safetensors"
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr PipelineType: "StableDiffusionPipeline"
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr SchedulerType: "k_dpmpp_2m"
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr CUDA: true
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 120778.39it/s]
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Some weights of the model checkpoint were not used when initializing CLIPTextModel: ['text_model.embeddings.position_ids']
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 18.04it/s]
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
api-1                 | 10:02PM DBG GRPC(DreamShaper_8_pruned.safetensors-127.0.0.1:35047): stderr 100%|██████████| 40/40 [00:03<00:00, 12.61it/s]
api-1                 | 10:02PM DBG Response: {"created":1721426565,"id":"4170d7fe-d823-4671-8396-eb608242ca5a","data":[{"embedding":null,"index":0,"url":"http://127.0.0.1:8080/generated-images/b641998076109.png"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
  2. generate image with animagine

processes:

  47654    root   0  Compute   0%   6138MiB  37%     0%   2510MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:40201                                                                            
  47315    root   0  Compute   0%   3096MiB  19%     0%   1308MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:35047

log:

api-1                 | 10:07PM DBG Request received: {"model":"animagine","language":"","translate":false,"n":1,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"512x512","prompt":"a tree","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1                 | 10:07PM DBG Loading model: animagine
api-1                 | 10:07PM DBG guessDefaultsFromFile: not a GGUF file
api-1                 | 10:07PM DBG Parameter Config: &{PredictionOptions:{Model:dreamlike-art/dreamlike-anime-1.0 Language: Translate:false N:0 TopP:0xc00127f538 TopK:0xc00127f540 Temperature:0xc00127f548 Maxtokens:0xc00127f578 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc00127f570 TypicalP:0xc00127f568 Seed:0xc00127f590 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:animagine F16:0xc00127f510 Threads:0xc00127f528 Debug:0xc000deed40 Roles:map[] Embeddings:false Backend:diffusers TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[a tree] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc00127f560 MirostatTAU:0xc00127f558 Mirostat:0xc00127f550 NGPULayers:0xc00127f580 MMap:0xc00127f588 MMlock:0xc00127f589 LowVRAM:0xc00127f589 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc00127f520 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:true PipelineType: SchedulerType:dpm_2_a EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:true DownloadFiles:[] Description: Usage:}
api-1                 | 10:07PM INF Loading model 'dreamlike-art/dreamlike-anime-1.0' with backend diffusers
api-1                 | 10:07PM DBG Stopping all backends except 'dreamlike-art/dreamlike-anime-1.0'
api-1                 | 10:07PM DBG [single-backend] Stopping DreamShaper_8_pruned.safetensors
api-1                 | 10:07PM DBG Loading model in memory from file: /models/dreamlike-art/dreamlike-anime-1.0
api-1                 | 10:07PM DBG Loading Model dreamlike-art/dreamlike-anime-1.0 with gRPC (file: /models/dreamlike-art/dreamlike-anime-1.0) (backend: diffusers): {backendString:diffusers model:dreamlike-art/dreamlike-anime-1.0 threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0005ea248 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
api-1                 | 10:07PM DBG Loading external backend: /build/backend/python/diffusers/run.sh
api-1                 | 10:07PM DBG Loading GRPC Process: /build/backend/python/diffusers/run.sh
api-1                 | 10:07PM DBG GRPC Service for dreamlike-art/dreamlike-anime-1.0 will be running at: '127.0.0.1:40201'
api-1                 | 10:07PM DBG GRPC Service state dir: /tmp/go-processmanager366654360
api-1                 | 10:07PM DBG GRPC Service Started
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stdout Initializing libbackend for build
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stdout virtualenv activated
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stdout activated virtualenv has been ensured
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr /build/backend/python/diffusers/backend_pb2_grpc.py:21: RuntimeWarning: The grpc package installed is at version 1.64.0, but the generated code in backend_pb2_grpc.py depends on grpcio>=1.64.1. Please upgrade your grpc module to grpcio>=1.64.1 or downgrade your generated code using grpcio-tools<=1.64.0. This warning will become an error in 1.65.0, scheduled for release on June 25, 2024.
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr   warnings.warn(
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr   warnings.warn(
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr   deprecate("VQEncoderOutput", "0.31", deprecation_message)
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr /build/backend/python/diffusers/venv/lib/python3.10/site-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr   deprecate("VQModel", "0.31", deprecation_message)
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Server started. Listening on: 127.0.0.1:40201
api-1                 | 10:07PM DBG GRPC Service Ready
api-1                 | 10:07PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:dreamlike-art/dreamlike-anime-1.0 ContextSize:512 Seed:1108918857 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/dreamlike-art/dreamlike-anime-1.0 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType:dpm_2_a CUDA:true CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Loading model dreamlike-art/dreamlike-anime-1.0...
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Request Model: "dreamlike-art/dreamlike-anime-1.0"
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr ContextSize: 512
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Seed: 1108918857
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr NBatch: 512
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr MMap: true
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr NGPULayers: 99999999
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Threads: 8
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr ModelFile: "/models/dreamlike-art/dreamlike-anime-1.0"
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr SchedulerType: "dpm_2_a"
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr CUDA: true
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr Loading pipeline components...: 100%|██████████| 5/5 [00:01<00:00,  4.09it/s]
api-1                 | 10:07PM DBG GRPC(dreamlike-art/dreamlike-anime-1.0-127.0.0.1:40201): stderr 100%|██████████| 15/15 [00:06<00:00,  2.42it/s]
api-1                 | 10:07PM DBG Response: {"created":1721426864,"id":"611a5b22-58b1-4af2-937b-14870376cca1","data":[{"embedding":null,"index":0,"url":"http://127.0.0.1:8080/generated-images/b642841177737.png"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
api-1                 | 10:07PM INF Success ip=172.30.0.1 latency=14.123867925s method=POST status=200 url=/v1/images/generations
api-1                 | 10:07PM INF Success ip=172.30.0.1 latency="79.619µs" method=GET status=200 url=/generated-images/b642841177737.png
  3. select dreamshaper again, and generate another image
  47654    root   0  Compute   0%   6138MiB  37%     0%   2510MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:40201                                                                            
  47315    root   0  Compute   0%   3096MiB  19%     0%   1308MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:35047
  47779    root   0  Compute   0%   3088MiB  19%     0%   1303MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:38795

I'll leave the log out, as it seemed identical to the first one; however, after the generation ran, this log entry appeared again:

api-1                 | 10:12PM DBG [WatchDog] Watchdog checks for busy connections
api-1                 | 10:12PM DBG [WatchDog] Watchdog checks for idle connections
api-1                 | 10:12PM DBG [WatchDog] 127.0.0.1:40201: idle connection
api-1                 | 10:12PM WRN [WatchDog] Address 127.0.0.1:40201 is idle for too long, killing it
api-1                 | 10:12PM ERR [watchdog] error shutting down model error="model dreamlike-art/dreamlike-anime-1.0 not found" model=dreamlike-art/dreamlike-anime-1.0
api-1                 | 10:12PM DBG [WatchDog] model shut down: 127.0.0.1:40201
api-1                 | 10:12PM DBG [WatchDog] 127.0.0.1:38795: idle connection

So it looks like it tries to kill the process, but doesn't manage to do so.
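To illustrate the suspected failure mode (a speculative Python sketch, not LocalAI's actual Go code): if the single-active-backend path drops the model from the loaded-model map without terminating the child process, a later shutdown by model name has nothing to look up, which would produce exactly the "model ... not found" error above while the backend.py process keeps holding VRAM:

import subprocess
from typing import Dict

loaded: Dict[str, subprocess.Popen] = {}  # model name -> backend process

def stop_all_backends_except(keep: str) -> None:
    # Suspected bug analogue: the map entry is removed, but the child
    # process is never terminated, so its GPU allocations survive.
    for name in list(loaded):
        if name != keep:
            loaded.pop(name)

def watchdog_shutdown(name: str) -> None:
    proc = loaded.get(name)
    if proc is None:
        # Corresponds to: ERR [watchdog] error shutting down model
        #                 error="model ... not found"
        print(f'error shutting down model: model {name} not found')
        return  # the orphaned process keeps running
    proc.terminate()
    proc.wait()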

greygoo commented 1 month ago

Just saw one more interesting line in the logs:

api-1                 | 10:16PM DBG [WatchDog] Watchdog checks for busy connections
api-1                 | 10:16PM DBG [WatchDog] Watchdog checks for idle connections
api-1                 | 10:16PM DBG [WatchDog] 127.0.0.1:38795: idle connection
api-1                 | 10:16PM WRN [WatchDog] Address 127.0.0.1:38795 is idle for too long, killing it
api-1                 | 10:16PM DBG [WatchDog] model shut down: 127.0.0.1:38795

The process

  47779    root   0  Compute   0%   3088MiB  19%     0%   1303MiB python /build/backend/python/diffusers/backend.py --addr 127.0.0.1:38795

is still running, though.
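One way to double-check this (a sketch; assumes nvidia-smi is available inside the container) is to list the compute processes that still hold VRAM after the watchdog reports the shutdown:

import subprocess

# Every backend.py line still printed here after a "model shut down"
# message is a leaked process holding GPU memory.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out)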

mudler commented 2 weeks ago

I will try to test this locally soon - I was facing https://github.com/mudler/LocalAI/pull/3377 to start with, which made it impractical to test this scenario. Will report soon-ish.