mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Generates text, audio, video and images, with voice-cloning capabilities.
https://localai.io
MIT License

Duplicate model GPU overload #1784

Closed kirthi-exe closed 4 months ago

kirthi-exe commented 4 months ago

Greetings everyone,

I've been working on integrating LocalAI into Nextcloud via the Nextcloud app. Everything proceeded smoothly until a peculiar issue emerged: whenever I create an image, the image model is loaded into vGPU memory, and when I generate text, the text model is loaded as well. However, on the next image request the already-loaded model isn't reused; instead, a second copy of the same model is loaded into memory, while, strangely, the text model is purged from it.

This repeated loading overloads the GPU with duplicate copies of the image model during image creation and eventually crashes it. Even after setting PARALLEL_REQUESTS=true with LLAMACPP_PARALLEL=1 and PYTHON_GRPC_MAX_WORKERS=1, which I expected to keep a single model loaded and reused, the issue persists.
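The request pattern behind this boils down to alternating image and chat calls against the OpenAI-compatible API. A minimal standalone sketch of that pattern (the model names "stablediffusion" and "gpt-4" are placeholders here, not necessarily what the Nextcloud app actually sends):

import requests

BASE = "http://127.0.0.1:8080/v1"  # LocalAI's OpenAI-compatible API

def generate_image(prompt):
    # POST /v1/images/generations, same request shape as the OpenAI Images API
    r = requests.post(f"{BASE}/images/generations",
                      json={"model": "stablediffusion", "prompt": prompt, "size": "512x512"})
    r.raise_for_status()
    return r.json()

def chat(prompt):
    # POST /v1/chat/completions, same request shape as the OpenAI Chat API
    r = requests.post(f"{BASE}/chat/completions",
                      json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]})
    r.raise_for_status()
    return r.json()

# Alternating the two kinds of request matches the behaviour described above:
# each image call loads another copy of the image model into vGPU memory
# instead of reusing the copy that is already resident.
generate_image("a lighthouse at night")
chat("Summarize this note for me.")
generate_image("a lighthouse at dawn")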

My development environment is a Proxmox VM with 16 GB RAM, 64 CPU cores, and an NVIDIA L4 Tensor GPU with 24 GB of memory. I'm using the v2.6.1-cublas-cuda12-ffmpeg image and the Nextcloud app available at https://apps.nextcloud.com/apps/integration_openai.

My Environment File:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=64

## Specify a different bind address (defaults to ":8080")
## ADDRESS=127.0.0.1:8080

## Default models context size
CONTEXT_SIZE=2048

## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"name":"huggingface", "url":"github:go-skynet/model-gallery/huggingface.yaml"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (Diffusers)
COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
# BUILD_TYPE=openblas
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
# REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
#
GO_TAGS=stablediffusion

## Path where to store generated images
IMAGE_PATH=/tmp/generated/images

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### Those are not really used by LocalAI, but from components in the stack ###
##
### Preload libraries
# LD_PRELOAD=

### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
PARALLEL_REQUESTS=true

### Watchdog settings
###
# Enables watchdog to kill backends that are inactive for too much time
WATCHDOG_IDLE=true

# Enables watchdog to kill backends that are busy for too much time
WATCHDOG_BUSY=true

# Time in duration format (e.g. 1h30m) after which a backend is considered idle
WATCHDOG_IDLE_TIMEOUT=30m

# Time in duration format (e.g. 1h30m) after which a backend is considered busy
WATCHDOG_BUSY_TIMEOUT=5m

My Docker Compose File:

version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:v2.6.1-cublas-cuda12-ffmpeg
    build:
      context: .
      dockerfile: Dockerfile
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always
    ports:
      - 127.0.0.1:8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
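
For anyone trying to reproduce this: the growing memory usage from the duplicate image models can be watched from inside the container, e.g. with `docker compose exec api nvidia-smi` (or `docker-compose exec`, depending on the setup; `api` is the service name from the compose file above).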

I asked for help on Discord and was directed to report this as a bug, so I'm reaching out here for further insights and solutions. Any guidance or assistance would be greatly appreciated.

mudler commented 4 months ago

@kirthi-exe this is expected, since you are enabling SINGLE_ACTIVE_BACKEND=true. That keeps only a single model loaded: as soon as a request for another model comes in, the previously loaded one is unloaded.

This is a feature meant for small GPUs, where you can realistically have only one model loaded at a time.
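
In other words, with 24 GB of VRAM you likely don't need it at all. A rough sketch of how the relevant part of the .env above could look instead (the watchdog values are just the ones already configured there):

## Allow more than one backend to stay resident
## (assuming both the image and the text model fit in the L4's 24 GB)
SINGLE_ACTIVE_BACKEND=false

## Keep the watchdog so backends that sit idle still get unloaded
WATCHDOG_IDLE=true
WATCHDOG_IDLE_TIMEOUT=30m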