Closed: kirthi-exe closed this issue 4 months ago
@kirthi-exe this is expected, as you are enabling SINGLE_ACTIVE_BACKEND=true. That setting allows only a single model to be loaded at a time: as soon as a request for another model comes in, the others are unloaded.
This feature is intended for small GPUs that can only hold one model in memory at a time.
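Based on the comment above, keeping both the text and image models resident would require turning that flag off in the environment file. A minimal sketch (the variable name is the one referenced above; the value is illustrative):

```shell
# .env — sketch only: disable single-active-backend mode so that loading a
# second model does not unload the one already cached in GPU memory.
SINGLE_ACTIVE_BACKEND=false
```

With this disabled, each backend keeps its model loaded, so GPU memory usage grows with the number of distinct models rather than being capped at one.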
Greetings everyone,
I've been working on integrating LocalAI into Nextcloud via the Nextcloud app. Everything proceeded smoothly until a peculiar issue emerged: whenever I generate an image, the model is cached in vGPU memory, and when I generate text, that model is cached in the vGPU as well. However, when I generate another image, the cached model isn't reused; instead, an identical copy of the model is loaded into the cache again. Strangely, the text model gets purged from the cache.
This repeated caching fills the GPU with duplicate image models during image creation, eventually leading to crashes. The issue persists despite setting parallel requests to true, with llamacpp_parallel=1 and python_grpc_max_workers=1, which should allow only one model instance to be cached and reused.
My development environment is a Proxmox VM with 16 GB RAM, 64 CPU cores, and an NVIDIA L4 Tensor GPU with 24 GB of memory. I'm using the image v2.6.1-cublas-cuda12-ffmpeg and the Nextcloud app available at https://apps.nextcloud.com/apps/integration_openai.
My Environment File:
My Docker Compose File:
After seeking assistance on Discord, I was directed to report this as a bug, so I'm reaching out here for further insight. Any guidance or assistance would be greatly appreciated.