mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

API endpoint for querying information about a model #1585

Open · russell opened this issue 7 months ago

russell commented 7 months ago

Is your feature request related to a problem? Please describe.
I have just started trying to find and configure some models to run on my 12 GB 3060.

So I go and choose a model; in this case I chose localmodelsorca-mini-v2-13b-ggmlorca_mini_v2_13b.ggmlv3.q4_1.bin.yaml from the model gallery.

The next thing I need to do is set up GPU offloading, because the model gallery has none configured by default. The documentation says I should create a configuration that specifies the number of layers to put on the GPU. So the question is: how many layers does my newly downloaded model have? The Hugging Face page doesn't say, so I have no idea. The LocalAI documentation does offer an option (https://localai.io/features/gpu-acceleration/): it suggests turning on debug mode and then running the model.
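For context, the gpu-acceleration page documents a per-model YAML file with a gpu_layers option. A minimal sketch of such a config, assuming the documented fields (the name and numbers below are illustrative placeholders, not taken from the gallery entry above):

```yaml
# Minimal per-model config sketch using the documented gpu_layers option.
# Name and values are illustrative placeholders.
name: orca-mini-gpu
parameters:
  model: orca_mini_v2_13b.ggmlv3.q4_1.bin
context_size: 2048
f16: true
gpu_layers: 35   # number of layers to offload to the GPU
```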

Describe the solution you'd like
Provide an API to query information about the models, so that I can call that API after I download a model and use that information to configure it.
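No such endpoint exists today; purely to illustrate the request, a hypothetical client call could look like this (the /models/{name}/metadata path and the layer_count field are invented for this sketch and are not part of the LocalAI API):

```python
# Hypothetical sketch of the requested feature: query model metadata after
# download. The endpoint path and response fields are invented for
# illustration and are NOT part of the current LocalAI API.
import requests

BASE_URL = "http://localhost:8080"  # assumed local LocalAI instance

def get_model_metadata(model_name: str) -> dict:
    # Imagined endpoint: GET /models/{name}/metadata
    resp = requests.get(f"{BASE_URL}/models/{model_name}/metadata", timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    meta = get_model_metadata("orca_mini_v2_13b.ggmlv3.q4_1.bin")
    # Such a response could expose e.g. the layer count needed for gpu_layers.
    print(meta.get("layer_count"))
```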

dionysius commented 7 months ago

My workaround is just setting gpu_layers to a big number, even if the model actually has fewer layers than that. So far this has worked fine, but I have only played with smaller models that fit in my VRAM.
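This presumably works because the llama.cpp backend offloads at most the model's actual number of layers, so an oversized value is effectively capped. As a fragment of the per-model YAML, the workaround is just:

```yaml
# Over-provisioned value; the backend offloads at most the model's real
# layer count, so anything larger is effectively capped. 999 is arbitrary.
gpu_layers: 999
```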

Idea: Ollama has a form of autodetection that assigns GPU and CPU layers and tries to max out the GPU. A keyword for that in the model configuration (or making it the default behaviour) would be nice, with explicit values reserved for fine-tuning.
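As a rough illustration of that autodetection idea (not how Ollama or LocalAI actually implement it; the layer sizes and VRAM figures below are hypothetical):

```python
# Rough sketch of the autodetection idea: offload as many layers as fit in
# free VRAM and leave the rest on the CPU. Not the actual Ollama or LocalAI
# logic; all sizes are hypothetical.

def autodetect_gpu_layers(total_layers: int,
                          bytes_per_layer: int,
                          free_vram_bytes: int,
                          reserve_bytes: int = 512 * 1024 * 1024) -> int:
    """Return how many layers to offload, keeping a small VRAM reserve."""
    usable = max(free_vram_bytes - reserve_bytes, 0)
    fits = usable // bytes_per_layer
    return min(total_layers, fits)

if __name__ == "__main__":
    # Illustrative numbers only: a 13B q4 model with 40 layers of ~200 MB
    # each, on a card with 12 GB of free VRAM.
    layers = autodetect_gpu_layers(
        total_layers=40,
        bytes_per_layer=200 * 1024 * 1024,
        free_vram_bytes=12 * 1024**3,
    )
    print(f"gpu_layers: {layers}")  # capped at the model's 40 layers here
```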