mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

docs: Offload/stop backend API #2370

Open t3hk0d3 opened 4 months ago

t3hk0d3 commented 4 months ago

Is your feature request related to a problem? Please describe. GPU resources are limited. Once a model is loaded into GPU memory, you can't use any other big model until the previous backends are stopped. Currently, if a backend is loaded and running, it is impossible to unload it using the API. I have to docker exec into the container and kill the backend process.

Describe the solution you'd like Create an endpoint to unload a specific backend or all backends, for example:

GET /v1/backends - list running backends
DELETE /v1/backends/<backend_id> - unload a backend
DELETE /v1/backends - unload ALL backends
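
Something like this, as a rough usage sketch (hypothetical routes and backend id, just for illustration):

# list running backends
curl http://localhost:8080/v1/backends

# unload one backend by its id
curl -X DELETE http://localhost:8080/v1/backends/llama-cpp-1

# unload ALL backends
curl -X DELETE http://localhost:8080/v1/backends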

mudler commented 4 months ago

there are already two endpoints that allow doing this - however they lack documentation:

https://github.com/mudler/LocalAI/blob/7efa8e75d47ab9be155ebf46c734225a7fcbdff7/core/http/routes/localai.go#L56
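
A hedged sketch of calling them, assuming from the linked route file that the two endpoints are GET /backend/monitor and POST /backend/shutdown, and that both expect a JSON body with a model field (host and model name are placeholders):

# check the status of the backend serving a model
curl -X GET http://localhost:8080/backend/monitor \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4-vision-preview"}'

# stop/unload that backend
curl http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4-vision-preview"}'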

t3hk0d3 commented 4 months ago

@mudler Are these endpoints functional?

curl -vvvv http://192.168.5.210:8080/backend/monitor
*   Trying 192.168.5.210:8080...
* Connected to 192.168.5.210 (192.168.5.210) port 8080
> GET /backend/monitor HTTP/1.1
> Host: 192.168.5.210:8080
> User-Agent: curl/8.7.1
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 422 Unprocessable Entity
< Date: Mon, 27 May 2024 15:56:42 GMT
< Content-Type: application/json
< Content-Length: 65
< 
* Connection #0 to host 192.168.5.210 left intact
{"error":{"code":422,"message":"Unprocessable Entity","type":""}}%    

server log:

api-1  | 3:56PM WRN Client error error="Unprocessable Entity" ip=192.168.10.168 latency="82.064µs" method=GET status=422 url=/backend/monitor
t3hk0d3 commented 4 months ago

Also, is there an endpoint to list the currently running backends?

As far as I've understood, GET /backend/monitor works with just a single model.

mudler commented 4 months ago

@mudler Are these endpoints functional?

did you pass a model?

https://github.com/mudler/LocalAI/blob/be8ffbdfcfbf4d7a848ce670e94f37858ad788ca/core/schema/localai.go#L7

t3hk0d3 commented 4 months ago

did you pass a model?

https://github.com/mudler/LocalAI/blob/be8ffbdfcfbf4d7a848ce670e94f37858ad788ca/core/schema/localai.go#L7

Does the model name need to be passed in the body of the GET request as JSON? 🤔 I've tried passing it as a query param, but without any luck.

t3hk0d3 commented 4 months ago

* Connection #0 to host 192.168.5.210 left intact
{
  "created": 1716374182,
  "object": "chat.completion",
  "id": "faedddc0-14aa-4bdb-9df4-e3ecba5ecd5d",
  "model": "gpt-4-vision-preview",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "yes</s>"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 2,
    "total_tokens": 3
  }
}
❯ curl -vvvv -XGET -d '{"model":"gpt-4-vision-preview"}' http://192.168.5.210:8080/backend/monitor
*   Trying 192.168.5.210:8080...
* Connected to 192.168.5.210 (192.168.5.210) port 8080
> GET /backend/monitor HTTP/1.1
> Host: 192.168.5.210:8080
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Length: 32
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 32 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 27 May 2024 16:07:30 GMT
< Content-Type: application/json
< Content-Length: 81
< 
* Connection #0 to host 192.168.5.210 left intact
{"error":{"code":500,"message":"backend .bin is not currently loaded","type":""}}%                                                                                                                                                          ❯ nvidia-smi
Mon May 27 18:07:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   44C    P2             55W /  280W |    6493MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       688      G   /usr/lib/Xorg                                   4MiB |
|    0   N/A  N/A   1913268      C   ..._data/backend-assets/grpc/llama-cpp       6484MiB |
+-----------------------------------------------------------------------------------------+
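
A hedged guess from the output above: the earlier 422 came from a request with no body at all, and the "backend .bin is not currently loaded" 500 suggests the model field was never picked up from the form-encoded body, leaving an empty backend name. Explicitly sending the body as JSON might behave differently (untested sketch):

curl -X GET http://192.168.5.210:8080/backend/monitor \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4-vision-preview"}'
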
mudler commented 4 months ago

cc @dave-gray101

jtwolfe commented 4 months ago

related #2277 also related #1498

As far as I can see, the watchdog just has a timer? Maybe we can implement a system that unloads the oldest loaded models, maybe something like...

type ModelManager interface {
    LoadModel(modelName string) error
    UnloadOldestModel() error
    CheckGPUMemory() (bool, error)
    RetryLoadModel(modelName string) error
}

and

// Sketch only - assumes the usual imports ("fmt", "github.com/rs/zerolog/log")
// plus LocalAI's existing ModelLoader and WatchDog types.
type DefaultModelManager struct {
    ml           *ModelLoader
    wd           *WatchDog
    loadFailures map[string]int // consecutive load failures per model name
}

func NewModelManager(ml *ModelLoader, wd *WatchDog) *DefaultModelManager {
    return &DefaultModelManager{
        ml: ml,
        wd: wd,
        loadFailures: make(map[string]int),
    }
}

// LoadModel tries the underlying loader once; on failure it records the
// failure and hands off to RetryLoadModel.
func (mm *DefaultModelManager) LoadModel(modelName string) error {
    err := mm.ml.LoadModel(modelName, mm.ml.LoadModelFunc)
    if err != nil {
        mm.loadFailures[modelName]++
        log.Error().Err(err).Msgf("Failed to load model %s", modelName)
        return mm.RetryLoadModel(modelName)
    }
    delete(mm.loadFailures, modelName) // success: reset the failure counter
    return nil
}

func (mm *DefaultModelManager) UnloadOldestModel() error {
    // Implement logic to identify and unload the oldest model
    return nil
}

func (mm *DefaultModelManager) CheckGPUMemory() (bool, error) {
    // Implement logic to check GPU memory status
    return true, nil
}

// RetryLoadModel gives up after two consecutive failures for a model; otherwise
// it tries to free GPU memory and attempts the load again.
func (mm *DefaultModelManager) RetryLoadModel(modelName string) error {
    if mm.loadFailures[modelName] >= 2 {
        log.Warn().Msgf("Stopping attempts to load model %s after 2 failures", modelName)
        return fmt.Errorf("failed to load model %s after 2 attempts", modelName)
    }

    // If GPU memory looks exhausted (or can't be checked), evict the oldest model first.
    available, err := mm.CheckGPUMemory()
    if err != nil || !available {
        if err := mm.UnloadOldestModel(); err != nil {
            return err
        }
    }

    return mm.LoadModel(modelName)
}

and then attach it to watchdog?

func (wd *WatchDog) RetryModelLoad(address, modelName string) {
    mm := NewModelManager(wd.ml, wd)
    err := mm.LoadModel(modelName)
    if err != nil {
        log.Error().Err(err).Msgf("Failed to retry load model %s", modelName)
    }
}

I definitely don't know the code well enough to tell at a glance whether my gpt-abuse here is the best method for this, but it looks like an ok start imo.