Open t3hk0d3 opened 5 months ago
There are already two endpoints allowing to do this - however, they lack documentation:
@mudler Are these endpoints functional?
curl -vvvv http://192.168.5.210:8080/backend/monitor
* Trying 192.168.5.210:8080...
* Connected to 192.168.5.210 (192.168.5.210) port 8080
> GET /backend/monitor HTTP/1.1
> Host: 192.168.5.210:8080
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 422 Unprocessable Entity
< Date: Mon, 27 May 2024 15:56:42 GMT
< Content-Type: application/json
< Content-Length: 65
<
* Connection #0 to host 192.168.5.210 left intact
{"error":{"code":422,"message":"Unprocessable Entity","type":""}}%
server log:
api-1 | 3:56PM WRN Client error error="Unprocessable Entity" ip=192.168.10.168 latency="82.064µs" method=GET status=422 url=/backend/monitor
Also, is there an endpoint to list currently running backends? As far as I've understood, GET /backend/monitor works with just a single model.
did you pass a model?
Does the model name need to be passed in the body of the GET request as JSON? 🤔
I've tried to pass it as a query param, but without any luck.
* Connection #0 to host 192.168.5.210 left intact
{
  "created": 1716374182,
  "object": "chat.completion",
  "id": "faedddc0-14aa-4bdb-9df4-e3ecba5ecd5d",
  "model": "gpt-4-vision-preview",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "yes</s>"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 2,
    "total_tokens": 3
  }
}
❯ curl -vvvv -XGET -d '{"model":"gpt-4-vision-preview"}' http://192.168.5.210:8080/backend/monitor
* Trying 192.168.5.210:8080...
* Connected to 192.168.5.210 (192.168.5.210) port 8080
> GET /backend/monitor HTTP/1.1
> Host: 192.168.5.210:8080
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Length: 32
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 32 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 27 May 2024 16:07:30 GMT
< Content-Type: application/json
< Content-Length: 81
<
* Connection #0 to host 192.168.5.210 left intact
{"error":{"code":500,"message":"backend .bin is not currently loaded","type":""}}% ❯ nvidia-smi
Mon May 27 18:07:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78 Driver Version: 550.78 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 44C P2 55W / 280W | 6493MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 688 G /usr/lib/Xorg 4MiB |
| 0 N/A N/A 1913268 C ..._data/backend-assets/grpc/llama-cpp 6484MiB |
+-----------------------------------------------------------------------------------------+
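One detail worth noting in the request above: curl defaulted to Content-Type: application/x-www-form-urlencoded, so the server may not have parsed the JSON body at all, which could explain the empty model name behind the ".bin" error. Retrying with an explicit JSON content type might behave differently, e.g.:

❯ curl -v -X GET -H "Content-Type: application/json" \
       -d '{"model":"gpt-4-vision-preview"}' \
       http://192.168.5.210:8080/backend/monitor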
cc @dave-gray101
related #2277, also related #1498
As far as I can see, it appears that the watchdog just has a timer? Maybe we could implement a system that unloads the oldest models - maybe something like...
type ModelManager interface {
    LoadModel(modelName string) error
    UnloadOldestModel() error
    CheckGPUMemory() (bool, error)
    RetryLoadModel(modelName string) error
}
and
type DefaultModelManager struct {
    ml           *ModelLoader
    wd           *WatchDog
    loadFailures map[string]int // consecutive load failures per model
}

func NewModelManager(ml *ModelLoader, wd *WatchDog) *DefaultModelManager {
    return &DefaultModelManager{
        ml:           ml,
        wd:           wd,
        loadFailures: make(map[string]int),
    }
}

func (mm *DefaultModelManager) LoadModel(modelName string) error {
    err := mm.ml.LoadModel(modelName, mm.ml.LoadModelFunc)
    if err != nil {
        mm.loadFailures[modelName]++
        log.Error().Err(err).Msgf("Failed to load model %s", modelName)
        return mm.RetryLoadModel(modelName)
    }
    // Successful load clears the failure counter for this model.
    delete(mm.loadFailures, modelName)
    return nil
}

func (mm *DefaultModelManager) UnloadOldestModel() error {
    // Implement logic to identify and unload the oldest model
    return nil
}

func (mm *DefaultModelManager) CheckGPUMemory() (bool, error) {
    // Implement logic to check GPU memory status
    return true, nil
}

func (mm *DefaultModelManager) RetryLoadModel(modelName string) error {
    if mm.loadFailures[modelName] >= 2 {
        log.Warn().Msgf("Stopping attempts to load model %s after 2 failures", modelName)
        return fmt.Errorf("failed to load model %s after 2 attempts", modelName)
    }
    // If GPU memory looks exhausted (or can't be checked), free the oldest model first.
    available, err := mm.CheckGPUMemory()
    if err != nil || !available {
        if err := mm.UnloadOldestModel(); err != nil {
            return err
        }
    }
    return mm.LoadModel(modelName)
}
and then attach it to watchdog?
func (wd *WatchDog) RetryModelLoad(address, modelName string) {
    mm := NewModelManager(wd.ml, wd)
    err := mm.LoadModel(modelName)
    if err != nil {
        log.Error().Err(err).Msgf("Failed to retry load model %s", modelName)
    }
}
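As a rough sketch of what the CheckGPUMemory stub could do (assuming NVIDIA GPUs only, a single GPU, nvidia-smi on the PATH, and an arbitrary "90% used" threshold), it could shell out to nvidia-smi:

// Hypothetical helper behind the CheckGPUMemory stub above.
// Assumes a single NVIDIA GPU and nvidia-smi available on PATH.
import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

func gpuMemoryAvailable() (bool, error) {
    out, err := exec.Command("nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits").Output()
    if err != nil {
        return false, fmt.Errorf("nvidia-smi failed: %w", err)
    }
    // First line looks like "6493, 11264" (MiB used, MiB total).
    line := strings.SplitN(strings.TrimSpace(string(out)), "\n", 2)[0]
    fields := strings.Split(line, ",")
    if len(fields) != 2 {
        return false, fmt.Errorf("unexpected nvidia-smi output: %q", line)
    }
    used, err := strconv.ParseFloat(strings.TrimSpace(fields[0]), 64)
    if err != nil {
        return false, err
    }
    total, err := strconv.ParseFloat(strings.TrimSpace(fields[1]), 64)
    if err != nil || total == 0 {
        return false, fmt.Errorf("could not parse nvidia-smi output: %q", line)
    }
    // Treat the GPU as "available" while less than 90% of its memory is in use.
    return used/total < 0.9, nil
}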
I definitely don't know the code well enough to tell at a glance whether my GPT-abuse here is the best method for this, but it looks like an OK start, IMO.
Is your feature request related to a problem? Please describe.
GPU resources are limited. Once a model is loaded into GPU memory, you can't use any other big model until the previous backends are stopped. Currently, if a backend is loaded and running, it's impossible to unload it using the API. I have to docker exec into the container and kill the backend process.

Describe the solution you'd like
Create an endpoint to unload a specific backend or all backends:
GET /v1/backends - list running backends
DELETE /v1/backends/<backend_id> - unload a backend
DELETE /v1/backends - unload ALL backends
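Purely for illustration (these endpoints don't exist yet, and the backend id shown is hypothetical), usage could look like:

❯ curl http://192.168.5.210:8080/v1/backends                      # list running backends
❯ curl -X DELETE http://192.168.5.210:8080/v1/backends/llama-cpp  # unload one backend
❯ curl -X DELETE http://192.168.5.210:8080/v1/backends            # unload ALL backends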