substratusai / lingo

Lightweight ML model proxy and autoscaler for Kubernetes
https://www.substratus.ai
Apache License 2.0
102 stars 6 forks

Lingo not leader even though there is only 1 replica after long pod uptime #60

Closed: samos123 closed this issue 5 months ago

samos123 commented 6 months ago

A Lingo pod with long uptime seems to lose its leadership and is unable to become leader again automatically. This happened for me on a cluster where the lingo pod had an uptime of 30+ days.

Logs I observed:

2024/01/14 04:24:09 Not leader, doing nothing                                                                   
2024/01/14 04:24:12 Not leader, doing nothing                                                                   
2024/01/14 04:24:15 Not leader, doing nothing                                                                   
2024/01/14 04:24:17 Updating queue size: deployment: mixtral-8x7b-instruct-awq-vllm, replicas: 1, newSize: 50   
2024/01/14 04:24:17 Got host: 10.100.2.4:8080, id: 7f61ce4e-17b5-4729-99c2-fa4560b8e020                         
2024/01/14 04:24:17 Proxying request to host 10.100.2.4:8080: 7f61ce4e-17b5-4729-99c2-fa4560b8e020              
2024/01/14 04:24:17 UpdateState(1, 0, 3)                                                                        
2024/01/14 04:24:18 Not leader, doing nothing                                                                   
2024/01/14 04:24:19 Running completeFunc:  7f61ce4e-17b5-4729-99c2-fa4560b8e020                                 
2024/01/14 04:24:19 Locking queue.list:  7f61ce4e-17b5-4729-99c2-fa4560b8e020                                   
2024/01/14 04:24:19 Item 7f61ce4e-17b5-4729-99c2-fa4560b8e020 inProgress: true                                  
2024/01/14 04:24:19 Did not send completed message:  7f61ce4e-17b5-4729-99c2-fa4560b8e020                       
2024/01/14 04:24:19 Finished completeFunc:  7f61ce4e-17b5-4729-99c2-fa4560b8e020                                
2024/01/14 04:24:21 Not leader, doing nothing                                                                   
2024/01/14 04:24:24 Not leader, doing nothing                                                                   
2024/01/14 04:24:27 Not leader, doing nothing                                                                   
2024/01/14 04:24:30 Not leader, doing nothing                                                                   
2024/01/14 04:24:33 Not leader, doing nothing                                                                   
2024/01/14 04:24:36 Not leader, doing nothing
2024/01/14 04:24:39 Not leader, doing nothing
2024/01/14 04:24:42 Not leader, doing nothing
2024-01-14T04:24:51Z    ERROR   Reconciler error        {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "Deployment": {"name":"mixtral-7x7b-vllm","namespace":"default"}, "namespace": "default", "name": "mixtral-7x7b-vllm", "reconcileID": "765a9744-e0c3-4413-b140-d5b734cd8a2a", "error": "get: Deployment.apps \"mixtral-7x7b-vllm\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024/01/14 04:24:54 Not leader, doing nothing
2024/01/14 04:24:57 Not leader, doing nothing
2024/01/14 04:25:00 Not leader, doing nothing
2024/01/14 04:25:03 Not leader, doing nothing

Pod name: lingo-7496d4c88f-987tn

Deployment details:

k get deployment lingo
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
lingo   1/1     1            1           73d

Symptoms so far: the pod logs "Not leader, doing nothing" every few seconds and never re-acquires the lease.

Not sure what other info would be relevant. I don't have enough logs to go all the way back to whatever introduced the "Not leader, doing nothing" situation. I will see if I can find older logs in Cloud Logging.

nstogner commented 6 months ago

It will take 15-30 seconds for a leader to be elected. If it never gets elected, I would be curious whether the Role/RoleBinding is in place to allow Lingo to operate on kind: Lease objects in its namespace.
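For reference, a minimal Role/RoleBinding granting that lease access might look like the sketch below. All names, the namespace, and the ServiceAccount are illustrative assumptions, not taken from the Lingo chart:

```yaml
# Illustrative only: object names, namespace, and ServiceAccount name
# are assumptions, not taken from the Lingo Helm chart or manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lingo-leader-election
  namespace: default
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lingo-leader-election
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: lingo-leader-election
subjects:
  - kind: ServiceAccount
    name: lingo
    namespace: default
```

A quick way to check whether the bound ServiceAccount has these permissions is kubectl auth can-i update leases.coordination.k8s.io with the pod's ServiceAccount.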

samos123 commented 6 months ago

I restarted the pod and it was able to become leader again:

2024-01-15T05:14:08Z    INFO    setup   Starting leader election
2024-01-15T05:14:08Z    INFO    Starting EventSource    {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "source": "kind source: *v1.Deployment"}
2024-01-15T05:14:08Z    INFO    Starting Controller     {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment"}
2024-01-15T05:14:08Z    INFO    Starting EventSource    {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "source": "kind source: *v1.EndpointSlice"}
2024-01-15T05:14:08Z    INFO    Starting Controller     {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice"}
2024-01-15T05:14:08Z    INFO    Starting EventSource    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-01-15T05:14:08Z    INFO    Starting Controller     {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-01-15T05:14:08Z    INFO    controller-runtime.metrics      Starting metrics server
2024-01-15T05:14:08Z    INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8082", "secure": false}
2024-01-15T05:14:08Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
I0115 05:14:08.265686       1 leaderelection.go:250] attempting to acquire leader lease default/lingo.substratus.ai...
2024/01/15 05:14:08 New leader elected: lingo-7496d4c88f-987tn
2024-01-15T05:14:08Z    INFO    Starting workers        {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "worker count": 1}
2024-01-15T05:14:08Z    INFO    Starting workers        {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "worker count": 1}
2024-01-15T05:14:08Z    INFO    Starting workers        {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "worker count": 1}
2024/01/15 05:14:08 Updating queue size: deployment: kubernetes, replicas: 1, newSize: 50
2024/01/15 05:14:08 UpdateState(0, 0, 30)
2024/01/15 05:14:08 UpdateState(0, 0, 3)
2024/01/15 05:14:11 Not leader, doing nothing
2024/01/15 05:14:14 Not leader, doing nothing
2024/01/15 05:14:17 Not leader, doing nothing
2024/01/15 05:14:23 Not leader, doing nothing                                                                   
I0115 05:14:23.577434       1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai    
2024/01/15 05:14:23 Started leading                                                                             
2024/01/15 05:14:26 Calculating scales for all                                                                  
2024/01/15 05:14:26 Leader active requests for: kubernetes: 0                                                   
2024/01/15 05:14:26 Is leader, autoscaling                                                                      
2024/01/15 05:14:26 Average for deployment: kubernetes: 0 (ceil: 0), current wait count: 0                      
2024/01/15 05:14:26 SetDesiredScale(0)                                                                          
2024/01/15 05:14:29 Calculating scales for all                                                                  
2024/01/15 05:14:29 Leader active requests for: kubernetes: 0                                                   
2024/01/15 05:14:29 Is leader, autoscaling          

I did not have to change permissions. All I did was delete the existing lingo pod. I suspect the leader retry logic isn't working as expected. I have only reproduced this on a long-running Lingo pod (~30 days of uptime).

nstogner commented 6 months ago

What version/commit are you running?

samos123 commented 5 months ago

This is still an issue; I just encountered it again. The version used was ghcr.io/substratusai/lingo:v0.0.6

I restarted the pod again, and this time I saw the following during initial startup:

2024-02-03T05:45:44Z    INFO    Starting EventSource    {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "source": "kind source: *v1.Deployment"}
2024-02-03T05:45:44Z    INFO    Starting Controller     {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment"}
2024/02/03 05:45:44 New leader elected: lingo-7496d4c88f-km8d2
2024-02-03T05:45:45Z    INFO    Starting workers        {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "worker count": 1}
2024/02/03 05:45:45 Updating queue size: deployment: kubernetes, replicas: 1, newSize: 50
2024/02/03 05:45:45 Updating queue size: deployment: mistral-7b-instruct-vllm, replicas: 1, newSize: 50
2024-02-03T05:45:45Z    INFO    Starting workers        {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "worker count": 1}
2024-02-03T05:45:45Z    INFO    Starting workers        {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "worker count": 1}
2024/02/03 05:45:45 UpdateState(1, 0, 30)
2024/02/03 05:45:45 UpdateState(0, 0, 3)
2024/02/03 05:45:47 Not leader, doing nothing
2024/02/03 05:45:50 Not leader, doing nothing
2024/02/03 05:45:53 Not leader, doing nothing
2024/02/03 05:45:56 Not leader, doing nothing
2024/02/03 05:45:59 Not leader, doing nothing
I0203 05:46:01.822158       1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai    
2024/02/03 05:46:01 Started leading                                                                             
2024/02/03 05:46:02 Calculating scales for all                                                                  
2024/02/03 05:46:02 Leader active requests for: kubernetes: 0                                                   
2024/02/03 05:46:02 Leader active requests for: mistral-7b-instruct-vllm: 0                                     
2024/02/03 05:46:02 Is leader, autoscaling                                                                      
2024/02/03 05:46:02 Average for deployment: kubernetes: 0 (ceil: 0), current wait count: 0                      
2024/02/03 05:46:02 SetDesiredScale(0)                                                                          
2024/02/03 05:46:02 Is leader, autoscaling         

samos123 commented 5 months ago

Found the log that shows the lease was lost and afterwards it didn't retry. See logs below:

ERROR 2024-02-03T07:02:08.953269395Z [resource.labels.containerName: lingo] 2024/02/03 07:02:08 Comparing scales, current: 0, desired: 0
ERROR 2024-02-03T07:02:10.365039937Z [resource.labels.containerName: lingo] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.104.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": context deadline exceeded - error from a previous attempt: unexpected EOF
INFO 2024-02-03T07:02:10.365088877Z [resource.labels.containerName: lingo] failed to renew lease default/lingo.substratus.ai: timed out waiting for the condition
ERROR 2024-02-03T07:02:10.365095237Z [resource.labels.containerName: lingo] 2024/02/03 07:02:10 Stopped leading
ERROR 2024-02-03T07:05:11.863969093Z [resource.labels.containerName: lingo] 2024/02/03 07:05:11 UpdateState(0, 0, 30)