It will take 15-30 seconds for a leader to be elected. If one never gets elected, I would be curious whether the Role/RoleBinding is in place to allow Lingo to operate on kind: Lease objects in its namespace.
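For reference, a minimal sketch of the RBAC that lease-based leader election needs. The namespace (default) matches the lease name in the logs below, but the Role/RoleBinding names and the lingo ServiceAccount are illustrative assumptions, not Lingo's actual manifests:

# Illustrative sketch; names and the ServiceAccount are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lingo-leader-election
  namespace: default
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lingo-leader-election
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: lingo-leader-election
subjects:
  - kind: ServiceAccount
    name: lingo
    namespace: default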
I restarted the pod and it was able to become leader again:
2024-01-15T05:14:08Z INFO setup Starting leader election
2024-01-15T05:14:08Z INFO Starting EventSource {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "source": "kind source: *v1.Deployment"}
2024-01-15T05:14:08Z INFO Starting Controller {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment"}
2024-01-15T05:14:08Z INFO Starting EventSource {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "source": "kind source: *v1.EndpointSlice"}
2024-01-15T05:14:08Z INFO Starting Controller {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice"}
2024-01-15T05:14:08Z INFO Starting EventSource {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-01-15T05:14:08Z INFO Starting Controller {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-01-15T05:14:08Z INFO controller-runtime.metrics Starting metrics server
2024-01-15T05:14:08Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8082", "secure": false}
2024-01-15T05:14:08Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
I0115 05:14:08.265686 1 leaderelection.go:250] attempting to acquire leader lease default/lingo.substratus.ai...
2024/01/15 05:14:08 New leader elected: lingo-7496d4c88f-987tn
2024-01-15T05:14:08Z INFO Starting workers {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "worker count": 1}
2024-01-15T05:14:08Z INFO Starting workers {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "worker count": 1}
2024-01-15T05:14:08Z INFO Starting workers {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "worker count": 1}
2024/01/15 05:14:08 Updating queue size: deployment: kubernetes, replicas: 1, newSize: 50
2024/01/15 05:14:08 UpdateState(0, 0, 30)
2024/01/15 05:14:08 UpdateState(0, 0, 3)
2024/01/15 05:14:11 Not leader, doing nothing
2024/01/15 05:14:14 Not leader, doing nothing
2024/01/15 05:14:17 Not leader, doing nothing
2024/01/15 05:14:23 Not leader, doing nothing
I0115 05:14:23.577434 1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai
2024/01/15 05:14:23 Started leading
2024/01/15 05:14:26 Calculating scales for all
2024/01/15 05:14:26 Leader active requests for: kubernetes: 0
2024/01/15 05:14:26 Is leader, autoscaling
2024/01/15 05:14:26 Average for deployment: kubernetes: 0 (ceil: 0), current wait count: 0
2024/01/15 05:14:26 SetDesiredScale(0)
2024/01/15 05:14:29 Calculating scales for all
2024/01/15 05:14:29 Leader active requests for: kubernetes: 0
2024/01/15 05:14:29 Is leader, autoscaling
I did not have to change permissions. All I did was delete the existing Lingo pod. I suspect the leader retry logic isn't working as expected. I have only reproduced this on a long-running Lingo pod (~30 days of uptime).
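For context on the suspected retry bug: client-go's leaderelection.RunOrDie blocks until leadership is lost and then returns, so a process that wants to rejoin the election has to call it again in a loop. A minimal sketch of that pattern, where the lease name default/lingo.substratus.ai comes from the logs but everything else is an assumption about how Lingo might wire it up, not Lingo's actual code:

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // pod name, e.g. lingo-7496d4c88f-987tn

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "lingo.substratus.ai", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	// RunOrDie returns once leadership is lost. Without this outer loop, a
	// pod that loses the lease keeps running but never attempts to
	// re-acquire it, which would match the symptoms described above.
	for {
		leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) { log.Println("Started leading") },
				OnStoppedLeading: func() { log.Println("Stopped leading") },
				OnNewLeader:      func(leader string) { log.Printf("New leader elected: %s", leader) },
			},
		})
		log.Println("leader election ended; re-entering election")
		time.Sleep(2 * time.Second)
	}
}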
What version/commit are you running?
This is still an issue; I just encountered it again. The version used was ghcr.io/substratusai/lingo:v0.0.6.
I restarted the pod again and this time I see this during initial startup:
2024-02-03T05:45:44Z INFO Starting EventSource {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "source": "kind source: *v1.Deployment"}
2024-02-03T05:45:44Z INFO Starting Controller {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment"}
2024/02/03 05:45:44 New leader elected: lingo-7496d4c88f-km8d2
2024-02-03T05:45:45Z INFO Starting workers {"controller": "endpointslice", "controllerGroup": "discovery.k8s.io", "controllerKind": "EndpointSlice", "worker count": 1}
2024/02/03 05:45:45 Updating queue size: deployment: kubernetes, replicas: 1, newSize: 50
2024/02/03 05:45:45 Updating queue size: deployment: mistral-7b-instruct-vllm, replicas: 1, newSize: 50
2024-02-03T05:45:45Z INFO Starting workers {"controller": "deployment", "controllerGroup": "apps", "controllerKind": "Deployment", "worker count": 1}
2024-02-03T05:45:45Z INFO Starting workers {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "worker count": 1}
2024/02/03 05:45:45 UpdateState(1, 0, 30)
2024/02/03 05:45:45 UpdateState(0, 0, 3)
2024/02/03 05:45:47 Not leader, doing nothing
2024/02/03 05:45:50 Not leader, doing nothing
2024/02/03 05:45:53 Not leader, doing nothing
2024/02/03 05:45:56 Not leader, doing nothing
2024/02/03 05:45:59 Not leader, doing nothing
I0203 05:46:01.822158 1 leaderelection.go:260] successfully acquired lease default/lingo.substratus.ai
2024/02/03 05:46:01 Started leading
2024/02/03 05:46:02 Calculating scales for all
2024/02/03 05:46:02 Leader active requests for: kubernetes: 0
2024/02/03 05:46:02 Leader active requests for: mistral-7b-instruct-vllm: 0
2024/02/03 05:46:02 Is leader, autoscaling
2024/02/03 05:46:02 Average for deployment: kubernetes: 0 (ceil: 0), current wait count: 0
2024/02/03 05:46:02 SetDesiredScale(0)
2024/02/03 05:46:02 Is leader, autoscaling
Found the log that shows the lease was lost and that it wasn't retried afterwards. See logs below:
ERROR 2024-02-03T07:02:08.953269395Z [resource.labels.containerName: lingo] 2024/02/03 07:02:08 Comparing scales, current: 0, desired: 0
ERROR 2024-02-03T07:02:10.365039937Z [resource.labels.containerName: lingo] error retrieving resource lock default/lingo.substratus.ai: Get "https://10.104.0.1:443/apis/coordination.k8s.io/v1/namespaces/default/leases/lingo.substratus.ai": context deadline exceeded - error from a previous attempt: unexpected EOF
INFO 2024-02-03T07:02:10.365088877Z [resource.labels.containerName: lingo] failed to renew lease default/lingo.substratus.ai: timed out waiting for the condition
ERROR 2024-02-03T07:02:10.365095237Z [resource.labels.containerName: lingo] 2024/02/03 07:02:10 Stopped leading
ERROR 2024-02-03T07:05:11.863969093Z [resource.labels.containerName: lingo] 2024/02/03 07:05:11 UpdateState(0, 0, 30)
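For what it's worth, this matches client-go's behavior: leaderelection.Run/RunOrDie returns after OnStoppedLeading fires, so unless the caller re-enters the election (as in the loop sketched above), the pod keeps serving traffic but never tries to re-acquire the lease. The later UpdateState line with no subsequent "attempting to acquire leader lease" entry would be consistent with that.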
A Lingo pod with long uptime seems to lose its leadership and is then unable to become leader again automatically. This happened for me on a cluster where the Lingo pod had an uptime of 30+ days.
Logs I observed:
Pod name:
lingo-7496d4c88f-987tn
Deployment details:
Symptoms so far:
Not sure what other info would be relevant. I don't have enough logs to go all the way back to whatever introduced the "Not leader, doing nothing" situation. I will see if I can find older logs in Cloud Logging.