config-sync stops syncing network status after losing skupper-site-leader lock

c-kruse commented 3 weeks ago

Describe the bug I suspect that when a router holds the skupper-site-leader lock and loses it because of some disruption, that router never attempts to reclaim the lock. If so, the network status would stop being updated until that router is restarted (and, in v2, the vanflow controller event source would stop producing events).

Router starts up and acquires lease

2024/09/09 17:01:38 CONFIG_SYNC: Version: v2-release-4-g03e177b
2024/09/09 17:01:38 CONFIG_SYNC: Waiting for Skupper router to be ready
2024/09/09 17:01:43 CONFIG_SYNC: Starting collector...
2024/09/09 17:01:43 CONFIG_SYNC: Starting controller loop...
I0909 17:01:43.424827       1 leaderelection.go:243] attempting to acquire leader lease ck-a/skupper-site-leader...
2024/09/09 17:01:43 CONFIG_SYNC: Waiting for informers to sync...
I0909 17:01:43.478119       1 leaderelection.go:253] successfully acquired lease ck-a/skupper-site-leader
2024/09/09 17:01:43 COLLECTOR: Leader skupper-router-d8fc996dc-8msxt starting site collection after 53.452727ms

Config sync carries on: network status info is updated by kube/flow, the vanflow controller runs and sends out events, and so on. All is okay.

2024/09/09 17:01:43 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=10
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=1
2024/09/09 17:01:44 INFO servicing flush component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 source=33a2e9ea-5033-4f22-939b-4c3983d31346
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=1
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=11
2024/09/09 17:01:45 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status
2024/09/09 17:01:54 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status

The cluster is disrupted and a bit of chaos ensues. I think the router container becomes unresponsive or unavailable, then leader election fails and config-sync stops the vanflow collector and controller? (Note: the lines in a different format come from klog in the k8s client.)

2024/09/09 19:31:12 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"
2024/09/09 19:31:20 INFO purged records from forgotten source component=kube.flow.statusSync source=33a2e9ea-5033-4f22-939b-4c3983d31346 count=12
2024/09/09 19:31:29 ERROR session error on discovery container component=kube.flow.statusSync error="session error: session receiver error: receive error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}"
2024/09/09 19:31:32 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status
2024/09/09 19:31:32 ERROR amqp session error component=kube.flow.controller error="session error: session receiver error: receive error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}" retryable=true
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}"
2024/09/09 19:31:32 ERROR error sending event source heartbeat component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}" priorSuccess=true timeouts=0
E0909 19:31:32.451666       1 leaderelection.go:325] error retrieving resource lock ck-a/skupper-site-leader: Get "https://172.21.0.1:443/api/v1/namespaces/ck-a/configmaps/skupper-site-leader": context deadline exceeded
I0909 19:31:32.451741       1 leaderelection.go:278] failed to renew lease ck-a/skupper-site-leader: timed out waiting for the condition
E0909 19:31:32.452942       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
2024/09/09 19:31:32 ERROR could not update network status info component=kube.flow.statusSync error="failed to update configmap: Get \"https://172.21.0.1:443/api/v1/namespaces/ck-a/configmaps/skupper-network-status\": context canceled"
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"

End of time: no more mentions of leaderelection in the logs, the configmap backing the lock is untouched since the disruption, and all vanflow components remain shut down.

How To Reproduce Steps to reproduce the behavior: TBD

Expected behavior I would expect the leader election mechanism to back off and retry.

Environment details Observed in v2 after one of two cluster nodes was temporarily "not ready". The logic here is not obviously different from what it has been in v1, but I do not have a clever way to confirm that yet.

EDIT Encountered this again. This time the router container was restarted after failing its health check, while config-sync was allowed to continue running. Somewhere in there config-sync lost its lock.

c-kruse commented 3 weeks ago

Options to mitigate:

pessimistic: continue to respect the lock, running status sync and the controller only while the lock is held, but try to re-acquire the lock when it is lost.

optimistic: risk running multiple active status syncs and controllers by "ignoring" the loss of the lock. Definitely not ideal, and I do think we'd run into weird edge cases, especially around the coming and going of duplicate controller event sources, but I think it would mostly work okay aside from extra noise.
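To make the pessimistic option concrete, here is a minimal sketch of a re-acquire loop with backoff. It is not Skupper's code and does not use the client-go API: `campaignFunc`, `runWithReacquire`, and `simulateTerms` are hypothetical names, and the campaign function is assumed to block until the lock is acquired, run the workers with a context that is canceled when the lock is lost, and then return.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// campaignFunc is a hypothetical stand-in for one round of leader election.
// It is assumed to block until the lock is held, call lead with a context
// that is canceled when the lock is lost, and return once lead has exited.
// A non-nil error means the lock could not be acquired at all.
type campaignFunc func(ctx context.Context, lead func(context.Context)) error

// runWithReacquire campaigns repeatedly until ctx is canceled. The workers
// in lead only run while the lock is held. After a failed campaign the delay
// doubles (capped at maxDelay); after a completed term it resets to base.
func runWithReacquire(ctx context.Context, campaign campaignFunc, lead func(context.Context), base, maxDelay time.Duration) {
	delay := base
	for ctx.Err() == nil {
		if err := campaign(ctx, lead); err != nil {
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		} else {
			delay = base
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
	}
}

// simulateTerms drives the loop with a fake campaign that fails once and then
// loses the lock immediately after every acquisition. It returns how many
// times the workers were started before shutdown.
func simulateTerms(terms int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	attempts, started := 0, 0
	fake := func(ctx context.Context, lead func(context.Context)) error {
		attempts++
		if attempts == 1 {
			return errors.New("api server unreachable") // simulated disruption
		}
		termCtx, lose := context.WithCancel(ctx)
		lose() // simulate losing the lock right away
		lead(termCtx)
		return nil
	}
	lead := func(termCtx context.Context) {
		started++
		if started == terms {
			cancel() // shut the whole loop down after the requested terms
		}
		<-termCtx.Done() // workers stop as soon as the lock is lost
	}

	runWithReacquire(ctx, fake, lead, time.Millisecond, 10*time.Millisecond)
	return started
}

func main() {
	fmt.Println("worker starts:", simulateTerms(3))
}
```

The point of the sketch is only that the workers are restarted each time the lock is regained. In a real implementation the campaign step would wrap whatever leader-election call config-sync already uses.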

c-kruse commented 3 weeks ago

For posterity: Looks like at least in v2 we're going to try a third option - restart when we lose the lock.
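The thread does not say how the restart would be implemented, so the following is only a sketch under assumptions: the leader-election call is assumed to block and return once the lock is lost, and exiting non-zero is assumed to be enough for the container to be restarted by its restart policy. `leadUntilLost`, `runOrExit`, and `exitCodeAfter` are hypothetical names, and the exit function is injected so the behavior can be exercised without killing the process.

```go
package main

import (
	"context"
	"fmt"
	"os"
)

// leadUntilLost is a hypothetical stand-in for a blocking leader-election
// call: it is assumed to return only when the lock is lost or ctx is canceled.
type leadUntilLost func(ctx context.Context) error

// runOrExit blocks in elect. If elect returns while ctx is still live, the
// lock was lost unexpectedly, so it calls exit(1) and relies on the container
// being restarted (an assumption about the deployment). If ctx was canceled,
// this is a normal shutdown and exit is not called.
func runOrExit(ctx context.Context, elect leadUntilLost, exit func(code int)) {
	err := elect(ctx)
	if ctx.Err() != nil {
		return // normal shutdown requested by the caller
	}
	fmt.Fprintf(os.Stderr, "lost leader lock (err=%v); exiting to force a restart\n", err)
	exit(1)
}

// exitCodeAfter runs runOrExit against a fake election that returns
// immediately and reports the exit code that would have been used
// (0 means exit was never called).
func exitCodeAfter(shutdown bool) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if shutdown {
		cancel() // simulate an orderly shutdown before the election returns
	}
	code := 0
	runOrExit(ctx, func(context.Context) error { return nil }, func(c int) { code = c })
	return code
}

func main() {
	// A real process would pass os.Exit as the exit function.
	fmt.Println("exit code after losing the lock:", exitCodeAfter(false))
	fmt.Println("exit code after orderly shutdown:", exitCodeAfter(true))
}
```

Compared with re-acquiring in-process, restarting avoids having to reason about partially stopped collectors and controllers, at the cost of a brief gap while the container comes back.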

skupperproject / skupper

config-sync stops syncing network status after losing skupper-site-leader lock #1645