skupperproject / skupper

Skupper is an implementation of a Virtual Application Network, enabling rich hybrid cloud communication.
http://skupper.io
Apache License 2.0

config-sync stops syncing network status after losing skupper-site-leader lock #1645

Open c-kruse opened 3 weeks ago

c-kruse commented 3 weeks ago

Describe the bug: I suspect that when a router holds the skupper-site-leader lock and some disruption causes it to lose that lock, the router never attempts to reclaim it. The network status would then stop being updated until that router is restarted (and in v2 the vanflow controller event source would stop producing events).

2024/09/09 17:01:43 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=10
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=1
2024/09/09 17:01:44 INFO servicing flush component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 source=33a2e9ea-5033-4f22-939b-4c3983d31346
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=1
2024/09/09 17:01:44 INFO record message sent component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 record_count=11
2024/09/09 17:01:45 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status
2024/09/09 17:01:54 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status
2024/09/09 19:31:12 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"
2024/09/09 19:31:20 INFO purged records from forgotten source component=kube.flow.statusSync source=33a2e9ea-5033-4f22-939b-4c3983d31346 count=12
2024/09/09 19:31:29 ERROR session error on discovery container component=kube.flow.statusSync error="session error: session receiver error: receive error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}"
2024/09/09 19:31:32 INFO updating network status info component=kube.flow.statusSync configmap=skupper-network-status
2024/09/09 19:31:32 ERROR amqp session error component=kube.flow.controller error="session error: session receiver error: receive error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}" retryable=true
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}"
2024/09/09 19:31:32 ERROR error sending event source heartbeat component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send error: *Error{Condition: amqp:resource-limit-exceeded, Description: local-idle-timeout expired, Info: map[]}" priorSuccess=true timeouts=0
E0909 19:31:32.451666       1 leaderelection.go:325] error retrieving resource lock ck-a/skupper-site-leader: Get "https://172.21.0.1:443/api/v1/namespaces/ck-a/configmaps/skupper-site-leader": context deadline exceeded
I0909 19:31:32.451741       1 leaderelection.go:278] failed to renew lease ck-a/skupper-site-leader: timed out waiting for the condition
E0909 19:31:32.452942       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
2024/09/09 19:31:32 ERROR could not update network status info component=kube.flow.statusSync error="failed to update configmap: Get \"https://172.21.0.1:443/api/v1/namespaces/ck-a/configmaps/skupper-network-status\": context canceled"
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"
2024/09/09 19:31:32 ERROR error sending event source beacon component=vanflow.eventsource.manager instance=33a2e9ea-5033-4f22-939b-4c3983d31346 error="send timed out"

How To Reproduce: TBD

Expected behavior: I would expect the leader election mechanism to back off and retry.

Environment details: Observed in v2 after one of two cluster nodes was temporarily "not ready". The logic here is not obviously different from what it has been in v1, but I do not have a clever way to confirm that yet.

EDIT: Encountered this again. This time the router container was restarted after failing its health check, while config-sync was allowed to continue running. Somewhere in there config-sync lost its lock.

c-kruse commented 3 weeks ago

Options to mitigate:

pessimistic: continue to respect the lock, running status sync and the controller only while the lock is held, but try to re-acquire the lock when it is lost.

optimistic: risk running multiple active status syncs and controllers by "ignoring" the loss of the lock. Definitely not ideal, and I do think we'd run into weird edge cases, especially around the coming and going of duplicate controller event sources, but I think it would mostly work okay aside from extra noise.

c-kruse commented 3 weeks ago

For posterity: Looks like at least in v2 we're going to try a third option - restart when we lose the lock.