What does kubectl describe say about why the cr-syncer is terminating?
The new health check seems to be the reason for the restarts. But the check is not really failing; it is just taking too long. The I/O timeout seems to be 30 seconds, which probably also applies to the list call in the health check, but the liveness probe timeout is just 10 seconds.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 8s (x14 over 25h) kubelet Container image "gcr.io/sap-asimov-master/cr-syncer@sha256:de7733bba58c3350dbe86411617c59e45dbb74ff3eae9a5507ad2c6dc86c6d47" already present on machine
Normal Created 8s (x14 over 25h) kubelet Created container cr-syncer
Normal Started 8s (x14 over 25h) kubelet Started container cr-syncer
Warning Unhealthy 8s (x36 over 24h) kubelet Liveness probe failed: Get "http://192.168.9.9:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Killing 8s (x12 over 24h) kubelet Container cr-syncer failed liveness probe, will be restarted
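For context, here is a minimal sketch (not the actual cr-syncer code) of how such a mismatch can arise when a /health handler issues a list call against the upstream Kubernetes API; the handler name, the client, and the use of a namespace list as the probe are assumptions for illustration:

```go
package crsyncer

import (
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// healthHandler answers the liveness probe by issuing a cheap list call
// against the upstream API server. If a broken connection only surfaces
// after the client's ~30s i/o timeout, but the kubelet probe waits just
// 10s, the probe fails first and the container is killed even though the
// check would eventually succeed.
func healthHandler(client kubernetes.Interface) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		_, err := client.CoreV1().Namespaces().List(r.Context(), metav1.ListOptions{Limit: 1})
		if err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```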
Increasing timeoutSeconds to 35 in the cr-syncer deployment solves both the reconnect and the startup issue.
However, the reconnect case is still not perfect. After I blocked cr-syncer traffic for 4 minutes, it took 14 minutes to recover once I removed the block. I suspect the exponential backoff in workqueue.DefaultControllerRateLimiter is the reason for this. I'll test a different RateLimiter, as I did here earlier this year.
Thanks for the update and for testing this case! I've sent a change for timeoutSeconds. Please send a PR to change the ratelimiter if you find something that works better. I was originally skeptical about deviating from the best practice of randomized exponential backoff, but in our case I think a fixed backoff or a shorter maximum backoff would make sense, since we have a bounded number of clients and it's more important to restore functionality quickly after a network outage.
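One possible shape for that (a sketch only, not a tested change to the cr-syncer): keep the per-item exponential backoff that workqueue.DefaultControllerRateLimiter uses, but cap the maximum delay well below the default 1000s. The 10-second cap and the function name are assumptions for illustration.

```go
package crsyncer

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newRateLimiter mirrors workqueue.DefaultControllerRateLimiter(), but caps
// the per-item exponential backoff at 10s instead of 1000s so items are
// retried promptly once connectivity is restored.
func newRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5ms base delay, 10s maximum delay.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 10*time.Second),
		// Overall rate limit, identical to the default (10 qps, burst 100).
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
```

The work queue would then be built with workqueue.NewRateLimitingQueue(newRateLimiter()) instead of the default rate limiter.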
I created a little test case to check how the cr-syncer reacts to connectivity issues. The first three lines of the following log are the regular log entries while the connection was available. After restarting nginx, only downstream events are left, which sounds reasonable. After a couple more events there is an i/o timeout error and the cr-syncer restarts without further notice.
The cr-syncer is unable to start without connectivity to the Cloud Kubernetes API, which causes a crash loop after a while.
I'm not entirely sure why the cr-syncer is restarting in the first place, as the Syncing key error is not a panic but just a log entry. However, in both cases the pod restarts 30 seconds after the cr-syncer started (re-)connecting, so it might be directly related to that event.
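For reference, a generic controller worker loop (a sketch, not the cr-syncer source) shows why a failed sync only produces a "Syncing key" log entry and a rate-limited requeue rather than terminating the process; the controller type and syncHandler are placeholders.

```go
package crsyncer

import (
	"log"

	"k8s.io/client-go/util/workqueue"
)

type controller struct {
	queue       workqueue.RateLimitingInterface
	syncHandler func(key string) error
}

// processNextWorkItem handles one queued key. A sync error is logged and the
// key is re-queued with backoff from the rate limiter; the process itself
// keeps running.
func (c *controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.syncHandler(key.(string)); err != nil {
		// This is the "Syncing key ..." log entry: not a panic, just a retry
		// whose delay grows with the rate limiter's exponential backoff.
		log.Printf("Syncing key %q failed: %v", key, err)
		c.queue.AddRateLimited(key)
		return true
	}
	c.queue.Forget(key)
	return true
}
```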