sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

Stolon-proxy stops functioning if k8s API is unavailable #790

Closed: jacksontj closed this issue 4 years ago

jacksontj commented 4 years ago

What happened: I have a stolon cluster running in a k8s cluster. This particular cluster has been running since May with no issues (great project :) ). Today I noticed a short outage (it lasted a few seconds), and after digging in, it seems to line up with downtime of the k8s API. During the outage, the connections to the DB were closed (reported as "Connection reset by peer" in PG).

What you expected to happen: Stolon maintains the state of the cluster through the k8s API, which is understandable. But in this case I was surprised that stolon-proxy would close connections to PG on a short interruption of that API. I would expect the stolon-proxy nodes to continue routing traffic based on their last-synced state from the k8s API until the API comes back up (and I would expect a stolon-proxy that had just started to be broken as well).

How to reproduce it (as minimally and precisely as possible): Create a stolon cluster in k8s, then turn off the kube-api server.

Anything else we need to know?: Seems to just be odd handling of downstream failure, seems somewhat straight-forward.

Environment:

sgotti commented 4 years ago

@jacksontj The reason is explained in the stolon architecture doc: https://github.com/sorintlab/stolon/blob/master/doc/architecture.md

jacksontj commented 4 years ago

From the docs I now see:

If the API servers are overloaded and doesn't answer in time, as explained above, the proxies will, after a timeout, close connections to the master keeper.

What timeout is this referring to? It sounds like it might be configurable. If so, that would suffice (I could set it to 5m or something, depending on how long my cloud provider's API server upgrades take :) )

miklezzzz commented 4 years ago

@jacksontj

Name | Description | Required | Type | Default
proxyCheckInterval | interval to wait before next proxy check. | no | string (duration) | 5s
proxyTimeout | interval where a proxy check must successfully complete or the proxy will close all connections to the master. | no | string (duration) | 15s

https://github.com/sorintlab/stolon/blob/master/doc/cluster_spec.md