Add ETCDv3 client more stability

sorintlab / stolon

PostgreSQL cloud native High Availability and more.

https://talk.stolon.io

Apache License 2.0


Add ETCDv3 client more stability #821

Closed: itsystem closed this 3 years ago

itsystem commented 3 years ago

Add more stability to the ETCDv3 client:

Enable keepalive checks. Either the caller must react to the error codes returned by the etcd3Lib functions, or keepalive must be enabled so that etcd3Lib itself reacts to etcd cluster changes. Issue: https://github.com/sorintlab/stolon/issues/794
Update the client libs to v3.4.14.
Increase the proxy timeout to 20 sec, to avoid terminating proxy connections when the etcd cluster changes leader. When the etcd cluster changes its leader node, stolon-proxy resets all connections, while the other components do not fail. Increasing DefaultProxyTimeout from 15 to 20 sec can change this behaviour.
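To make the timeout argument concrete, here is a minimal sketch in Go. It assumes that the proxy drops client connections once it has been unable to confirm its state from the store for longer than the proxy timeout. The helper names and the 18-second outage are hypothetical and chosen only for illustration; this is not stolon's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// proxyShouldClose reports whether a proxy that last confirmed its state from
// the store at lastOK has to drop client connections at time now.
// Hypothetical helper, not taken from stolon.
func proxyShouldClose(lastOK, now time.Time, timeout time.Duration) bool {
	return now.Sub(lastOK) > timeout
}

// survivesOutage reports whether client connections are kept when the store
// is unreachable for outageSec seconds and the proxy timeout is timeoutSec.
func survivesOutage(outageSec, timeoutSec int) bool {
	lastOK := time.Unix(0, 0)
	now := lastOK.Add(time.Duration(outageSec) * time.Second)
	return !proxyShouldClose(lastOK, now, time.Duration(timeoutSec)*time.Second)
}

func main() {
	// Assumed length of the store outage during an etcd leader change.
	// Illustrative value only; the real duration depends on the cluster.
	const outageSec = 18

	for _, timeoutSec := range []int{15, 20} {
		fmt.Printf("timeout=%ds outage=%ds connections kept: %v\n",
			timeoutSec, outageSec, survivesOutage(outageSec, timeoutSec))
	}
}
```

Under these assumptions a larger timeout only helps when the outage is shorter than the new value, which is why the fixed number matters less than its relation to the etcd election time.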

itsystem commented 3 years ago

I think the tests failed for reasons unrelated to the code:

Successfully tagged stolon:master-pg11
/stolon/examples/kubernetes /stolon
error: unable to recognize "role.yaml": Get https://localhost:8443/api?timeout=32s: dial tcp 127.0.0.1:8443: connect: connection refused

deepdivenow commented 3 years ago

Hi, I have recorded two videos for you.

The shorter video shows how the two sentinel versions behave in the same situation, when the master node disappears: https://youtu.be/4xto0y1EfHw

The longer video shows only v0.16.0. In it, all stolon components stay in a failed state until I restart them: https://youtu.be/HJZbeR2BGFs

You can reproduce this situation ONLY if the etcd leader node disappears without sending any packet over the network and does not reply to any other packets, like a black hole.
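The "black hole" case is the one a keepalive check is meant to catch: a peer that vanishes without closing the connection sends no error, so the client only notices if it expects a reply within a deadline. The etcd v3 Go client exposes `DialKeepAliveTime` and `DialKeepAliveTimeout` in its `Config` for this purpose. The sketch below does not use that client; it only illustrates the idea with the standard library, and `probe` and `checkEndpoint` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// probe waits for a single keepalive reply on pong. It reports false when no
// reply arrives before the timeout, which is how a silent peer is detected.
func probe(pong <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-pong:
		return true
	case <-time.After(timeout):
		return false
	}
}

// checkEndpoint simulates one keepalive round. A healthy endpoint answers
// immediately; a black-hole endpoint never answers and never closes.
func checkEndpoint(blackHole bool, timeoutMs int) bool {
	pong := make(chan struct{}, 1)
	if !blackHole {
		pong <- struct{}{} // simulated reply from a healthy endpoint
	}
	return probe(pong, time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	fmt.Println("healthy endpoint alive:", checkEndpoint(false, 50))
	fmt.Println("black-hole endpoint alive:", checkEndpoint(true, 50))
}
```

Without such a deadline the caller would keep waiting on the dead endpoint, which matches the failed state seen in the videos until the components were restarted.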

deepdivenow commented 3 years ago

Are you saying that etcd takes too much time to elect a new leader in relation to the stolon proxy timeout, and you want to avoid the timeout, right? If so, is the etcd leader election time something that depends on multiple environment conditions, or something with a defined upper bound? We should understand whether increasing the timeout works only in your environment or everywhere (and probably make the proxy timeout configurable; there's already an open issue for this).

It is better if this timeout is configurable, because the leader election time is configurable in the etcd cluster, and a fixed timeout can't satisfy every environment.
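As a rough illustration of what "configurable" could look like, the sketch below reads the timeout from a command-line flag and falls back to the 15-second default mentioned in this PR. The flag name `proxy-timeout` and the function `parseProxyTimeout` are assumptions made for this example; they are not taken from stolon, which may name and wire the option differently.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"time"
)

// defaultProxyTimeout mirrors the 15 sec default discussed in this thread.
const defaultProxyTimeout = 15 * time.Second

// parseProxyTimeout reads a hypothetical "proxy-timeout" flag from args and
// returns the default when the flag is absent.
func parseProxyTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("proxy", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	timeout := fs.Duration("proxy-timeout", defaultProxyTimeout,
		"how long the proxy tolerates an unreachable store")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *timeout <= 0 {
		return 0, fmt.Errorf("proxy-timeout must be positive, got %v", *timeout)
	}
	return *timeout, nil
}

func main() {
	for _, args := range [][]string{nil, {"-proxy-timeout=20s"}} {
		d, err := parseProxyTimeout(args)
		fmt.Println(d, err)
	}
}
```

An operator could then pick a value above the election timeout configured for their own etcd cluster instead of relying on one constant for every environment.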

sgotti commented 3 years ago

@deepdivenow I created two PRs in place of this: #827 #828

It is better if this timeout is configurable, because the leader election time is configurable in the etcd cluster, and a fixed timeout can't satisfy every environment.

This was already implemented in #756

Thanks!