Reduce cluster/balancer-service recovery time when cluster member is removed

dpwspoon commented 7 years ago

Whatever changes we make should be overridable by the user. Is this possible with this implementation (i.e., the user sets a Java system property with the same name)? Or maybe we should expose it in InternalSystemProperty?

From: http://docs.hazelcast.org/docs/3.5/manual/html/networkconfiguration.html

connection-timeout-seconds: Defines the connection timeout. This is the maximum amount of time Hazelcast is going to try to connect to a well known member before giving up. Setting it to a too low value could mean that a member is not able to connect to a cluster. Setting it to a too high value means that member startup could slow down because of longer timeouts (e.g. when a well known member is not up). Increasing this value is recommended if you have many IPs listed and the members cannot properly build up the cluster. Its default value is 5.

I assume everything comes up fine when the cluster member comes back up?

Also, I would think that adjusting the heartbeat interval might be a better approach; see the heartbeat properties in http://docs.hazelcast.org/docs/3.5/manual/html/systemproperties.html. Thoughts?

vmaraloiu commented 7 years ago

As can be seen in the logs:

cloud2.kaazing.example.com_1   | INFO  HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketException[Connection refused to address /172.27.0.5:5941]
kwic_cloud1.kaazing.example.com_1 exited with code 137
cloud2.kaazing.example.com_1   | INFO  HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1   | INFO  HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1   | INFO  HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1   | WARN  HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Removing connection to endpoint [172.27.0.5]:5941 Cause => java.net.SocketTimeoutException {null}, Error-Count: 5

After a node is killed, there are three attempts to reconnect. This is controlled by the property hazelcast.connection.monitor.max.faults, which defaults to three. The property changed in this PR, hazelcast.socket.connect.timeout.seconds, is the socket connection timeout; if we make it smaller, the remaining nodes are notified sooner that one of them is dead. And yes, everything comes up fine when the cluster member comes back up.
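To make the override question raised earlier in this thread concrete, here is a minimal sketch of one way a shorter default could be applied while still letting a user-supplied Java system property win. The class name, method names, and the fallback value "2" are hypothetical and are not taken from the PR; only the two property names come from this discussion.

```java
import java.util.Properties;

// Hypothetical helper, not part of the gateway code base.
final class ClusterTimeoutDefaults {
    static final String CONNECT_TIMEOUT = "hazelcast.socket.connect.timeout.seconds";
    static final String MAX_FAULTS = "hazelcast.connection.monitor.max.faults";

    /** Returns the user's Java system property if it is set, otherwise the fallback. */
    static String resolve(String name, String fallback) {
        return System.getProperty(name, fallback);
    }

    /** Collects the resolved values that would be passed on to the cluster configuration. */
    static Properties build(String connectTimeoutFallback, String maxFaultsFallback) {
        Properties props = new Properties();
        props.setProperty(CONNECT_TIMEOUT, resolve(CONNECT_TIMEOUT, connectTimeoutFallback));
        props.setProperty(MAX_FAULTS, resolve(MAX_FAULTS, maxFaultsFallback));
        return props;
    }

    public static void main(String[] args) {
        // "2" is an assumed shortened timeout; "3" is the default cited in this thread.
        Properties props = build("2", "3");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With this shape, starting the JVM with -Dhazelcast.socket.connect.timeout.seconds=10 would take precedence over the built-in fallback, because System.getProperty only uses the fallback when the property is unset.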

dpwspoon commented 7 years ago

Actually, after thinking this over more, I think this is the wrong way to go about it. We should trap the kill signal and gateway shutdown (except signals we can't catch, like -9) and remove ourselves from the cluster when the cluster node/gateway shuts down. Is this implemented today, and do we have proper tests for it?
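As a rough illustration of the idea, the sketch below uses a JVM shutdown hook to run a "leave the cluster" action once, whether it is triggered by a normal gateway shutdown or by a catchable signal. The class name and the Runnable passed in are hypothetical; what the gateway would actually call to leave the cluster is not specified in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of the gateway code base.
final class GracefulClusterExit {
    private final Runnable leaveCluster;
    private final AtomicBoolean left = new AtomicBoolean(false);

    GracefulClusterExit(Runnable leaveCluster) {
        this.leaveCluster = leaveCluster;
    }

    /** Runs the leave action at most once, no matter how many callers trigger it. */
    void leave() {
        if (left.compareAndSet(false, true)) {
            leaveCluster.run();
        }
    }

    /**
     * Registers a JVM shutdown hook that calls leave().
     * Hooks run on normal exit and on signals such as SIGTERM or SIGINT,
     * but not on SIGKILL (kill -9), which the JVM cannot intercept.
     */
    Thread register() {
        Thread hook = new Thread(this::leave, "cluster-leave-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Placeholder action: a real gateway would shut down its cluster member here.
        GracefulClusterExit exit = new GracefulClusterExit(
                () -> System.out.println("leaving cluster"));
        exit.register();
    }
}
```

Exit code 137 in the log above conventionally means the process was ended by SIGKILL (128 + 9), so a hook like this would not have run in that particular case; a short connection timeout or heartbeat setting would still be needed for such failures.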

vmaraloiu commented 7 years ago

This PR does not trap the kill signal; instead, it configures the hazelcast.socket.connect.timeout.seconds property, which seems to fix issue #993, as can be seen here.

kaazing / gateway

Reduce cluster/balancer-service recovery time when cluster member is removed #875