Closed: vmaraloiu closed this issue 7 years ago
As can be seen in the logs:
cloud2.kaazing.example.com_1 | INFO HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketException[Connection refused to address /172.27.0.5:5941]
kwic_cloud1.kaazing.example.com_1 exited with code 137
cloud2.kaazing.example.com_1 | INFO HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1 | INFO HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1 | INFO HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Could not connect to: /172.27.0.5:5941. Reason: SocketTimeoutException[null]
cloud2.kaazing.example.com_1 | WARN HAZELCAST: [Member [172.27.0.7]:5941 - e6053ff1-038c-4b60-b1c3-9684cfc35b83 this] - [172.27.0.7]:5941 [kzha] [3.7.4] Removing connection to endpoint [172.27.0.5]:5941 Cause => java.net.SocketTimeoutException {null}, Error-Count: 5
After a node is killed there are three attempts to reconnect. This count can be changed with the property hazelcast.connection.monitor.max.faults, which defaults to three.
The property changed in the PR, hazelcast.socket.connect.timeout.seconds, is the socket connection timeout; making it smaller means the remaining nodes are notified earlier that one of them is dead.
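As a sketch of how those two knobs could be tuned, both are plain Java system properties (the names are from the Hazelcast 3.x system-properties docs; the values below are illustrative, not the ones chosen in the PR):

```java
// Sketch only: property names come from the Hazelcast 3.x docs;
// the values are illustrative, not the PR's actual choice.
public class TuneFailureDetection {
    public static void main(String[] args) {
        // Socket connect timeout: a smaller value means the surviving
        // members learn sooner that a peer is dead.
        System.setProperty("hazelcast.socket.connect.timeout.seconds", "3");
        // Reconnect attempts before the connection is removed (default 3).
        System.setProperty("hazelcast.connection.monitor.max.faults", "3");

        // A Hazelcast instance created after this point picks the values up.
        System.out.println(System.getProperty("hazelcast.socket.connect.timeout.seconds"));
    }
}
```

The same values could equally be passed on the JVM command line, e.g. -Dhazelcast.socket.connect.timeout.seconds=3.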
Yes, everything comes up fine when the cluster member comes back up.
Actually, after thinking this over more, I think this is the wrong way to go about it. We should be trapping the kill signal and gateway shutdown (except the signals we can't catch, like -9) and removing ourselves from the cluster when the cluster node/gateway shuts down. Is this implemented today, and do we have proper tests for it?
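A minimal sketch of the trap-and-leave idea (GatewayCluster and leave() are hypothetical stand-ins, not Kaazing APIs; in the real gateway, leave() would call the Hazelcast instance's LifecycleService shutdown so other members are told immediately instead of timing out):

```java
// Hypothetical sketch: remove ourselves from the cluster on normal
// termination. A JVM shutdown hook covers SIGTERM/SIGINT and normal
// exit, but not SIGKILL (-9), which cannot be caught.
public class GracefulLeave {
    static final class GatewayCluster {   // stand-in, not a Kaazing class
        void leave() {
            // Real code would shut down the Hazelcast LifecycleService here,
            // notifying the other members before the process exits.
            System.out.println("left cluster cleanly");
        }
    }

    public static void main(String[] args) {
        GatewayCluster cluster = new GatewayCluster();
        Runtime.getRuntime().addShutdownHook(new Thread(cluster::leave));
        // ... gateway runs; on SIGTERM or normal exit the hook fires ...
    }
}
```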
Whatever change we make, it should be overridable by the user. Is that possible with this implementation (i.e., can the user set a Java system property with the same name)? Or maybe we should expose it in InternalSystemProperty?
From: http://docs.hazelcast.org/docs/3.5/manual/html/networkconfiguration.html
I assume everything comes up fine when the cluster member comes back up?
Also, I would think that some sort of heartbeat interval modification might be better: see the heartbeat properties in http://docs.hazelcast.org/docs/3.5/manual/html/systemproperties.html. Thoughts?
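For example, the heartbeat-based failure detection from that page could be tightened instead of shrinking the connect timeout (property names are from the Hazelcast 3.5 system-properties docs; the values here are illustrative only):

```java
// Sketch: heartbeat tuning as an alternative to a smaller socket
// connect timeout. Names are from the Hazelcast 3.5 docs; the values
// are illustrative, not a recommendation.
public class TuneHeartbeat {
    public static void main(String[] args) {
        // How often each member sends a heartbeat.
        System.setProperty("hazelcast.heartbeat.interval.seconds", "1");
        // How long without a heartbeat before a member is declared dead.
        System.setProperty("hazelcast.max.no.heartbeat.seconds", "30");
        System.out.println(System.getProperty("hazelcast.max.no.heartbeat.seconds"));
    }
}
```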