Currently, the failover fiber pings all nodes on every step. However, this ping does not affect instance priority at all; it only resets connections that "hang". This was done because, if a user's call returns too large a value, the connection cannot serve any other requests until that value has been returned.
Current behavior of the failover fiber:
Increase replica priority: every FAILOVER_UP_TIMEOUT, the failover fiber tries to connect to a replica with a higher priority.
Decrease replica priority: if we have not been connected to the prioritized replica for more than FAILOVER_DOWN_TIMEOUT, we take another one and connect to it.
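The two rules above can be modelled roughly as follows. This is only a minimal sketch in Python, not the actual failover code: the `Replica` and `Failover` classes, the timeout values, and the convention that a lower `priority` number means a more preferred replica are all assumptions made for illustration. Note that the only health signal consulted is the `connected` flag.

```python
from dataclasses import dataclass

# Illustrative values only (assumption); the real constants live in the failover code.
FAILOVER_UP_TIMEOUT = 5.0
FAILOVER_DOWN_TIMEOUT = 20.0


@dataclass
class Replica:
    name: str
    priority: int           # assumption: lower number = more preferred
    connected: bool = True  # stand-in for the connection's is_connected state


class Failover:
    """Hypothetical model of the current up/down logic."""

    def __init__(self, replicas, now=0.0):
        self.replicas = sorted(replicas, key=lambda r: r.priority)
        self.current = self.replicas[0]
        self.last_up_attempt = now
        self.disconnected_since = None

    def step(self, now):
        cur = self.current
        # Down: if the current replica has been disconnected for longer
        # than FAILOVER_DOWN_TIMEOUT, take another connected one.
        if cur.connected:
            self.disconnected_since = None
        else:
            if self.disconnected_since is None:
                self.disconnected_since = now
            if now - self.disconnected_since > FAILOVER_DOWN_TIMEOUT:
                for r in self.replicas:
                    if r is not cur and r.connected:
                        self.current = r
                        self.disconnected_since = None
                        break
        # Up: every FAILOVER_UP_TIMEOUT, try a replica with a better priority.
        if now - self.last_up_attempt >= FAILOVER_UP_TIMEOUT:
            self.last_up_attempt = now
            for r in self.replicas:
                if r.priority >= self.current.priority:
                    break
                if r.connected:
                    self.current = r
                    self.disconnected_since = None
                    break
        return self.current
```

In this model a replica whose connection is established but unresponsive is never demoted, because `connected` stays true.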
The major problem here is the assumption that if a net.box connection is_connected, then everything is all right. In real life this is not the case. When we cannot ping a replica, we should temporarily lower its priority. This may be done as follows:
If a user's call or the failover's ping fails with an error that indicates the connection is dead (some net.box error or TimeOut), we increase the counter of failed requests to this replica. We introduce a constant as the threshold for this counter, which will be 3 for now. If 3 consecutive requests fail, we temporarily decrease the priority of that replica.
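The proposed counter could look roughly like the sketch below. It is written in Python for illustration only; `FAILED_REQUESTS_THRESHOLD`, `ConnectionDeadError`, `ReplicaHealth`, and `pick_replica` are hypothetical names, not part of any existing API. How and when a lowered priority is restored is not specified above, so the sketch leaves that out.

```python
# Hypothetical constant name; the value 3 comes from the proposal.
FAILED_REQUESTS_THRESHOLD = 3


class ConnectionDeadError(Exception):
    """Stand-in for a net.box error or timeout that indicates a dead connection."""


class ReplicaHealth:
    """Hypothetical per-replica bookkeeping for failed requests."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority   # assumption: lower number = more preferred
        self.failed_requests = 0   # consecutive failures seen so far
        self.is_lowered = False    # True while the priority is temporarily lowered

    def on_request_result(self, error=None):
        """Record the outcome of a user call or a failover ping."""
        if isinstance(error, ConnectionDeadError):
            self.failed_requests += 1
            if self.failed_requests >= FAILED_REQUESTS_THRESHOLD:
                self.is_lowered = True
        else:
            # Success or an application-level error: the connection is alive,
            # so the run of consecutive failures is broken.
            self.failed_requests = 0


def pick_replica(replicas):
    """Prefer replicas that are not lowered, then order by priority."""
    return min(replicas, key=lambda r: (r.is_lowered, r.priority))
```

Resetting the counter on any response that is not a connection error keeps the check tied to consecutive failures, so occasional isolated timeouts do not demote a replica.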