tarantool / vshard

The new generation of sharding based on virtual buckets

Failover must change priority based on ping #483

Closed Serpentian closed 3 months ago

Serpentian commented 3 months ago

Currently, failover pings all nodes on every step. However, this ping doesn't affect instance priority at all; it only resets connections that "hang". This was done because if a user returns too large a value from a call, the connection cannot serve any other requests until that value is returned.

Current behavior of failover fiber:

  1. Increase replica priority: every FAILOVER_UP_TIMEOUT, the failover fiber tries to connect to a replica with higher priority.
  2. Decrease replica priority: if we have not been connected to the prioritized replica for more than FAILOVER_DOWN_TIMEOUT, we take another one and connect to it.

The major problem here is the assumption that if a net.box connection is_connected, then everything is all right. In real life this is not the case. When we cannot ping a replica, we should temporarily lower its priority. This may be done as follows:

If a user's call or the failover's ping fails with an error indicating that the connection is dead (some net.box error or TimeOut), we increase the counter of failed requests to this replica. For this counter we introduce a constant threshold, which will be 3 for now. If 3 consecutive requests fail, we temporarily decrease the priority of that replica.
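
To make the proposal concrete, here is a rough sketch of the counter logic. It is written in Python only to illustrate the idea and is not vshard code. All names (`FAILED_REQUESTS_THRESHOLD`, `ConnectionDeadError`, `ReplicaHealth`, `pick_replica`) are hypothetical. Two behaviors are assumptions that the description above does not specify: a successful request restores the priority, and errors that do not indicate a dead connection leave the counter untouched.

```python
from dataclasses import dataclass

# Hypothetical name; the proposal sets the threshold to 3 "for now".
FAILED_REQUESTS_THRESHOLD = 3


class ConnectionDeadError(Exception):
    """Stand-in for a net.box error or timeout that marks a connection as dead."""


@dataclass
class ReplicaHealth:
    name: str
    failed_requests: int = 0    # counts consecutive failures only
    is_degraded: bool = False   # True while the priority is temporarily lowered

    def on_success(self) -> None:
        # Assumption: any successful request resets the counter
        # and restores the original priority.
        self.failed_requests = 0
        self.is_degraded = False

    def on_failure(self, error: Exception) -> None:
        # Assumption: errors that do not indicate a dead connection
        # (e.g. an application error) are ignored by the counter.
        if not isinstance(error, ConnectionDeadError):
            return
        self.failed_requests += 1
        if self.failed_requests >= FAILED_REQUESTS_THRESHOLD:
            self.is_degraded = True


def pick_replica(replicas):
    """Return the first non-degraded replica from a list sorted by priority.

    Falls back to the highest-priority replica if all are degraded.
    """
    for replica in replicas:
        if not replica.is_degraded:
            return replica
    return replicas[0]
```

Both user calls and failover pings would report into the same `on_success` / `on_failure` hooks, so a replica whose connection is nominally connected but unresponsive stops being preferred after three consecutive dead-connection errors.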