tarantool / vshard

The new generation of sharding based on virtual buckets

Error fallback on router for faulty connections #298

Closed Gerold103 closed 2 years ago

Gerold103 commented 2 years ago

The router continues to send requests to replicas which are proven to be broken. These are orphan nodes which haven't finished recovery/bootstrap yet, or which finished it with an error and are now broken. This also includes instances that haven't called vshard.storage.cfg, or have called it but it hasn't finished yet.

If the boot has not finished, all kinds of bad behaviour are possible. The worst ones:

It seems reasonable to rely on box.info.status ~= 'running' as a sign that the node is not ready to do anything. This check can be done right in the storage functions. Once they see that the instance is running, the storage can reload itself to a version without these checks, so that the expensive box.info is no longer called once it is unnecessary.
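The self-removing check could look roughly like the toy model below. It is written in Python purely to illustrate the idea and is not vshard code: `Storage`, `get_status`, the `NOT_RUNNING` error code and the `status_reads` counter are all hypothetical names, and `get_status` merely stands in for reading box.info.status.

```python
class Storage:
    """Toy model of a storage whose API function guards itself until ready."""

    def __init__(self, get_status):
        # get_status stands in for reading box.info.status (assumed expensive).
        self._get_status = get_status
        # Counts how many times the status was actually read.
        self.status_reads = 0
        # Start with the guarded version of the handler.
        self.call = self._guarded_call

    def _read_status(self):
        self.status_reads += 1
        return self._get_status()

    def _guarded_call(self, request):
        status = self._read_status()
        if status != 'running':
            # Return an explicit, recognizable error (hypothetical code).
            return None, {'code': 'NOT_RUNNING', 'status': status}
        # The instance is ready: swap in the unguarded version for good,
        # so later calls never read the status again.
        self.call = self._plain_call
        return self._plain_call(request)

    def _plain_call(self, request):
        # Placeholder for the real work of a storage function.
        return 'ok: ' + request, None
```

In this sketch the status is read on every call only until the first time it is seen as 'running'. After that the handler is replaced, which models the "reload to a version without these checks" step.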

In case the storage functions are not available yet, netbox will return something nasty like:

If the router encounters these errors for any of the vshard.storage functions, or if the vshard.storage functions explicitly return an error about the instance not being 'running', it must put such connections into a backoff state for some time before retrying them. At the same time, when it sees any of these errors, the retry on another instance must be automatic, regardless of the request mode (read or write). These are not network errors, so the requests can be freely retried.
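A minimal sketch of the proposed router behaviour, again as a Python toy model rather than actual vshard code. The `Replica` class, the `router_call` function, the error codes in `RETRYABLE`, and the 5 second default backoff are assumptions made up for illustration. The sketch also ignores the request mode and simply tries replicas in order.

```python
import time

# Hypothetical error codes that mean "the instance is not ready",
# as opposed to network errors.
RETRYABLE = {'NOT_RUNNING', 'NO_SUCH_PROC'}


class Replica:
    def __init__(self, name, handler):
        self.name = name
        # handler(request) -> (result, error); error is None on success.
        self.handler = handler
        # Moment until which the router must not use this replica.
        self.backoff_until = float('-inf')


def router_call(replicas, request, backoff=5.0, now=time.monotonic):
    """Try replicas in order, skipping those which are in backoff."""
    last_err = None
    for replica in replicas:
        if now() < replica.backoff_until:
            # Still in backoff: do not even try this connection.
            continue
        result, err = replica.handler(request)
        if err is None:
            return result, None
        if err['code'] in RETRYABLE:
            # The instance is known to be not ready: put it into backoff
            # and automatically retry on the next replica.
            replica.backoff_until = now() + backoff
            last_err = err
            continue
        # Any other error is returned to the caller as is.
        return None, err
    return None, last_err or {'code': 'NO_REPLICA_AVAILABLE'}
```

With one broken and one healthy replica, the first request touches both, while requests during the backoff period go straight to the healthy one. When the backoff expires, the broken replica is probed again.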

See also #198 and #123.