Closed. Serpentian closed this issue 3 months ago.
> request_timeout

That looks good.
> users will be able to control its value so that they have no failed requests

They will have failed requests anyway, unfortunately. The router can become isolated from the storages, or have wrong credentials, or all storages can be down, or the timeout just might not be enough. I wanted to emphasize with this that the app code must handle errors gracefully and not assume that failures never happen.
> leave as it is now

Yes. If you made it `timeout / replica count` by default, then existing apps could break, since their actual request timeout would be divided by 3 or more relative to what they specify (assuming people have at least 3 replicas).
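To make the compatibility concern concrete, here is a small arithmetic sketch (hypothetical names, not vshard code) of what the proposed default would do to an existing app's per-request budget:

```python
# Illustration only: why defaulting request_timeout to
# timeout / replica_count would silently shrink the per-request
# time budget of existing apps. All names are hypothetical.

def per_request_timeout(total_timeout, replica_count, explicit=None):
    """Return the time budget a single network request would get."""
    if explicit is not None:
        return explicit
    # Proposed default: split the total budget across replicas.
    return total_timeout / replica_count

# An app that today expects each request to wait up to 10 seconds...
old_behavior = per_request_timeout(10.0, 3, explicit=10.0)
# ...would get only ~3.33 seconds per request under the new default.
new_default = per_request_timeout(10.0, 3)

assert old_behavior == 10.0
assert abs(new_default - 10.0 / 3) < 1e-9
```

With 3 or more replicas, a request that used to have the full timeout suddenly has a third of it or less, which is the breakage described above.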
> we should unconditionally lower its priority

Absolutely no 🥺. If the user specified a faulty replica as the highest priority in their config, then we should still try to use it. We can't bend the config as we like permanently. I.e. if the replica with the highest priority in the config is alive, it must be used with the highest priority, 100%. But we can change how exactly we try to use it depending on the situation. Right now if a replica fails, we keep retrying it. The alternative would be to mark it as faulty after some number of failures and stop using it for user requests; the failover fiber would then try to ping it in the background. When the failover ping succeeds, we make the replica usable again with its original priority. Until then, we don't use the connection for user requests.
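A rough sketch of that scheme (this is not the real vshard `replicaset` module; all names and the failure limit are hypothetical): after N consecutive failures a replica is marked faulty and skipped for user requests, and a successful background failover ping restores it with its original configured priority.

```python
# Hypothetical sketch of "mark faulty, recover via failover ping".
# Convention in this toy model: a larger priority number wins.

FAILURE_LIMIT = 3  # assumed threshold, not a real vshard constant

class Replica:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority  # configured priority, never mutated
        self.failures = 0
        self.faulty = False

    def on_request_error(self):
        self.failures += 1
        if self.failures >= FAILURE_LIMIT:
            self.faulty = True  # stop using it for user requests

    def on_failover_ping_ok(self):
        # The failover fiber confirmed the storage is alive again:
        # the replica returns with its original configured priority.
        self.failures = 0
        self.faulty = False

def pick_replica(replicas):
    # Highest configured priority among non-faulty replicas wins.
    alive = [r for r in replicas if not r.faulty]
    return max(alive, key=lambda r: r.priority) if alive else None
```

The key property is that the configured priority itself is never changed; only the replica's eligibility for user requests toggles.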
As the ping you can use a literal ping, or maybe our `vshard.storage._call('info')`. That way we can test whether the vshard storage is activated, not just that bare iproto is up.

Speaking of the backoff: if you are talking about the built-in backoff of the timeouts, it is not relevant here. It only grows and shrinks the timeout assuming the replica is alive. It doesn't change the currently used, most prioritized replica.
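For readers unfamiliar with that mechanism, a hypothetical sketch of a timeout backoff (not vshard's actual implementation) makes the distinction clear: it tunes how long we wait on the same replica, and never switches replicas.

```python
# Hypothetical timeout backoff: grow the timeout after a timed-out
# call, shrink it back after a success. The bounds are illustrative.

MIN_TIMEOUT = 0.5
MAX_TIMEOUT = 10.0

def next_timeout(current, timed_out):
    if timed_out:
        return min(current * 2, MAX_TIMEOUT)  # back off, wait longer
    return max(current / 2, MIN_TIMEOUT)      # recover toward the floor

t = 1.0
t = next_timeout(t, timed_out=True)   # 2.0
t = next_timeout(t, timed_out=True)   # 4.0
t = next_timeout(t, timed_out=False)  # 2.0
```

Nothing in this loop ever demotes or skips a replica, which is why it cannot solve the dead-replica problem discussed in this issue.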
Currently it's impossible to make several requests to an instance whose connection has hung. The router must slightly balance replicas even during `callro`; we'll call this stateless balancing.

The first problem is that a request uses all the remaining time hoping for an answer from a single instance:

https://github.com/tarantool/vshard/blob/8c6dd6289f02e0955013959d42033f0d462fb2b7/vshard/router/init.lua#L624-L627

However, if the connection is not alive but is shown as connected, the request fails with a timed-out error and we cannot make another one, as we have no remaining time.
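A minimal sketch of this failure mode (hypothetical names, not the linked Lua code): each attempt is handed the entire remaining budget, so one hung-but-"connected" replica consumes it all and a healthy replica is never tried.

```python
import time

def call_with_retries(replicas, do_call, timeout):
    # Mirrors the problematic pattern: every attempt may wait for the
    # WHOLE remaining budget, so a hung first replica exhausts it.
    deadline = time.monotonic() + timeout
    for replica in replicas:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None, "timed out"
        ok, result = do_call(replica, remaining)
        if ok:
            return result, None
    return None, "all replicas failed"

def fake_call(replica, timeout):
    if replica == "hung":
        time.sleep(timeout)  # connected, but the answer never arrives
        return False, None
    return True, "ok"

# The healthy replica is never reached: the hung one ate the budget.
result, err = call_with_retries(["hung", "healthy"], fake_call, 0.05)
```

Here `result` is `None` and `err` is `"timed out"` even though a healthy replica was next in line, which is exactly the situation described above.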
I propose to introduce a new option for router calls and name it something like `request_timeout`. It shows how much time a single request has, and it must always be <= the `timeout`. This option will be exported to `crud`, and users will be able to control its value so that they have no failed requests, which is important in mission-critical projects.

The question here is whether we should make the default value equal to e.g. `request_timeout = timeout / <number of replicas in rs>`, or leave it as it is now (just `request_timeout = timeout`). @Gerold103

The second problem we have is that the `replicaset` module itself doesn't change the priority of replicas. So, even if `request_timeout` is less than `timeout`, we'll just make several requests to a dead replica (if the `failover` fiber doesn't wake up between the requests).

https://github.com/tarantool/vshard/blob/8c6dd6289f02e0955013959d42033f0d462fb2b7/vshard/router/init.lua#L710-L714
I suppose that if a request fails, we should unconditionally lower the replica's priority. This will cause constant priority changes on dead replicasets, e.g. when privileges are configured incorrectly (though that case is covered by the backoff procedure), but it will make requests much more solid; they'll fail much more rarely. @Gerold103, your opinion on this?
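The "lower priority on failure" idea could look roughly like this (a hedged sketch, not vshard code; in this toy model a smaller number means more preferred, the configured value is kept separately, and it is restored once the replica answers again):

```python
# Hypothetical sketch of demoting a replica's effective priority on
# failure while preserving the configured priority from the config.

class Replica:
    def __init__(self, name, configured_priority):
        self.name = name
        self.configured_priority = configured_priority  # from the config
        self.effective_priority = configured_priority   # used for routing

def on_request_failed(replica, demote_by=1):
    # Unconditionally demote, so the next request prefers someone else.
    replica.effective_priority += demote_by

def on_request_succeeded(replica):
    # Restore the configured priority once the replica answers again.
    replica.effective_priority = replica.configured_priority

def pick(replicas):
    # Smaller effective priority value = more preferred (toy convention).
    return min(replicas, key=lambda r: r.effective_priority)
```

On a dead replicaset every replica keeps getting demoted in turn, which is the constant priority churn mentioned above; the trade-off is that live replicas are reached after far fewer wasted requests.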