Can't we lean just on `box.info.status`?
NB: My old findings regarding replication monitoring: https://github.com/tarantool/doc/issues/2604.
> Can't we lean just on `box.info.status`?
I assume you meant `upstream.status`?
Still, no. `upstream.status` may be 'follow', but the replica may still have a huge lag behind this upstream: for example, when the master performs a batch of changes which the replica can't apply as fast as the master did.
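For illustration, here is a minimal Lua sketch of reading the raw upstream fields on a replica. The master's replica id (1) is an assumption for the example, not something stated in this thread:

```lua
-- Minimal sketch: read the upstream fields on a replica.
-- Assumes the master's replica id is 1; adjust for your topology.
local info = box.info.replication[1]
local upstream = info and info.upstream
if upstream ~= nil then
    -- 'follow' alone does not guarantee fresh data: lag/idle may still grow.
    print(('status=%s lag=%s idle=%s'):format(
        tostring(upstream.status), tostring(upstream.lag), tostring(upstream.idle)))
end
```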
While fixing the initial problem, I got the following upstream info in my case:
```yaml
id: 1
uuid: bbbbbbbb-bbbb-4000-b000-000000000001
lsn: 60719
upstream:
  peer: admin@localhost:48809
  lag: 0.0002281665802002
  status: stopped
  idle: 169.65149075794
  message: Can't modify data on a read-only instance - box.cfg.read_only is true
```
So monitoring `upstream.lag` doesn't solve the issue, and, as @sergepetrenko mentioned, `upstream.status` is not very reliable either. Looks like it is best to check `upstream.idle`.
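A minimal sketch of such a check; the threshold and function name are arbitrary examples, not values from this thread:

```lua
-- Sketch: consider the replica suspicious if its upstream to the master
-- has been idle for too long. The threshold is an illustrative value.
local MAX_UPSTREAM_IDLE = 30 -- seconds, arbitrary example

local function upstream_is_fresh(master_id)
    local info = box.info.replication[master_id]
    local upstream = info and info.upstream
    if upstream == nil then
        return false -- no upstream to the master at all
    end
    return (upstream.idle or 0) < MAX_UPSTREAM_IDLE
end
```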
> Can't we lean just on `box.info.status`?

> I assume you meant `upstream.status`?
No, but I got the answer :)
The way I imagine it, Tarantool should have some `max_data_lag` option to go into a 'stale data' state in such cases and broadcast the new state to clients. As it is now, we are going to solve the same monitoring task in each client at the application level.
> Looks like it is best to check `upstream.idle`.
It should be a combination of both lag and idle, whichever is greater. Or, as @Totktonada suggests, we may look at `box.info.status == 'orphan'`, but then we have to make replicas go orphan when they see they have stopped catching up with the master (and this is a breaking change).
Or it can be `box.info.status == 'stale_data'`.
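A sketch of the "whichever is greater" combination suggested above; the function name and the threshold are placeholders:

```lua
-- Sketch: combine lag and idle and compare the larger of the two
-- against a threshold. The threshold is an illustrative value.
local MAX_DATA_LAG = 10 -- seconds, placeholder

local function replica_is_stale(master_id)
    local info = box.info.replication[master_id]
    local upstream = info and info.upstream
    if upstream == nil then
        return true -- no connection to the master: assume stale
    end
    return math.max(upstream.lag or 0, upstream.idle or 0) > MAX_DATA_LAG
end
```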
After discussing with @sergepetrenko, it looks like we should go with the following condition for an unhealthy replica: if the status of the upstream to the master is not `follow`, or `upstream.lag` is > N (where N will be a constant for now), then we consider the replica not healthy. If somebody asks, we may introduce an additional callback to mark a replica as not healthy, e.g. to avoid routing requests to a replica whose cartridge config has not been properly applied.
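A sketch of that condition as Lua code; `N` and the function name are placeholders, not an actual vshard API:

```lua
-- Sketch of the proposed condition: the replica is unhealthy if the
-- upstream to the master is not in 'follow' state or the lag exceeds N.
local N = 30 -- placeholder constant, seconds

local function replica_is_healthy(master_id)
    local info = box.info.replication[master_id]
    local upstream = info and info.upstream
    if upstream == nil or upstream.status ~= 'follow' then
        return false
    end
    return (upstream.lag or 0) <= N
end
```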
I see this feature as an automatic (temporary) lowering of the replica's priority on the router if the replica is not healthy. We'll raise the priority back once the node considers itself healthy. No additional cfg parameters for now. @Gerold103, your opinion on this?
What is "priority"? If it's something like the probability of sending a request to this host, then I would prefer the probability to be zero, if we agreed that too laggy replicas are not suitable for processing requests at all. Otherwise we will receive "flashing" data depending on which replica the request lands on.
> What is "priority"?
First we try to make the request to the most prioritized replica, and then to the other ones if the most prioritized one failed. If you don't want a replica to serve any kind of requests, you can just disable the storage. But most users would prefer the request not to fail, so we go to the unhealthy replica if requests to the other instances fail.
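To make the fallback behaviour concrete, a rough sketch (not vshard's actual router code; `replica:call()` stands in for whatever connection API is used):

```lua
-- Rough sketch of priority-based routing with fallback: try replicas
-- from the highest priority to the lowest, move on when a call fails.
local function call_ro(replicas_by_priority, func, args)
    local last_err
    for _, replica in ipairs(replicas_by_priority) do
        -- replica:call() is a placeholder for the real connection API.
        local ok, res = pcall(replica.call, replica, func, args)
        if ok then
            return res
        end
        last_err = res
    end
    return nil, last_err
end
```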
Here's the conflict between consistency and availability
Sounds all good to me.
Currently, any read call routed to a replica will be executed regardless of whether the replica is connected to any other instance. In my case:

1) Have a shard with 3 instances, one (A) being the master (`box.info.ro == false`), the others (B) and (C) being replicas (`box.info.ro == true`).
2) For some reason unrelated to vshard, all appliers on replica (B) broke down and were unable to recover without a manual instance restart. All `box.info.replication[n].upstream.status` are `disconnected`. (NB: don't use DML in triggers on ro instances if the row is received via applier.)
3) All reads routed to replica (B) returned very old data (instance (B) was broken for several hours).

We could prevent such dirty reads using the data that is already available, as stated in our docs.
It would be great if we could prevent those dirty reads by not routing any requests to replicas (ro instances) which are disconnected from everyone, or maybe even give the ability to configure some `max_replication_idle` or `max_replication_lag` and return errors when trying to process a request on a replica whose current `max(replication[n].upstream.idle/lag)` exceeds the configured value.
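As an illustration of what such a guard could look like if done in application code today; `max_replication_idle` / `max_replication_lag` here are the hypothetical options from this proposal, not existing `box.cfg` parameters:

```lua
-- Sketch of the proposed guard, implemented at the application level.
-- The limits below are hypothetical settings, not real box.cfg options.
local limits = {max_replication_idle = 60, max_replication_lag = 10}

local function assert_replica_is_fresh()
    for _, r in pairs(box.info.replication) do
        local u = r.upstream
        if u ~= nil and ((u.idle or 0) > limits.max_replication_idle
                or (u.lag or 0) > limits.max_replication_lag) then
            error('replica is too far behind its upstream to serve reads')
        end
    end
end
```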