Open TarantoolBot opened 2 years ago
Found #355.
And now my understanding is the following. Say, we have a master and a replica.
I'll highlight: it is just my understanding of docs, the issue above and the related commit message without even looking into the code. Every statement below may be a mistake or be inaccurate.
Let's imagine the following timeline:
`<downstream>.idle` is current time minus the last seen time of the replica. It goes up if there are no interactions with the replica.
TBD: What if the master doesn't see heartbeats for a long time?
Next, another timeline:
`<upstream>.lag` is the difference between writing an operation to the WAL on the master and receiving it on the replica — not counting the difference in clocks. And another case, when the master doesn't serve write requests for a long time: `<upstream>.lag` is the difference between the heartbeat generation on the master and receiving it on the replica. So `<upstream>.lag` is effectively the last known network lag between the master and the replica. The 'last known' remark means that it may get stuck at some 'looks good' value if the network is broken.
TBD: Or not? If the network is broken, will we see it somehow differently?
I don't see a good explanation of how `<upstream>.idle` works, but I can guess. On the two timelines above the last seen time of the master is updated upon receiving an operation or a heartbeat (together with `<upstream>.lag`).
So `<upstream>.idle` is current time minus the last seen time of the master. It goes up if there are no interactions with the master.
TBD: What if the replica doesn't see heartbeats for a long time?
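As a rough model of the description above (a hypothetical helper, not Tarantool's actual implementation — the parameter names are my own):

```python
import time

def idle(last_seen_time, now=None):
    """Seconds since the peer (master or replica) was last heard from,
    via an operation or a heartbeat. Models <upstream>.idle and
    <downstream>.idle as described above."""
    if now is None:
        now = time.monotonic()
    return now - last_seen_time

# The counter grows while there are no interactions with the peer:
print(idle(last_seen_time=100.0, now=103.5))  # → 3.5
```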
What am I doing here? I'm looking for some 'common sense' criteria of a healthy instance to implement tracking in connectors and prevent a user from seeing stale data.
A master is the bleeding edge of our data; it is not stale by definition.
However, with automatic leader election we can meet a situation when the leader loses connectivity to a quorum of instances, another leader is elected, and the old one doesn't know about this.
We should mark the old leader as unhealthy if another leader was elected in a newer raft term (it's kind of an epoch; two kings can't be on a throne in one epoch). If the old leader recognizes itself as a follower, it becomes healthy again (but as a follower now).
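A sketch of that rule (the structure is an assumption: I pass the election state and terms as plain values rather than reading them from the server):

```python
def leader_is_healthy(state, own_term, max_seen_term):
    """An old leader is unhealthy once a newer raft term is observed
    (another leader was elected in that term). An instance that already
    recognizes itself as a follower passes this criterion."""
    if state != "leader":
        return True
    return own_term >= max_seen_term

print(leader_is_healthy("leader", 5, 6))    # → False (deposed old leader)
print(leader_is_healthy("follower", 5, 6))  # → True (healthy, as follower)
```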
A replica is okay if it is on track with its master(s).
We should look at the maximal `<upstream>.idle` over all upstreams. If it is below some threshold, the replica is updated regularly (at least successfully pinged by all masters).
However, it does NOT reveal a large-latency situation. We can receive updates from a master regularly, but with a large delay. So we should also look at the maximal `<upstream>.lag` over all upstreams.
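Combining both criteria in a connector could look like this (a sketch with assumed thresholds and field names mimicking the `upstream` objects of `box.info.replication`):

```python
def replica_is_healthy(upstreams, idle_threshold=1.0, lag_threshold=1.0):
    """Apply both criteria above: the replica must be updated regularly
    (max idle below threshold) AND with a small delay (max lag below
    threshold), over all upstreams."""
    if not upstreams:
        return False
    max_idle = max(u["idle"] for u in upstreams)
    max_lag = max(u["lag"] for u in upstreams)
    return max_idle < idle_threshold and max_lag < lag_threshold

# Regular heartbeats but a large delay: idle looks fine, lag reveals it.
print(replica_is_healthy([{"idle": 0.1, "lag": 0.01}]))  # → True
print(replica_is_healthy([{"idle": 0.1, "lag": 5.0}]))   # → False
```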
We can look at it from different angles:
However, say, the instance acts as a master (and has an upstream; say the replicaset is in full mesh). The connectivity with the upstream becomes broken. So what? The instance contains the freshest data anyway.
So, maybe: `idle` and `lag` as written above. (I need to think about pitfalls here.)
Here each instance acts as both a replica and a master, so we should apply both criteria. Since our automatic leader election does not support master-master, it means that we'll effectively apply the replica's criteria here.
There are points that are applicable for any instance: for a master as well as for a replica. At least, we generally should not execute requests until the database is fully bootstrapped (recovered from disk or from a master). We should look at `box.info.status`, and if it is `running`, then we are at least bootstrapped.
Other status values are `loading` (not fully bootstrapped), `orphan` (not joined, but can try again, according to the docs), `hot_standby` (we shouldn't see it from a connector, because, AFAIU, it does not accept iproto requests; but I filed #2605 to make it clear). All of them are unhealthy in the sense that a connector shouldn't issue requests against them.
In fact, that's strange that an instance serves requests before a full bootstrap: it leads to problems like the following:
And extra code is needed to handle it.
Of course, some service requests should be processed before bootstrap: monitoring requests, replication join requests, likely some others. But allowing access to data (or app logic) in this state by default was a mistake, I think.
Take this from trainings:

- `box.info.replication`
- `upstream.status` = `follow`
- `lag` < 1s
- `idle`: on status change[^1]

[^1]: It means that we can ignore `idle`, because it is reflected in `status`. — Alexander Turenko
https://www.tarantool.io/en/doc/latest/book/replication/repl_monitoring/
Do I understand right: replication idle and replication lag are the same, except that replication lag tracks only WAL writes and is not updated with heartbeats?
On https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_info/replication/ I see that both lag and idle are in the upstream object (on replica), but the downstream object (on master) has only idle.
To be honest, the documentation does not give me a predicate that I should use to decide whether an instance is healthy. It also does not reveal details of how exactly the two given metrics work, so I can't construct this predicate myself.
(Filed by @Totktonada.)