tarantool / doc

Tarantool documentation
https://www.tarantool.io/en/doc/

feedback: Monitoring a replica set | Tarantool #2604

Open TarantoolBot opened 2 years ago

TarantoolBot commented 2 years ago

<…>istics for the other two masters, given in regard to master #1. |The primary indicators of replication health| are:

idle, the time (in seconds) since the instance received t<…>

https://www.tarantool.io/en/doc/latest/book/replication/repl_monitoring/

Do I understand right: replication idle and replication lag are the same, except that replication lag tracks only WAL writes and is not updated with heartbeats?

On https://www.tarantool.io/en/doc/latest/reference/reference_lua/box_info/replication/ I see that both lag and idle are in the upstream object (on replica), but the downstream object (on master) has only idle.
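For reference, a minimal inspection snippet (assuming the field layout described on the box.info.replication reference page) that prints both metrics on whichever side they exist:

```lua
-- Run in the Tarantool console. On a replica the upstream table describes
-- the connection to a master (and has both lag and idle); on a master the
-- downstream table describes the connection to a replica (and, at the time
-- of writing, has only idle).
for id, r in pairs(box.info.replication) do
    if r.upstream ~= nil then
        print(('upstream %d: lag=%s idle=%s status=%s'):format(
            id, tostring(r.upstream.lag), tostring(r.upstream.idle),
            tostring(r.upstream.status)))
    end
    if r.downstream ~= nil then
        print(('downstream %d: idle=%s status=%s'):format(
            id, tostring(r.downstream.idle), tostring(r.downstream.status)))
    end
end
```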

To be honest, the documentation does not give me a predicate that I could use to decide whether an instance is healthy. It also does not reveal the details of how exactly the two given metrics work, so I can't construct this predicate myself.

(Filed by @Totktonada.)

Totktonada commented 2 years ago

Found #355.

And now my understanding is the following. Say, we have a master and a replica.

I'll highlight: this is just my understanding of the docs, the issue above, and the related commit message, without even looking into the code. Every statement below may be mistaken or inaccurate.

How the master tracks the regularity of communications

Let's imagine the following timeline:

  1. The replica sends a heartbeat.
  2. The master receives it and updates the last seen time of the replica.

<downstream>.idle is the current time minus the last seen time of the replica. It goes up if there are no interactions with the replica.

TBD: What if the master doesn't see heartbeats for a long time?
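A minimal sketch of this check on the master side; the 10-second threshold and the function name are illustrative placeholders, not recommendations:

```lua
-- Run on the master: list replicas the master has not heard from for too long.
local IDLE_THRESHOLD = 10  -- seconds; an arbitrary example value

local function stale_replicas()
    local stale = {}
    for id, r in pairs(box.info.replication) do
        if r.downstream ~= nil and r.downstream.idle ~= nil
                and r.downstream.idle > IDLE_THRESHOLD then
            table.insert(stale, id)
        end
    end
    return stale
end
```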

How the replica tracks network latency

Next, another timeline:

  1. The master executes DML/DDL and writes it to WAL.
  2. The master sends the operation to the replica together with the time of writing it to the WAL.
  3. The replica receives the operation.
  4. The replica updates <upstream>.lag with the difference between the WAL write time on the master and the receive time on the replica, not accounting for the difference in clocks.
  5. The replica writes the operation to WAL.

And another case, when the master doesn't serve write requests for a long time:

  1. The master finds that it hasn't sent anything to the replica for a long time. It sends a heartbeat with the current time.
  2. The replica receives it.
  3. The replica updates <upstream>.lag with the difference between the heartbeat generation time on the master and the receive time on the replica.

So <upstream>.lag is effectively the last known network lag between the master and the replica. The 'last known' remark means that it may get stuck at some 'looks good' value if the network is broken.

TBD: Or not? If the network is broken, will we see it somehow differently?
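A sketch of reading this metric on the replica side; upstream_lags is a hypothetical helper name. Given the caveat above, the values should be combined with idle/status checks rather than trusted on their own:

```lua
-- Run on the replica: report the last known lag per upstream.
local function upstream_lags()
    local lags = {}
    for id, r in pairs(box.info.replication) do
        if r.upstream ~= nil then
            lags[id] = r.upstream.lag
        end
    end
    return lags
end
```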

How the replica tracks the regularity of communications

I don't see a good explanation of how <upstream>.idle works, but I can guess. On the two timelines above, the last seen time of the master is updated upon receiving an operation or a heartbeat (together with <upstream>.lag).

So <upstream>.idle is the current time minus the last seen time of the master. It goes up if there are no interactions with the master.

TBD: What if the replica doesn't see heartbeats for a long time?


What am I doing here? I'm looking for some 'common sense' criteria of a healthy instance, to implement some tracking in connectors and prevent a user from seeing stale data.

Healthy master

A master is the bleeding edge of our data; it is not stale by definition.

However, with automatic leader election we can meet a situation when the leader loses connectivity to a quorum of instances, another leader is elected, and the old one doesn't know about this.

We should mark the old leader as unhealthy if another leader was elected in a newer raft term (it's kind of an epoch; two kings can't be on the throne in one epoch). If the old leader recognizes itself as a follower, it becomes healthy again (but as a follower now).
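A sketch of this criterion from the connector's point of view, assuming box.info.election is available (it exposes state and term when automatic leader election is used) and assuming we have already fetched its value from every reachable instance; healthy_leader_uris and election_by_uri are hypothetical names:

```lua
-- An instance that reports election state 'leader' is trusted as a master
-- only if no other reachable instance reports a newer raft term.
local function healthy_leader_uris(election_by_uri)
    local max_term = 0
    for _, e in pairs(election_by_uri) do
        if e.term > max_term then
            max_term = e.term
        end
    end
    local leaders = {}
    for uri, e in pairs(election_by_uri) do
        if e.state == 'leader' and e.term == max_term then
            table.insert(leaders, uri)
        end
    end
    return leaders
end
```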

Healthy replica

A replica is okay if it keeps up with its master(s).

We should look at the maximal <upstream>.idle over all upstreams. If it is below some threshold, the replica is updated regularly (or at least is successfully pinged by all masters).

However, it does NOT reveal a large-latency situation: we can receive updates from a master regularly, but with a large delay. So we should also look at the maximal <upstream>.lag over all upstreams.
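A sketch of the resulting replica-side predicate; the thresholds and the function name are illustrative placeholders:

```lua
local IDLE_THRESHOLD = 10  -- seconds since the last packet from a master
local LAG_THRESHOLD = 1    -- seconds between a WAL write on the master and its receipt here

local function is_healthy_replica()
    for _, r in pairs(box.info.replication) do
        local u = r.upstream
        if u ~= nil then
            if u.idle ~= nil and u.idle > IDLE_THRESHOLD then
                return false
            end
            if u.lag ~= nil and u.lag > LAG_THRESHOLD then
                return false
            end
        end
    end
    return true
end
```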

What is a replica

We can look at it from different angles:

However, say the instance acts as a master (and has an upstream; say the replica set is a full mesh). The connectivity with the upstream becomes broken. So what? The instance contains the freshest data anyway.

So, maybe:

(I need to think about pitfalls here.)

Master-master

Here each instance acts as both a replica and a master, so we should apply both criteria. Since our automatic leader election does not support master-master, we'll effectively apply only the replica criteria here.

Any instance health

There are points that are applicable to any instance, a master as well as a replica. At the very least, we generally should not execute requests until the database is fully bootstrapped (recovered from disk or from a master). We should look at box.info.status: if it is running, then we are at least bootstrapped.

Other status values are loading (not fully bootstrapped), orphan (not joined, but can try again, according to the docs), and hot_standby (we shouldn't see it from a connector because, AFAIU, it does not accept iproto requests; but I filed #2605 to make it clear). All of them are unhealthy in the sense that a connector shouldn't issue requests against them.
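A sketch of this common check; is_ready_for_requests is a hypothetical name:

```lua
-- A connector should hold data requests unless the instance reports 'running';
-- 'loading', 'orphan' and 'hot_standby' are all treated as not ready here.
local function is_ready_for_requests()
    return box.info.status == 'running'
end
```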


In fact, it's strange that an instance serves requests before a full bootstrap: it leads to problems like the following:

And extra code is needed to handle it.

Of course, some service requests should be processed before bootstrap: monitoring requests, replication join requests, and likely some others. But allowing data (or app logic) access in this state by default was a mistake, I think.

Mons commented 2 years ago

Take this from trainings:

[^1]: It means that we can ignore idle, because it is reflected in status. — Alexander Turenko