tarantool / vshard

The new generation of sharding based on virtual buckets

Track replication upstream.idle when calling on replica #453

Closed: grafin closed this issue 1 week ago

grafin commented 11 months ago

Currently, any read call routed to a replica is executed regardless of whether the replica is connected to any other instance. In my case:

1) I have a shard with 3 instances, one (A) being the master (box.info.ro == false) and the others (B) and (C) being replicas (box.info.ro == true).
2) For a reason unrelated to vshard, all appliers on replica (B) broke down and could not recover without a manual instance restart. All box.info.replication[n].upstream.status values were 'disconnected'. (Note: don't use DML in triggers on ro instances if the row is received via the applier.)
3) All reads routed to replica (B) returned very old data (instance (B) was broken for several hours).

We could prevent such dirty reads using data that is already available, as stated in our docs:

Therefore, in a healthy replication setup, idle should never exceed replication_timeout: if it does, either the replication is lagging seriously behind, because the master is running ahead of the replica, or the network link between the instances is down.

It would be great if we could prevent those dirty reads by not routing any request to replicas (ro instances) that are disconnected from everyone. Or we could even give users the ability to configure some max_replication_idle or max_replication_lag and return errors when trying to process a request on a replica whose current max(replication[n].upstream.idle/lag) exceeds the configured value.
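What such a check could look like on the storage side, as a minimal sketch: the max_replication_idle and max_replication_lag thresholds below are the hypothetical options proposed above, not existing vshard or Tarantool settings, and the logic only reads the standard box.info.replication fields.

    -- Hypothetical thresholds named after the proposal above; neither is an
    -- existing vshard or Tarantool option.
    local max_replication_idle = 10 -- seconds
    local max_replication_lag = 10  -- seconds

    -- Returns true if any upstream of this instance exceeds the thresholds.
    local function replica_data_is_stale()
        for _, r in pairs(box.info.replication) do
            local u = r.upstream
            -- The entry describing the instance itself has no upstream.
            if u ~= nil and ((u.idle or 0) > max_replication_idle or
                             (u.lag or 0) > max_replication_lag) then
                return true
            end
        end
        return false
    end

A read handler could then refuse the request (or the router could skip the replica) whenever replica_data_is_stale() returns true.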

Totktonada commented 11 months ago

Can't we lean just on box.info.status?

NB: My old findings regarding replication monitoring: https://github.com/tarantool/doc/issues/2604.

sergepetrenko commented 11 months ago

Can't we lean just on box.info.status?

I assume you meant upstream.status? Still, no. upstream.status may be 'follow', but the replica may still have a huge lag to this upstream, for example when the master performs a batch of changes which the replica can't apply as fast as the master did.

grafin commented 11 months ago

While fixing the initial problem, I got the following upstream info in my case:

id: 1
    uuid: bbbbbbbb-bbbb-4000-b000-000000000001
    lsn: 60719
    upstream:
      peer: admin@localhost:48809
      lag: 0.0002281665802002
      status: stopped
      idle: 169.65149075794
      message: Can't modify data on a read-only instance - box.cfg.read_only is true

So monitoring upstream.lag doesn't solve the issue, and, as @sergepetrenko mentioned, upstream.status is not very effective either. It looks like it is best to check upstream.idle.
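A short sketch of that check against the fields shown above, using plain box.info (the 60-second threshold and the replica id 1 are chosen only for illustration):

    -- Direct lookup of the upstream shown above (id = 1); plain box.info,
    -- no vshard involved.
    local upstream = box.info.replication[1].upstream
    if upstream ~= nil and upstream.idle > 60 then
        -- idle has been growing for minutes while lag is still a fraction
        -- of a millisecond, so only idle reveals the stuck applier.
        error('replica is stale')
    end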

Totktonada commented 11 months ago

Can't we lean just on box.info.status?

I assume you meant upstream.status?

No, but I got the answer :)

In my imagination, Tarantool should have some max_data_lag option that would make it go into some 'stale data' state in such cases and broadcast the new state to clients.

As it is now, we are going to solve the same monitoring task in every client at the application level.

sergepetrenko commented 11 months ago

Looks like it is best to check upstream.idle.

It should be a combination of both lag and idle, whichever is greater. Or, as @Totktonada suggests, we may look at box.info.status == 'orphan', but then we have to make replicas go orphan when they see they have stopped catching up with the master (and that is a breaking change).

Or it can be box.info.status == 'stale_data'.
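A sketch of the "whichever is greater" combination (the 30-second threshold is an arbitrary placeholder, and neither it nor a 'stale_data' status exists today):

    -- Take the worst of lag and idle across all upstreams of this instance;
    -- the threshold is a placeholder, not an existing option.
    local threshold = 30 -- seconds

    local function replication_delay()
        local worst = 0
        for _, r in pairs(box.info.replication) do
            local u = r.upstream
            if u ~= nil then
                worst = math.max(worst, u.lag or 0, u.idle or 0)
            end
        end
        return worst
    end

    -- The replica would be treated as stale when
    -- replication_delay() > threshold.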

Serpentian commented 2 months ago

After discussing with @sergepetrenko, it looks like we should go with the following condition for an unhealthy replica: if the status of the upstream to the master is not 'follow', or upstream.lag is > N (where N will be a constant for now), then we consider the replica unhealthy (see the sketch below). If somebody asks, we may introduce an additional callback for marking a replica as unhealthy, e.g. to avoid routing requests to a replica whose cartridge config has not been properly applied.

I see this feature as a temporary automatic lowering of the replica's priority on the router if the replica is not healthy. We'll raise the priority back once the node considers itself healthy. No additional cfg parameters for now. @Gerold103, your opinion on this?
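A rough sketch of that health condition as it might look on the storage side, assuming the master's replication id is known (master_id below is a placeholder, and N is the constant mentioned above with an arbitrary value here):

    -- N is the constant mentioned above; the value here is arbitrary.
    local N = 30 -- seconds

    -- master_id is a placeholder for however the storage identifies its
    -- master in box.info.replication.
    local function is_healthy(master_id)
        local entry = box.info.replication[master_id]
        local u = entry and entry.upstream
        if u == nil then
            return false
        end
        return u.status == 'follow' and (u.lag or 0) <= N
    end

On the router side the result would only lower the replica's priority temporarily, as described above, rather than exclude the replica completely.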

R-omk commented 2 months ago

What is "priority"? If it's something like the probability of sending a request to this host, then I would prefer the probability to be zero once we've agreed that replicas lagging too far behind are not suitable for processing requests at all. Otherwise we will receive "flashing" data depending on which replica the request lands on.

Serpentian commented 2 months ago

what is "priority"?

First we try to make the request to the most prioritized replica, and then to the other ones if the most prioritized one failed. If you don't want a replica to serve any requests at all, you can just disable the storage. But most users would prefer the request not to fail, so we go to the unhealthy replica if requests to the other instances fail.

This is the conflict between consistency and availability.
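A conceptual sketch (not vshard code) of that fallback: replicas are tried in priority order, so an unhealthy replica only serves a request after every healthier one has failed. replica:call below is a placeholder for whatever transport the router actually uses.

    -- replicas_by_priority: healthy replicas first, unhealthy ones last.
    local function call_on_best_replica(replicas_by_priority, func, args)
        local last_err
        for _, replica in ipairs(replicas_by_priority) do
            -- replica:call is a placeholder for the real router transport.
            local ok, res = pcall(replica.call, replica, func, args)
            if ok then
                return res
            end
            last_err = res
        end
        return nil, last_err
    end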

Gerold103 commented 2 months ago

Sounds all good to me.