vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: PRS promotes replica that has not caught up to old primary as new primary #14738

Closed deepthi closed 7 months ago

deepthi commented 9 months ago

Overview of the Issue

We ran into an issue where promoting a replica to primary via PlannedReparentShard (PRS) succeeded even though the new primary had not actually caught up to the position of the old primary. It was missing several thousand transactions.

Reproduction Steps

This is non-trivial to reproduce: it needs a decent amount of load, or some other condition that makes the replica lag. One way to make the replica lag is to use it to take a backup. As soon as the backup is complete, while the replica is still lagging, use PRS to promote it. Necessary pre-condition: the lag must be high enough that the replica cannot catch up within the time allowed (wait-replicas-timeout). Making wait-replicas-timeout small (like 1 second) will probably help to reproduce.

Binary Version

main for now, will check other versions and update.

Operating System and Environment details

any

Log Fragments

2023-12-08 22:45:20.064 
I1208 22:45:20.064349       1 rpc_replication.go:199] WaitForPosition: <redacted>
2023-12-08 22:45:49.097 
I1208 22:45:49.097332       1 rpc_replication.go:867] PromoteReplica

Note the time difference - almost 30 seconds, which is the amount of time allowed by default.

deepthi commented 9 months ago

> main for now, will check other versions and update.

The bug is present on all release branches, but not in any released version. We will be fixing this on all branches. However, in addition to fixing how we handle the return values from each flavor, we should also add a check in PRS after WaitForPosition to make sure that the replica did in fact reach the desired position.

GuptaManan100 commented 9 months ago

This part (https://github.com/vitessio/vitess/issues/14738#issuecomment-1848686929) has not been implemented yet, so reopening.