vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Polling ReplicaTracker Does Not Handle Fatal Errors Properly #13722

Open mattlord opened 1 year ago

mattlord commented 1 year ago

Overview of the Issue

In the polling (non-heartbeat) replication reporter, we do not currently handle fatal errors. The only known one so far is ER_MASTER_FATAL_ERROR_READING_BINLOG (error 1236): https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_master_fatal_error_reading_binlog

This causes us to report a replica as healthy for long periods when it is not healthy and will not become healthy on its own (the exception being a PRS/ERS where the new primary does have the needed binary log events).

Reproduction Steps

git checkout main

## Apply this patch to enable the polling based replication reporter 
diff --git a/examples/common/scripts/vttablet-up.sh b/examples/common/scripts/vttablet-up.sh
index 5a4917429e..e4e05d6c39 100755
--- a/examples/common/scripts/vttablet-up.sh
+++ b/examples/common/scripts/vttablet-up.sh
@@ -54,9 +54,10 @@ vttablet \
  --service_map 'grpc-queryservice,grpc-tabletmanager,grpc-updatestream' \
  --pid_file $VTDATAROOT/$tablet_dir/vttablet.pid \
  --vtctld_addr http://$hostname:$vtctld_web_port/ \
- --heartbeat_enable \
+ --enable_replication_reporter=true \
  --heartbeat_interval=250ms \
  --heartbeat_on_demand_duration=5s \
+ --unhealthy_threshold=10m \
  > $VTDATAROOT/$tablet_dir/vttablet.out 2>&1 &

 # Block waiting for the tablet to be listening

make build

pushd examples/local

./101_initial_cluster.sh

vtctldclient ChangeTabletType $(vtctldclient GetTablets --tablet-type rdonly | awk '{print $1}') replica

../common/scripts/vtorc-down.sh

let replicauid=$(vtctldclient GetTablets --tablet-type replica | head -1 | awk '{print $1}' | cut -d'-' -f2)+0
let primaryuid=$(vtctldclient GetTablets --tablet-type primary | head -1 | awk '{print $1}' | cut -d'-' -f2)+0

command mysql -u root --socket=${VTDATAROOT}/vt_0000000${replicauid}/mysql.sock --binary-as-hex=false -e "stop replica io_thread"

sleep 15

mysql < ../common/insert_commerce_data.sql

sleep 15

command mysql -u root --socket=${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock --binary-as-hex=false -e "flush binary logs; purge binary logs before now();"
sleep 10
command mysql -u root --socket=${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock --binary-as-hex=false -e "flush binary logs; purge binary logs before now();"

command mysql -u root --socket=${VTDATAROOT}/vt_0000000${replicauid}/mysql.sock --binary-as-hex=false -e "start replica io_thread"

command mysql -u root --socket=${VTDATAROOT}/vt_0000000${replicauid}/mysql.sock --binary-as-hex=false -e "show replica status\G" | grep Last_IO_Error

You will see that the status is reported as healthy until we go past unhealthy_threshold:

while true; do
  curl -s "http://$(vtctldclient GetTablets --tablet-type replica | head -1 | awk '{print $5}')/debug/status_details"
  sleep 1
done

Binary Version

Version: 18.0.0-SNAPSHOT (Git revision 82bf6bec3e1134de3c7740433df554988cc438de branch 'main') built on Fri Aug  4 14:14:55 EDT 2023 by matt@pslord.local using go1.20.6 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response

mattlord commented 1 year ago

I think that we want to do something like this:

diff --git a/go/mysql/replication_status.go b/go/mysql/replication_status.go
index ff06d559a5..137ec057aa 100644
--- a/go/mysql/replication_status.go
+++ b/go/mysql/replication_status.go
@@ -18,6 +18,7 @@ package mysql

 import (
    "fmt"
+   "strings"

    replicationdatapb "vitess.io/vitess/go/vt/proto/replicationdata"
    "vitess.io/vitess/go/vt/vterrors"
@@ -219,3 +220,12 @@ func (s *ReplicationStatus) FindErrantGTIDs(otherReplicaStatuses []*ReplicationS

    return diffSet, nil
 }
+
+// HasFatalError returns true if the replication status has a
+// fatal error -- which should cause the status to be reflected
+// as unhealthy, regardless of the lag. This error should also
+// be passed to the caller/user as this will require some sort
+// of manual intervention.
+func (s *ReplicationStatus) HasFatalError() bool {
+   return s.LastIOError != "" && strings.Contains(s.LastIOError, "Got fatal error 1236")
+}
diff --git a/go/vt/vttablet/tabletserver/repltracker/poller.go b/go/vt/vttablet/tabletserver/repltracker/poller.go
index ace01dffb2..b41172a7f1 100644
--- a/go/vt/vttablet/tabletserver/repltracker/poller.go
+++ b/go/vt/vttablet/tabletserver/repltracker/poller.go
@@ -59,7 +59,14 @@ func (p *poller) Status() (time.Duration, error) {
        if p.timeRecorded.IsZero() {
            return 0, vterrors.Errorf(vtrpcpb.Code_UNAVAILABLE, "replication is not running")
        }
-       return time.Since(p.timeRecorded) + p.lag, nil
+       estimatedLag := time.Since(p.timeRecorded) + p.lag
+       if status.HasFatalError() {
+           // This is a fatal error that we will not recover from, so pass it
+           // on to the caller as this replica is not healthy regardless of
+           // the replication lag.
+           return estimatedLag, vterrors.Errorf(vtrpcpb.Code_UNAVAILABLE, "%s", status.LastIOError)
+       }
+       return estimatedLag, nil
    }

    p.lag = time.Duration(status.ReplicationLagSeconds) * time.Second
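A quick way to sanity-check the proposed HasFatalError logic outside the tree (the struct is reduced here to just the field the check inspects; names mirror the patch above):

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicationStatus is trimmed down to the single field the proposed check
// uses (the real struct lives in go/mysql).
type ReplicationStatus struct {
	LastIOError string
}

// HasFatalError mirrors the logic in the patch above: MySQL reports
// ER_MASTER_FATAL_ERROR_READING_BINLOG as "Got fatal error 1236 ..." in
// Last_IO_Error, and that error does not clear on its own.
func (s *ReplicationStatus) HasFatalError() bool {
	return s.LastIOError != "" && strings.Contains(s.LastIOError, "Got fatal error 1236")
}

func main() {
	broken := &ReplicationStatus{LastIOError: "Got fatal error 1236 from source when reading data from binary log"}
	ok := &ReplicationStatus{}
	fmt.Println(broken.HasFatalError()) // true
	fmt.Println(ok.HasFatalError())     // false
}
```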

deepthi commented 1 year ago

I think the argument against the proposed fix is that if such an error occurs, you may want the replica to continue serving UNTIL it reaches the unhealthy_threshold, which gives you time to remediate the error. Though it can be argued that when an error like this occurs, there is no remediation that does not involve taking the replica offline.