mattlord opened 1 year ago
I think that we want to do something like this:
```diff
diff --git a/go/mysql/replication_status.go b/go/mysql/replication_status.go
index ff06d559a5..137ec057aa 100644
--- a/go/mysql/replication_status.go
+++ b/go/mysql/replication_status.go
@@ -18,6 +18,7 @@ package mysql
 
 import (
 	"fmt"
+	"strings"
 
 	replicationdatapb "vitess.io/vitess/go/vt/proto/replicationdata"
 	"vitess.io/vitess/go/vt/vterrors"
@@ -219,3 +220,12 @@ func (s *ReplicationStatus) FindErrantGTIDs(otherReplicaStatuses []*ReplicationS
 
 	return diffSet, nil
 }
+
+// HasFatalError returns true if the replication status has a
+// fatal error -- which should cause the status to be reflected
+// as unhealthy, regardless of the lag. This error should also
+// be passed to the caller/user as this will require some sort
+// of manual intervention.
+func (s *ReplicationStatus) HasFatalError() bool {
+	return strings.Contains(s.LastIOError, "Got fatal error 1236")
+}
diff --git a/go/vt/vttablet/tabletserver/repltracker/poller.go b/go/vt/vttablet/tabletserver/repltracker/poller.go
index ace01dffb2..b41172a7f1 100644
--- a/go/vt/vttablet/tabletserver/repltracker/poller.go
+++ b/go/vt/vttablet/tabletserver/repltracker/poller.go
@@ -59,7 +59,14 @@ func (p *poller) Status() (time.Duration, error) {
 		if p.timeRecorded.IsZero() {
 			return 0, vterrors.Errorf(vtrpcpb.Code_UNAVAILABLE, "replication is not running")
 		}
-		return time.Since(p.timeRecorded) + p.lag, nil
+		estimatedLag := time.Since(p.timeRecorded) + p.lag
+		if status.HasFatalError() {
+			// This is a fatal error that we will not recover from, so pass it
+			// on to the caller as this replica is not healthy regardless of
+			// the replication lag.
+			return estimatedLag, vterrors.New(vtrpcpb.Code_UNAVAILABLE, status.LastIOError)
+		}
+		return estimatedLag, nil
 	}
 
 	p.lag = time.Duration(status.ReplicationLagSeconds) * time.Second
```
I think the argument against the proposed fix is that if such an error occurs, you may want the replica to continue serving *until* it reaches the `unhealthy_threshold`, which gives you time to remediate the error. Though it can be argued that when an error like this occurs, there is no remediation that does not involve taking the replica offline.
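For completeness, here is a minimal sketch of how the proposed helper could be unit tested in `go/mysql`, assuming the diff above is applied. The `LastIOError` strings below are representative examples I made up for illustration, not verbatim server output:

```go
func TestHasFatalError(t *testing.T) {
	// A representative (not verbatim) Last_IO_Error for MySQL error 1236.
	fatal := &ReplicationStatus{
		LastIOError: "Got fatal error 1236 from source when reading data from binary log: 'Could not find first log file name in binary log index file'",
	}
	if !fatal.HasFatalError() {
		t.Errorf("expected HasFatalError() to be true for %q", fatal.LastIOError)
	}

	// A transient I/O error should not be treated as fatal.
	transient := &ReplicationStatus{
		LastIOError: "error reconnecting to source",
	}
	if transient.HasFatalError() {
		t.Errorf("expected HasFatalError() to be false for %q", transient.LastIOError)
	}

	// No error at all.
	healthy := &ReplicationStatus{}
	if healthy.HasFatalError() {
		t.Error("expected HasFatalError() to be false for an empty LastIOError")
	}
}
```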
### Overview of the Issue
In the polling (non-heartbeat) replication reporter, we do not currently handle fatal errors -- the only currently known one being ER_MASTER_FATAL_ERROR_READING_BINLOG (error 1236): https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_master_fatal_error_reading_binlog
This causes us to report a replica as healthy for long periods when it is not healthy and will never become healthy (the exception being if you do a PRS/ERS and the new primary does have the needed binary log events).
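To make the failure mode concrete, here is a simplified, self-contained model of the poller's lag estimate (mirroring the `time.Since(p.timeRecorded) + p.lag` computation shown in the diff above; the 2h `unhealthy_threshold` is an assumed value matching the vttablet default). Once the IO thread dies with error 1236, the estimated lag only grows with wall-clock time, so the tablet keeps reporting healthy until the threshold is crossed even though it can never catch up:

```go
package main

import (
	"fmt"
	"time"
)

// estimatedLag mirrors what the poller reports while replication is not
// running: the last lag it recorded plus the wall-clock time since then.
func estimatedLag(lastLag time.Duration, timeRecorded time.Time) time.Duration {
	return time.Since(timeRecorded) + lastLag
}

func main() {
	unhealthyThreshold := 2 * time.Hour // assumed: the vttablet default

	// Hypothetical scenario: replication hit error 1236 two minutes ago,
	// when the recorded lag was one second.
	timeRecorded := time.Now().Add(-2 * time.Minute)
	lag := estimatedLag(time.Second, timeRecorded)

	// Prints an estimated lag of ~2m1s, well under the threshold, so the
	// replica is still reported as healthy despite being unrecoverable.
	fmt.Printf("estimated lag: %v, unhealthy: %v\n",
		lag.Round(time.Second), lag > unhealthyThreshold)
}
```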
### Reproduction Steps

You will see that the status is reported as healthy until we go past `unhealthy_threshold`:

### Binary Version

### Operating System and Environment details
### Log Fragments
No response