Overview of the Issue

I noticed that one of my tablets in the affected shard was in a weird state: it was supposed to be shutting down, but wasn't being killed by the operator because its replication was broken and it wasn't reporting as ready. That might have something to do with it. Everything is fine now after deleting the bad tablet.

Reproduction Steps

Running a cluster with vitess-operator.

Prior to this, I had updated the resource requests/limits of the tablets, which triggered a planned reparent on this shard. The tablet that ended up in the weird state was the old leader.

Binary version

Log Fragments

The final logs of the bad tablet were this message, repeated every minute:

TabletManager.ReplicationStatus()(on uscentral1c-2987211145 from ) error: no replication status

Before that, the logs were:
(entries newest first)

2021-01-21 09:40:41.193 PST  Info  Shard watch stopped.
2021-01-21 09:40:41.193 PST  Info  Stopping shard watch...
2021-01-21 09:40:41.193 PST  Info  box/60-80/uscentral1c-2987211145 [tablet] updated
2021-01-21 09:40:41.191 PST  Info  Going unhealthy due to replication error: no replication status
2021-01-21 09:40:41.190 PST  Info  Publishing state: alias:<cell:"uscentral1c" uid:2987211145 > hostname:"10.0.68.19" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"box" shard:"60-80" key_range:<start:"`" end:"\200" > type:REPLICA db_name_override:"vt_box" mysql_hostname:"10.0.68.19" mysql_port:3306
2021-01-21 09:40:41.190 PST  Info  State: exiting lameduck
2021-01-21 09:40:41.190 PST  Info  TabletServer transition: MASTER: Serving, Jan 1, 0001 at 00:00:00 (UTC) -> REPLICA: Serving for tablet :box/60-80
2021-01-21 09:40:41.190 PST  Info  Replication Tracker: going into non-master mode
2021-01-21 09:40:41.190 PST  Info  Starting transaction id: {1611171288728593711}
2021-01-21 09:40:41.190 PST  Info  Immediate shutdown: rolling back now.
2021-01-21 09:40:41.190 PST  Info  TxEngine: AcceptReadOnly
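For context, the resource update was of this general shape. This is a hedged sketch, not my actual config: the cluster/keyspace names, part count, and resource values are hypothetical, and the field path follows the vitess-operator v2 VitessCluster examples, so check it against your operator version.

```yaml
apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: box
spec:
  keyspaces:
    - name: box
      partitionings:
        - equal:
            parts: 8
            shardTemplate:
              tabletPools:
                - cell: uscentral1c
                  type: replica
                  replicas: 3
                  vttablet:
                    resources:
                      # Changing these requests/limits makes the operator
                      # roll the tablet pods, which triggers a planned
                      # reparent on any shard whose leader is restarted.
                      requests:
                        cpu: "2"
                        memory: 4Gi
                      limits:
                        memory: 4Gi
```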