vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: uncaught vtctl/backup.go:124 panic during vtctlclient BackupShard (v8.0.0) #7341

Open · jmoldow opened this issue 3 years ago

jmoldow commented 3 years ago

Overview of the Issue

I0121 18:01:36.202788      52 main.go:64] I0121 18:01:36.202145 reparent.go:63] Gathering tablet replication status for: [Tablet{uscentral1c-2921556232} Tablet{uscentral1c-2987211145} Tablet{uscentral1c-1370310901}]
E0121 18:01:36.204098      52 main.go:67] remote error: rpc error: code = Unknown desc = uncaught vtctl panic: runtime error: invalid memory address or nil pointer dereference,
    /usr/local/go/src/runtime/panic.go:199 (0x446bdb);
    /usr/local/go/src/runtime/signal_unix.go:394 (0x446a18);
    /vt/src/vitess.io/vitess/go/vt/vtctl/backup.go:124 (0x1824a0c);
    /vt/src/vitess.io/vitess/go/vt/vtctl/vtctl.go:3240 (0x184b503);
    /vt/src/vitess.io/vitess/go/vt/vtctl/grpcvtctlserver/server.go:77 (0x1a8c4f4);
    /vt/src/vitess.io/vitess/go/vt/proto/vtctlservice/vtctlservice.pb.go:122 (0x1a8bb48);
    /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:1199 (0x8c05bd);
    /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:1279 (0x8c1f0f);
    /go/pkg/mod/google.golang.org/grpc@v1.24.0/server.go:710 (0x8cf59a);
    /usr/local/go/src/runtime/asm_amd64.s:1357 (0x4612f0);
command terminated with exit code 1

Reproduction Steps

Running a cluster with vitess-operator.

$ vtctlclient BackupShard box/60-80

I noticed that one of my tablets in that shard is in a weird state (it's supposed to be shutting down, but the operator isn't killing it because its replication is broken and it isn't reporting as ready), so that might have something to do with it.
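Looking at the trace, the panic comes from the tablet-selection logic in go/vt/vtctl/backup.go. Below is a minimal sketch of the failure pattern I suspect; it is not the actual Vitess source, and the type and function names are made up for illustration. The assumption is that BackupShard collects one replication status per tablet in the shard and then dereferences each status while picking the most caught-up replica, so a tablet whose status call fails (like my broken one) leaves a nil entry that gets dereferenced:

```go
package main

import "fmt"

// replicationStatus stands in for the per-tablet status message;
// the field name mirrors what the proto exposes in v8, but the type
// itself is a stand-in for this sketch.
type replicationStatus struct {
	SecondsBehindMaster uint32
}

// pickBackupCandidate mimics choosing the most caught-up replica to back up.
func pickBackupCandidate(aliases []string, statuses []*replicationStatus) string {
	best := ""
	var bestLag uint32
	for i, stat := range statuses {
		// Suspected bug: stat can be nil for a tablet whose
		// ReplicationStatus RPC failed ("no replication status"),
		// so this dereference panics with a nil pointer error.
		if best == "" || stat.SecondsBehindMaster < bestLag {
			best = aliases[i]
			bestLag = stat.SecondsBehindMaster
		}
	}
	return best
}

func main() {
	aliases := []string{"uscentral1c-2921556232", "uscentral1c-2987211145"}
	// The second tablet is the broken one; its status came back nil.
	statuses := []*replicationStatus{{SecondsBehindMaster: 1}, nil}
	fmt.Println(pickBackupCandidate(aliases, statuses)) // panics like the trace above
}
```

If that guess is right, skipping tablets with a nil status (or surfacing the status error instead of panicking) would avoid the crash.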

Everything is fine now after deleting the bad tablet.
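For anyone hitting the same thing, the manual equivalent with vtctlclient should be roughly the following (alias taken from the logs in this report; I'm not certain whether extra flags are needed for a tablet in this state):

$ vtctlclient DeleteTablet uscentral1c-2987211145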

Prior to this, I had updated the resource requests/limits of the tablets, which had triggered a planned reparent on this shard. The tablet that was in the weird state was the old leader.

Binary version

v8.0.0

Log Fragments

The final logs of the bad tablet were this message, repeated every minute: TabletManager.ReplicationStatus()(on uscentral1c-2987211145 from ) error: no replication status. Before that, the logs were:

2021-01-21 09:40:41.193 PST  Info  "Shard watch stopped."
2021-01-21 09:40:41.193 PST  Info  "Stopping shard watch..."
2021-01-21 09:40:41.193 PST  Info  "box/60-80/uscentral1c-2987211145 [tablet] updated"
2021-01-21 09:40:41.191 PST  Info  "Going unhealthy due to replication error: no replication status"
2021-01-21 09:40:41.190 PST  Info  "Publishing state: alias:<cell:"uscentral1c" uid:2987211145 > hostname:"10.0.68.19" port_map:<key:"grpc" value:15999 > port_map:<key:"vt" value:15000 > keyspace:"box" shard:"60-80" key_range:<start:"`" end:"\200" > type:REPLICA db_name_override:"vt_box" mysql_hostname:"10.0.68.19" mysql_port:3306 "
2021-01-21 09:40:41.190 PST  Info  "State: exiting lameduck"
2021-01-21 09:40:41.190 PST  Info  "TabletServer transition: MASTER: Serving, Jan 1, 0001 at 00:00:00 (UTC) -> REPLICA: Serving for tablet :box/60-80"
2021-01-21 09:40:41.190 PST  Info  "Replication Tracker: going into non-master mode"
2021-01-21 09:40:41.190 PST  Info  "Starting transaction id: {1611171288728593711}"
2021-01-21 09:40:41.190 PST  Info  "Immediate shutdown: rolling back now."
2021-01-21 09:40:41.190 PST  "TxEngine: AcceptReadOnly"
jmoldow commented 3 years ago

This is the only time this has happened; backups have been really easy and error-free to perform in general.

And as noted, the problem was resolved after I deleted the broken tablet in that shard.