vtctlclient fails to run the backupShard command when there is a bad replica in the shard (chance: 1/(n-1), where n is the number of vttablets in the shard).
Once this happens, every subsequent backupShard run also fails, so no backup can be taken until someone intervenes and fixes the bad replica. This puts production data at risk of loss.
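The 1/(n-1) figure can be checked with a small sketch. It assumes (this is my reading of the estimate, not confirmed Vitess behavior) that the shard has one primary, that the backup candidate is picked uniformly at random from the remaining n-1 tablets, and that the command fails whenever a bad replica is picked. The function name `failureChance` is hypothetical.

```go
package main

import "fmt"

// failureChance returns the probability that a uniformly random pick among
// the non-primary tablets lands on a bad replica.
// n is the total number of vttablets in the shard (one primary is assumed),
// bad is the number of bad replicas. Both the uniform pick and the single
// primary are assumptions made for illustration.
func failureChance(n, bad int) float64 {
	candidates := n - 1 // every tablet except the primary
	if candidates <= 0 || bad <= 0 {
		return 0
	}
	if bad > candidates {
		bad = candidates
	}
	return float64(bad) / float64(candidates)
}

func main() {
	// One bad replica in shards of different sizes.
	for _, n := range []int{2, 3, 5} {
		fmt.Printf("n=%d: chance=%.2f\n", n, failureChance(n, 1))
	}
}
```

Under these assumptions, a shard with three vttablets and one bad replica fails half the time, and the chance rises to 1 when all replicas are bad.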
I want to re-raise this issue and will try to put up a PR for a potential fix.
Reproduction Steps
This can be consistently reproduced in K8s by putting one (or all) of the replicas into a not-ready state. For example, when a replica is Pending or Dead, its status is not_serving.
The failure can be observed in the vtctld log, which shows vtctld trying to connect to a bad vttablet.
Binary Version
This issue should be present in all Vitess versions.
Operating System and Environment details
Vitess deployed in K8s; this is not a version-specific issue.
Overview of the Issue
There was an earlier report of this problem, https://github.com/vitessio/vitess/issues/8908, which identified ShardReplicationStatuses as the cause. It was closed with https://github.com/vitessio/vitess/pull/8966, but that change does not actually fix the problem.
Log Fragments
No response