Bug Report: vtctlclient fails to run backupShard command when replica(s) are not_serving

jwangace commented 9 months ago

Overview of the Issue

vtctlclient fails to run the backupShard command when there is a bad replica in the shard (chance: 1/(n-1), where n is the number of vttablets in the shard).

Once this happens and backupShard fails, every subsequent backupShard run also fails, and no backup can be taken until a human intervenes and fixes the bad replica. This puts production data at risk of loss.
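
A minimal sketch of the failure mode and one possible guard, not the actual Vitess code path. It assumes the backup tablet is chosen from the n-1 non-primary tablets, so a single bad replica is hit with probability 1/(n-1) under a uniform pick. The `Tablet` type, its fields, and both helper functions are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Tablet is a hypothetical, simplified stand-in for a vttablet record.
type Tablet struct {
	Alias   string
	Primary bool
	Serving bool
}

// backupCandidates returns the tablets worth considering for a backup.
// The primary and any not-serving tablet are skipped.
func backupCandidates(tablets []Tablet) []Tablet {
	var out []Tablet
	for _, t := range tablets {
		if t.Primary || !t.Serving {
			continue
		}
		out = append(out, t)
	}
	return out
}

// pickBackupTablet returns the first serving replica, or an error when
// no serving replica is available.
func pickBackupTablet(tablets []Tablet) (Tablet, error) {
	candidates := backupCandidates(tablets)
	if len(candidates) == 0 {
		return Tablet{}, errors.New("no serving replica available for backup")
	}
	return candidates[0], nil
}

func main() {
	// Example shard: one primary, one not_serving replica, one healthy replica.
	shard := []Tablet{
		{Alias: "zone1-100", Primary: true, Serving: true},
		{Alias: "zone1-101", Serving: false},
		{Alias: "zone1-102", Serving: true},
	}
	t, err := pickBackupTablet(shard)
	if err != nil {
		fmt.Println("backup skipped:", err)
		return
	}
	fmt.Println("backing up from", t.Alias)
}
```

With the filter in place, the not_serving replica is never selected, so a single bad replica no longer blocks every backup attempt.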

This was reported previously in https://github.com/vitessio/vitess/issues/8908, which identified ShardReplicationStatuses as the cause. That issue was closed with https://github.com/vitessio/vitess/pull/8966, but that change does not actually fix the problem.

I want to re-raise this issue and will try to put up a PR for a potential fix.

Reproduction Steps

This can be consistently reproduced in K8s by putting one or all of the replicas into a not-ready state. For example, when a replica is Pending or Dead, the status of that replica is not_serving.

This can be observed in the vtctld log, which shows vtctld trying to connect to a bad vttablet.

Binary Version

This issue should be present in all Vitess versions.

Operating System and Environment details

Vitess deployed in K8s; this is not a version-specific issue.

Log Fragments

No response

shlomi-noach commented 9 months ago

Addressed by https://github.com/vitessio/vitess/pull/14604

shlomi-noach commented 9 months ago

Fixed by https://github.com/vitessio/vitess/pull/14604
