pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

BR backup could raise error when PD leader changed during BR initialization #5630

Open matchge-ca opened 7 months ago

matchge-ca commented 7 months ago

Bug Report

What version of Kubernetes are you using?

What version of TiDB Operator are you using?

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

What's the status of the TiDB cluster pods?

What did you do?

  1. Follow any official document to back up a cluster using a CR (for example, https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237)
  2. During BR initialization, switch the PD leader to a different pod or take the PD leader offline
  3. The BR job raises the following error: error="pd address not available, ..., dial tcp: lookup <pd addr>: no such host, please check network
  4. This is most likely because, when BR is executed through the operator, only the PD leader address is used to discover the PD cluster member list. TiUP BR allows multiple PD addresses on the command line to guard against a single PD failure during discovery; maybe the operator should consider this as well. Code ref: https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237
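To illustrate the suggestion in step 4, here is a minimal sketch of building a `--pd` argument that lists every PD member instead of a single address. It is not the operator's code. The helper names `pdMemberAddrs` and `pdFlag` are hypothetical, the per-pod DNS pattern `<cluster>-pd-<ordinal>.<cluster>-pd-peer.<namespace>` is an assumption about how the PD pods are exposed, and it assumes BR accepts a comma-separated list for `--pd`, as the TiUP usage mentioned above suggests.

```go
package main

import (
	"fmt"
	"strings"
)

// pdMemberAddrs builds one address per PD pod.
// ASSUMPTION: pods are named <cluster>-pd-<ordinal> and are reachable through
// a headless peer service named <cluster>-pd-peer; verify this against the
// actual deployment before relying on it.
func pdMemberAddrs(cluster, namespace string, replicas, port int) []string {
	addrs := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		addrs = append(addrs, fmt.Sprintf("%s-pd-%d.%s-pd-peer.%s:%d",
			cluster, i, cluster, namespace, port))
	}
	return addrs
}

// pdFlag joins the addresses into a single comma-separated --pd argument.
// ASSUMPTION: the BR command line accepts several PD addresses in this form.
func pdFlag(addrs []string) string {
	return "--pd=" + strings.Join(addrs, ",")
}

func main() {
	// "basic" and "tidb" are placeholder cluster and namespace names.
	fmt.Println(pdFlag(pdMemberAddrs("basic", "tidb", 3, 2379)))
}
```

With several addresses listed, a lookup failure for one member would not by itself prevent discovery through the others, which is the behaviour the report asks for.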

What did you expect to see? BR is able to run when PD leader is offline during discovery

What did you see instead? BR failed and raised an error

csuzhangxc commented 7 months ago

fmt.Sprintf("--pd=%s-pd.%s:%d", backup.Spec.BR.Cluster, clusterNamespace, v1alpha1.DefaultPDClientPort) is a K8s service with all PD members as the backend.

It should resolve to other PD members in different DNS lookup calls.

kennytm commented 7 months ago

@csuzhangxc what we actually see in the log is a DNS lookup error from CDC:

pd address (cluster-pd.namespace:2379) not available, error is

Get "https://cluster-pd.namespace:2379/pd/api/v1/config/cluster-version": dial tcp: lookup cluster-pd.namespace on 100.64.0.10:53: no such host,

please check network: [BR:PD:ErrPDUpdateFailed]failed to update PD

is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?

csuzhangxc commented 7 months ago

@kennytm

> is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?

No. A DNS resolution failure like this is usually caused by the PD pod being down (or by KubeDNS having problems).