pingcap / tidb-operator

TiDB Operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

unable to start TiKV due to DNS resolution #5146

Open vanhtuan0409 opened 1 year ago

vanhtuan0409 commented 1 year ago

Bug Report

What version of Kubernetes are you using?

1.27.3 under k3s distribution

What version of TiDB Operator are you using?

1.4.5

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

local-path

What's the status of the TiDB cluster pods?

Running

What did you do?

What did you expect to see?

TiKV starts successfully and connects to the PD server.

What did you see instead?

TiKV is unable to start and emits the following logs. Although running curl against PD directly from within the TiKV pod succeeds, TiKV itself cannot connect to the PD server.

[2023/07/05 04:21:36.755 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:38.756 +00:00] [INFO] [util.rs:560] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:39.057 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:41.058 +00:00] [INFO] [util.rs:560] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:41.359 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]

My hypothesis is that the single-label DNS name (tidb-pd) somehow cannot be resolved. I edited the TiKV ConfigMap to use ${CLUSTER_NAME}-pd.${NAMESPACE}.svc:2379 instead, and TiKV then connected successfully, but the change was later reverted by the operator.
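One way to probe this hypothesis from inside the TiKV pod is to compare how the short and the qualified service names resolve. The following is only a minimal sketch: the cluster name "tidb" is inferred from the endpoint in the logs, the namespace "default" is an assumption, and the helper names are hypothetical. It goes through the system resolver (getaddrinfo), the same path curl uses, so it would not necessarily reproduce a failure that is specific to the resolver built into TiKV's gRPC library.

```python
import socket


def candidate_names(cluster_name, namespace):
    """Return the PD service name in increasingly qualified forms."""
    short = f"{cluster_name}-pd"
    return [short, f"{short}.{namespace}", f"{short}.{namespace}.svc"]


def try_resolve(host, port=2379):
    """Return the sorted unique IPs for host, or an error string on failure."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"failed: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "tidb" and "default" are assumed values; substitute your own.
    for name in candidate_names("tidb", "default"):
        print(name, "->", try_resolve(name))
```

If all three names resolve here while TiKV still times out, that would suggest the problem is in how TiKV's gRPC client resolves names rather than in the cluster DNS itself.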

Proposed fix: https://github.com/pingcap/tidb-operator/pull/5145

csuzhangxc commented 1 year ago

The tidb-cluster Helm chart is deprecated, and we now recommend using the TidbCluster CRD instead.

vanhtuan0409 commented 1 year ago

The TidbCluster CRD also suffers from this issue: the operator creates a ConfigMap for the TiKV startup scripts. I am willing to contribute a fix. Could you point me to the code where the operator creates the startup script ConfigMap?

csuzhangxc commented 1 year ago

For the connectivity issue, could you add the following environment variable for TiKV?

env:
  - name: GRPC_DNS_RESOLVER
    value: native

Currently, we have two versions of the start scripts for the CRD:

https://github.com/pingcap/tidb-operator/blob/3279ab51394c0e18638b6c7b1da7ac5b5a67d5bd/pkg/manager/member/startscript/v1/template.go#L251

https://github.com/pingcap/tidb-operator/blob/3279ab51394c0e18638b6c7b1da7ac5b5a67d5bd/pkg/manager/member/startscript/v2/tikv_start_script.go#L92

However, if we change the start script directly, the ConfigMap will be updated when TiDB Operator is upgraded, and all existing clusters will be restarted.