pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiKV keep connecting to pd with 4-DEADLINE_EXCEEDED error #4544

Open VoxT opened 2 years ago

VoxT commented 2 years ago

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

TiDB Operator Version: version.Info{GitVersion:"v1.3.3", GitCommit:"8949b33a38396c417331859fa00c670d1a627388", GitTreeState:"clean", BuildDate:"2022-05-17T02:47:18Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  95d

NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pd-chaos-pd-0       Bound    pvc-3b2fa30e-7c21-4411-b893-299d40912613   10Gi       RWO            standard       21m
pd-chaos-pd-1       Bound    pvc-b921799f-57d8-4998-8287-dcd0c10329cf   10Gi       RWO            standard       21m
pd-chaos-pd-2       Bound    pvc-2e51d966-07b4-4ac5-afe2-22d58d77c0fc   10Gi       RWO            standard       21m
tikv-chaos-tikv-0   Bound    pvc-347784a9-7455-4ae3-be88-dea53badaadb   100Gi      RWO            standard       20m
tikv-chaos-tikv-1   Bound    pvc-279d9124-72a7-49cf-bf64-aee1ed8419e0   100Gi      RWO            standard       20m
tikv-chaos-tikv-2   Bound    pvc-44af264d-6ec4-4a75-95c5-cb44e63bd666   100Gi      RWO            standard       20m

What's the status of the TiDB cluster pods?

NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
chaos-discovery-7544f577cb-67g2t   1/1     Running   0          22m     10.244.0.73   kind-control-plane   <none>           <none>
chaos-pd-0                         1/1     Running   1          9m45s   10.244.0.91   kind-control-plane   <none>           <none>
chaos-pd-1                         1/1     Running   1          11m     10.244.0.90   kind-control-plane   <none>           <none>
chaos-pd-2                         1/1     Running   1          12m     10.244.0.89   kind-control-plane   <none>           <none>
chaos-tidb-0                       2/2     Running   0          20m     10.244.0.88   kind-control-plane   <none>           <none>
chaos-tidb-1                       2/2     Running   0          20m     10.244.0.86   kind-control-plane   <none>           <none>
chaos-tidb-2                       2/2     Running   0          20m     10.244.0.87   kind-control-plane   <none>           <none>
chaos-tikv-0                       1/1     Running   0          21m     10.244.0.84   kind-control-plane   <none>           <none>
chaos-tikv-1                       1/1     Running   0          21m     10.244.0.85   kind-control-plane   <none>           <none>
chaos-tikv-2                       1/1     Running   0          9m5s    10.244.0.92   kind-control-plane   <none>           <none>

What did you do?

I tried to deploy TiDB v5.4.1 following this instruction: https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-tidb-operator

but TiKV could not start; the error in the TiKV pod log was:

[2022/06/08 09:19:49.098 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:51.099 +00:00] [INFO] [util.rs:506] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:51.400 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:53.401 +00:00] [INFO] [util.rs:506] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:53.401 +00:00] [WARN] [client.rs:153] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:539]: PD cluster failed to respond\")"]

I had prechecked all the prerequisites, then destroyed the cluster and retried the deployment, but I still got the same error. I then deployed TiDB v3.0.12 and it started successfully; but guess what? When I upgraded it to v5.4.1, the same error occurred.

What did you expect to see? The cluster should start successfully when following the instructions.

What did you see instead? An error on TiKV when trying to connect to the PD server.
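A few checks that can help narrow this down. These are a hypothetical diagnostic sketch, not commands from the report: the pod and service names are taken from the pod list above, the namespace is assumed to be the default one, and the `pd-ctl` path inside the `pingcap/pd` image is an assumption.

```shell
# 1. Verify the PD client service resolves in-cluster and answers its HTTP
#    health endpoint (curl image used because the TiKV image may lack curl):
kubectl run -it --rm pd-check --image=curlimages/curl --restart=Never -- \
  curl -s http://chaos-pd:2379/pd/api/v1/health

# 2. Ask PD itself whether it has elected a leader
#    (assumes pd-ctl is bundled in the PD image at /pd-ctl):
kubectl exec chaos-pd-0 -- /pd-ctl -u http://127.0.0.1:2379 health

# 3. Look for election/quorum errors in the PD logs:
kubectl logs chaos-pd-0 | grep -iE "leader|campaign|error" | tail -n 20
```

If the HTTP health endpoint answers but the gRPC calls from TiKV still time out with DEADLINE_EXCEEDED, the problem is more likely in the gRPC path (DNS, network policy, MTU) than in PD itself.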

VoxT commented 2 years ago

@weekface could you take a look at this?

mikechengwei commented 2 years ago

Hi, do you still have this issue? @VoxT

DanielZhangQD commented 2 years ago

@VoxT Could you please share your TidbCluster (TC) YAML?

gokalper commented 2 years ago

I have the same error in a kind cluster as well. I tried v5.3.0/v6.0.0/v6.1.0 using both the advanced cluster config YAML and the basic config YAML from the examples folder. No luck; same issue:

advanced-tidb-tikv-0 starting tikv-server ...
advanced-tidb-tikv-0 /tikv-server --pd=http://advanced-tidb-pd:2379 --advertise-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.db.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.db.svc:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml
advanced-tidb-tikv-0 
advanced-tidb-tikv-0 deprecated configuration, log-level has been moved to log.level
advanced-tidb-tikv-0 override log.level with log-level, Debug
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:79] ["Welcome to TiKV"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Release Version:   6.1.0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Edition:           Community"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Git Commit Hash:   080d086832ae5ce2495352dccaf8df5d40f30687"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Git Commit Branch: heads/refs/tags/v6.1.0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["UTC Build Time:    Unknown (env var does not exist when building)"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Enable Features:   jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Profile:           dist_release"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [mod.rs:74] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={22, 5, 20, 17, 0, 16, 11, 18, 1, 7, 10, 2, 3, 8, 13, 12, 21, 23, 15, 4, 6, 14, 19, 9}"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [INFO] [mod.rs:81] ["memory limit in bytes: 69030949888, cpu cores quota: 24"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters net.core.somaxconn got 4096, expect 32768"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters net.ipv4.tcp_syncookies got 1, expect 0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters vm.swappiness got 60, expect 0"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["Using polling engine: epollex"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"grpclb\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"priority_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"weighted_target_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"pick_first\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"round_robin\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"ring_hash_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["Using ares dns resolver"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering certificate provider factory for \"file_watcher\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"cds_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_impl_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_resolver_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_manager_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.007 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.009 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.009 +00:00] [WARN] [client.rs:163] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:570]: PD cluster failed to respond\")"]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.311 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:04.313 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:04.613 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:06.615 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:06.915 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:08.917 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:09.217 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
noucas commented 2 years ago

Not sure if this is the right place to comment, but I think this issue affects not only tidb-operator but other deployment methods as well. I also hit this issue running v5.4.2/v6.1.0 on my machine.

For deeper context:

- Environment: CentOS 7
- TiDB/TiKV/PD version: v5.4.2/v6.1.0
- Architecture: PD pod (run by kind with image v5.4.2) and TiKV pods (run by kind with image v5.4.2)

[2022/11/14 16:03:06.108 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:08.109 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:08.410 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:10.411 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:10.712 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:12.713 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:13.014 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:15.015 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:15.316 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:17.317 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]

PD config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pd0
spec:
  selector:
    matchLabels:
      name: pd0
  template:
    metadata:
      labels:
        name: pd0
    spec:
      containers:
        - name: pd0
          image: pingcap/pd:v5.4.2
          args:
            - --name=pd0
            - --client-urls=http://0.0.0.0:2379
            - --peer-urls=http://0.0.0.0:2380
            - --advertise-client-urls=http://pd0:2379
            - --advertise-peer-urls=http://pd0:2380
            - --initial-cluster=pd0=http://pd0:2380
            - --data-dir=/data/pd0
            - --log-file=/logs/pd0.log
          ports:
            - containerPort: 2379

TiKV config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tikv0
spec:
  selector:
    matchLabels:
      name: tikv0
  template:
    metadata:
      labels:
        name: tikv0
    spec:
      containers:
        - name: tikv0
          image: pingcap/tikv:v5.4.2
          args:
            - --addr=0.0.0.0:20160
            - --advertise-addr=tikv0:20160
            - --data-dir=/data/tikv0
            - --pd=pd0:2379
            - --log-file=/logs/tikv0.log
          ports:
            - containerPort: 20160

The strangest thing here is that, on the same machine, when I switch from v5.4.2 to v3.0.12, the connection goes through and my application works fine.
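One thing worth checking in the manifests above: the containers address each other by the bare hostnames `pd0` and `tikv0`, which only resolve inside the cluster if matching Service objects exist. If those Services are defined elsewhere this is a non-issue, but as a sketch, a minimal PD Service matching the Deployment labels shown above would look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pd0          # must match the hostname used in --pd and --advertise-* flags
spec:
  selector:
    name: pd0        # matches the pod label from the PD Deployment above
  ports:
    - name: client
      port: 2379
      targetPort: 2379
    - name: peer
      port: 2380
      targetPort: 2380
```

Note that even with the Service present, `--advertise-client-urls=http://pd0:2379` only works for clients that share the cluster's DNS; a client outside the cluster resolving `pd0` differently (or not at all) would see exactly this kind of timeout.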

noucas commented 2 years ago

@mikechengwei @DanielZhangQD PTAL.