pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

missing tikv and tidb statefulset definitions #4667

Closed davidp1404 closed 1 year ago

davidp1404 commented 2 years ago

Hello, I've deployed TiDB Operator with this script, following the documentation, on a Kubernetes v1.22.1 cluster:

#!/bin/bash
CRD_VERSION="v1.3.6"
CHART_VERSION="v1.3.6"
NS="tidb-admin"
kubectl create -f https://raw.githubusercontent.com/pingcap/tidb-operator/${CRD_VERSION}/manifests/crd.yaml
helm repo add pingcap https://charts.pingcap.org/
helm upgrade --install --namespace ${NS} tidb-operator pingcap/tidb-operator --version ${CHART_VERSION}

Everything works fine and pods are running:

$ k -n tidb-admin get pod
NAME                                       READY   STATUS    RESTARTS   AGE
tidb-controller-manager-6578f48796-ngvhr   1/1     Running   0          25h
tidb-scheduler-854dc7f69f-gg7dg            2/2     Running   0          25h

Then I tried to deploy the sample TiDB cluster in my app namespace:

kubectl -n myapp apply -f https://raw.githubusercontent.com/pingcap/tidb-operator/master/examples/basic/tidb-cluster.yaml

But the TiDB cluster never becomes ready, and I was unable to find any error. I only noticed that, for some reason, the "basic-tikv" StatefulSet is never created.

$ k get tidbclusters.pingcap.com basic
NAME    READY   PD    STORAGE   READY   DESIRE   TIKV   STORAGE   READY   DESIRE   TIDB   READY   DESIRE   AGE
basic   False         1Gi       1       1               1Gi               1                       1        5m20s
$ k get pod -l app.kubernetes.io/managed-by=tidb-operator
NAME                               READY   STATUS    RESTARTS   AGE
basic-discovery-6dff9bd7bf-6495n   1/1     Running   0          6m28s
basic-pd-0                         1/1     Running   0          6m28s
$ k get sts -l app.kubernetes.io/managed-by=tidb-operator
NAME       READY   AGE
basic-pd   1/1     6m49s
$ k get pvc -l app.kubernetes.io/managed-by=tidb-operator
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
pd-basic-pd-0      Bound    pvc-99456a7d-4236-473d-b34e-57b2aa87f6d1   1Gi        RWO           block-sc            44m

The controller is only reporting errors like this:

E0804 16:26:05.774418       1 tidb_cluster_controller.go:126] TidbCluster: myapp/basic, sync failed TidbCluster: myapp/basic .Status.PD.Synced = false, can't failover, requeuing
E0804 16:26:35.762262       1 pd_member_manager.go:190] failed to sync TidbCluster: [myapp/basic]'s status, error: Get http://basic-pd.myapp:2379/pd/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I've removed our ResourceQuota/LimitRange definitions, but it still doesn't work. It also fails with the standard psp:privileged enabled in the namespace. I'd appreciate your help in finding out what is wrong with our setup. It is a stable development Kubernetes cluster that has been running for years without any problems identified so far.

davidp1404 commented 2 years ago

More logs just in case they help:

$ k logs basic-pd-0
Server:         169.254.25.10
Address:        169.254.25.10#53

Name:   basic-pd-0.basic-pd-peer.myapp.svc.cluster.local
Address: 10.240.113.2

nslookup domain basic-pd-0.basic-pd-peer.myapp.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=basic-pd-0 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-0.basic-pd-peer.myapp.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-0.basic-pd-peer.myapp.svc:2379 --config=/etc/pd/pd.toml --initial-cluster=basic-pd-0=http://basic-pd-0.basic-pd-peer.myapp.svc:2380
[2022/08/04 16:15:59.786 +00:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2022/08/04 16:15:59.787 +00:00] [INFO] [util.go:43] [PD] [release-version=v5.4.1]
[2022/08/04 16:15:59.787 +00:00] [INFO] [util.go:44] [PD] [edition=Community]
[2022/08/04 16:15:59.787 +00:00] [INFO] [util.go:45] [PD] [git-hash=18098e99e2eacc0b327710742516910dc75059a4]
[2022/08/04 16:15:59.787 +00:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v5.4.1]
[2022/08/04 16:15:59.787 +00:00] [INFO] [util.go:47] [PD] [utc-build-time="2022-04-28 02:07:55"]
[2022/08/04 16:15:59.787 +00:00] [INFO] [metricutil.go:82] ["disable Prometheus push client"]
[2022/08/04 16:15:59.787 +00:00] [INFO] [server.go:228] ["PD Config"] [config="{\"client-urls\":\"http://0.0.0.0:2379\",\"peer-urls\":\"http://0.0.0.0:2380\",\"advertise-client-urls\":\"http://basic-pd-0.basic-pd-peer.myapp.svc:2379\",\"advertise-peer-urls\":\"http://basic-pd-0.basic-pd-peer.myapp.svc:2380\",\"name\":\"basic-pd-0\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"enable-grpc-gateway\":true,\"initial-cluster\":\"basic-pd-0=http://basic-pd-0.basic-pd-peer.myapp.svc:2380\",\"initial-cluster-state\":\"new\",\"initial-cluster-token\":\"pd-cluster\",\"join\":\"\",\"lease\":3,\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null},\"tso-save-interval\":\"3s\",\"tso-update-physical-interval\":\"50ms\",\"enable-local-tso\":false,\"metric\":{\"job\":\"basic-pd-0\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":64,\"max-pending-peer-count\":64,\"max-merge-region-size\":20,\"max-merge-region-keys\":200000,\"split-merge-interval\":\"1h0m0s\",\"enable-one-way-merge\":\"false\",\"enable-cross-table-merge\":\"true\",\"patrol-region-interval\":\"10ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"leader-schedule-policy\":\"count\",\"region-schedule-limit\":2048,\"replica-schedule-limit\":64,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":4,\"hot-region-cache-hits-threshold\":3,\"store-limit\":{},\"tolerant-size-ratio\":0,\"low-space-ratio\":0.8,\"high-space-ratio\":0.7,\"region-score-formula-version\":\"v2\",\"scheduler-max-waiting-operator\":5,\"enable-remove-down-replica\":\"true\",\"enable-replace-offline-replica\":\"true\",\"enable-make-up-replica\":\"true\",\"enable-remove-extra-replica\":\"true\",\"enable-location-replacement\":\"true\",\"enable-debug-metrics\":\"false\",\"en
able-joint-consensus\":\"true\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"hot-region\",\"args\":null,\"disable\":false,\"args-payload\":\"\"}],\"schedulers-payload\":null,\"store-limit-mode\":\"manual\",\"hot-regions-write-interval\":\"10m0s\",\"hot-regions-reserved-days\":7},\"replication\":{\"max-replicas\":3,\"location-labels\":\"\",\"strictly-match-label\":\"false\",\"enable-placement-rules\":\"true\",\"enable-placement-rules-cache\":\"false\",\"isolation-level\":\"\"},\"pd-server\":{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"auto\",\"trace-region-flow\":\"true\",\"flow-round-by-digit\":3},\"cluster-version\":\"0.0.0\",\"labels\":{},\"quota-backend-bytes\":\"8GiB\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"max-request-bytes\":1572864,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null,\"SSLCABytes\":null,\"SSLCertBytes\":null,\"SSLKEYBytes\":null,\"redact-info-log\":false,\"encryption\":{\"data-encryption-method\":\"plaintext\",\"data-key-rotation-period\":\"168h0m0s\",\"master-key\":{\"type\":\"plaintext\",\"key-id\":\"\",\"region\":\"\",\"endpoint\":\"\",\"path\":\"\"}}},\"label-property\":null,\"WarningMsgs\":null,\"DisableStrictReconfigCheck\":false,\"HeartbeatStreamBindInterval\":\"1m0s\",\"LeaderPriorityCheckInterval\":\"1m0s\",\"dashboard\":{\"tidb-cacert-path\":\"\",\"tidb-cert-path\":\"\",\"tidb-key-path\":\"\",\"public-path-prefix\":\"\",\"internal-proxy\":false,\"enable-telemetry\":true,\"enable-experimental\":false},\"replication-mode\":{\"replication-mode\":\"majority\",\"dr-auto-sync\":{\"label-key\":\"\",\"primary\":\"\",\"d
r\":\"\",\"primary-replicas\":0,\"dr-replicas\":0,\"wait-store-timeout\":\"1m0s\",\"wait-sync-timeout\":\"1m0s\",\"wait-async-timeout\":\"2m0s\"}}}"]
[2022/08/04 16:15:59.791 +00:00] [INFO] [server.go:201] ["register REST path"] [path=/pd/api/v1]
[2022/08/04 16:15:59.791 +00:00] [INFO] [server.go:201] ["register REST path"] [path=/swagger/]
[2022/08/04 16:15:59.791 +00:00] [INFO] [server.go:201] ["register REST path"] [path=/autoscaling]
[2022/08/04 16:15:59.791 +00:00] [INFO] [distro.go:51] ["Using distribution strings"] [strings={}]
[2022/08/04 16:15:59.794 +00:00] [INFO] [server.go:201] ["register REST path"] [path=/dashboard/api/]
[2022/08/04 16:15:59.794 +00:00] [INFO] [server.go:201] ["register REST path"] [path=/dashboard/]
[2022/08/04 16:15:59.794 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2022/08/04 16:15:59.794 +00:00] [INFO] [systimemon.go:28] ["start system time monitor"]
[2022/08/04 16:15:59.798 +00:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2022/08/04 16:15:59.798 +00:00] [INFO] [etcd.go:602] ["pprof is enabled"] [path=/debug/pprof]
[2022/08/04 16:15:59.798 +00:00] [INFO] [etcd.go:299] ["starting an etcd server"] [etcd-version=3.4.3] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.16.4] [go-os=linux] [go-arch=amd64] [max-cpu-set=32] [max-cpu-available=32] [member-initialized=false] [name=basic-pd-0] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://basic-pd-0.basic-pd-peer.myapp.svc:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://basic-pd-0.basic-pd-peer.myapp.svc:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster="basic-pd-0=http://basic-pd-0.basic-pd-peer.myapp.svc:2380"] [initial-cluster-state=new] [initial-cluster-token=pd-cluster] [quota-size-bytes=8589934592] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2022/08/04 16:15:59.812 +00:00] [INFO] [backend.go:79] ["opened backend db"] [path=/var/lib/pd/member/snap/db] [took=3.83496ms]
[2022/08/04 16:15:59.815 +00:00] [INFO] [netutil.go:112] ["resolved URL Host"] [url=http://basic-pd-0.basic-pd-peer.myapp.svc:2380] [host=basic-pd-0.basic-pd-peer.myapp.svc:2380] [resolved-addr=10.240.113.2:2380]
[2022/08/04 16:15:59.856 +00:00] [INFO] [netutil.go:112] ["resolved URL Host"] [url=http://basic-pd-0.basic-pd-peer.myapp.svc:2380] [host=basic-pd-0.basic-pd-peer.myapp.svc:2380] [resolved-addr=10.240.113.2:2380]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:456] ["starting local member"] [local-member-id=9e69222b33e5d7b5] [cluster-id=ac2544314c90b8b1]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:1530] ["9e69222b33e5d7b5 switched to configuration voters=()"]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:700] ["9e69222b33e5d7b5 became follower at term 0"]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:383] ["newRaft 9e69222b33e5d7b5 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:700] ["9e69222b33e5d7b5 became follower at term 1"]
[2022/08/04 16:15:59.871 +00:00] [INFO] [raft.go:1530] ["9e69222b33e5d7b5 switched to configuration voters=(11414692299496871861)"]
[2022/08/04 16:15:59.882 +00:00] [WARN] [store.go:1317] ["simple token is not cryptographically signed"]
[2022/08/04 16:15:59.884 +00:00] [INFO] [quota.go:126] ["enabled backend quota"] [quota-name=v3-applier] [quota-size-bytes=8589934592] [quota-size="8.6 GB"]
[2022/08/04 16:15:59.888 +00:00] [INFO] [server.go:792] ["starting etcd server"] [local-member-id=9e69222b33e5d7b5] [local-server-version=3.4.3] [cluster-version=to_be_decided]
[2022/08/04 16:15:59.888 +00:00] [INFO] [server.go:658] ["started as single-node; fast-forwarding election ticks"] [local-member-id=9e69222b33e5d7b5] [forward-ticks=5] [forward-duration=2.5s] [election-ticks=6] [election-timeout=3s]
[2022/08/04 16:15:59.889 +00:00] [INFO] [raft.go:1530] ["9e69222b33e5d7b5 switched to configuration voters=(11414692299496871861)"]
[2022/08/04 16:15:59.889 +00:00] [INFO] [cluster.go:392] ["added member"] [cluster-id=ac2544314c90b8b1] [local-member-id=9e69222b33e5d7b5] [added-peer-id=9e69222b33e5d7b5] [added-peer-peer-urls="[http://basic-pd-0.basic-pd-peer.myapp.svc:2380]"]
[2022/08/04 16:15:59.890 +00:00] [INFO] [etcd.go:241] ["now serving peer/client/metrics"] [local-member-id=9e69222b33e5d7b5] [initial-advertise-peer-urls="[http://basic-pd-0.basic-pd-peer.myapp.svc:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://basic-pd-0.basic-pd-peer.myapp.svc:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"]
[2022/08/04 16:15:59.890 +00:00] [INFO] [etcd.go:576] ["serving peer traffic"] [address=0.0.0.0:2380]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:923] ["9e69222b33e5d7b5 is starting a new election at term 1"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:729] ["9e69222b33e5d7b5 became pre-candidate at term 1"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:824] ["9e69222b33e5d7b5 received MsgPreVoteResp from 9e69222b33e5d7b5 at term 1"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:713] ["9e69222b33e5d7b5 became candidate at term 2"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:824] ["9e69222b33e5d7b5 received MsgVoteResp from 9e69222b33e5d7b5 at term 2"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [raft.go:765] ["9e69222b33e5d7b5 became leader at term 2"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [node.go:325] ["raft.node: 9e69222b33e5d7b5 elected leader 9e69222b33e5d7b5 at term 2"]
[2022/08/04 16:16:01.375 +00:00] [INFO] [server.go:2501] ["setting up initial cluster version"] [cluster-version=3.4]
[2022/08/04 16:16:01.376 +00:00] [INFO] [server.go:2016] ["published local member to cluster through raft"] [local-member-id=9e69222b33e5d7b5] [local-member-attributes="{Name:basic-pd-0 ClientURLs:[http://basic-pd-0.basic-pd-peer.myapp.svc:2379]}"] [request-path=/0/members/9e69222b33e5d7b5/attributes] [cluster-id=ac2544314c90b8b1] [publish-timeout=11s]
[2022/08/04 16:16:01.378 +00:00] [INFO] [serve.go:139] ["serving client traffic insecurely; this is strongly discouraged!"] [address=0.0.0.0:2379]
[2022/08/04 16:16:01.379 +00:00] [INFO] [cluster.go:558] ["set initial cluster version"] [cluster-id=ac2544314c90b8b1] [local-member-id=9e69222b33e5d7b5] [cluster-version=3.4]
[2022/08/04 16:16:01.379 +00:00] [INFO] [capability.go:76] ["enabled capabilities for version"] [cluster-version=3.4]
[2022/08/04 16:16:01.379 +00:00] [INFO] [server.go:2533] ["cluster version is updated"] [cluster-version=3.4]
[2022/08/04 16:16:01.398 +00:00] [INFO] [server.go:303] ["create etcd v3 client"] [endpoints="[http://basic-pd-0.basic-pd-peer.myapp.svc:2379]"] [cert="{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null,\"SSLCABytes\":null,\"SSLCertBytes\":null,\"SSLKEYBytes\":null,\"redact-info-log\":false,\"encryption\":{\"data-encryption-method\":\"plaintext\",\"data-key-rotation-period\":\"168h0m0s\",\"master-key\":{\"type\":\"plaintext\",\"key-id\":\"\",\"region\":\"\",\"endpoint\":\"\",\"path\":\"\"}}}"]
[2022/08/04 16:16:01.403 +00:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7128055549888603271]
[2022/08/04 16:16:01.425 +00:00] [INFO] [history_buffer.go:147] ["start from history index"] [start-index=0]
[2022/08/04 16:16:01.443 +00:00] [INFO] [server.go:1226] ["start to campaign pd leader"] [campaign-pd-leader-name=basic-pd-0]
[2022/08/04 16:16:01.443 +00:00] [INFO] [lease.go:65] ["lease granted"] [lease-id=6320101042326176524] [lease-timeout=3] [purpose="pd leader election"]
[2022/08/04 16:16:01.445 +00:00] [INFO] [leadership.go:122] ["check campaign resp"] [resp="{\"header\":{\"cluster_id\":12404395727190538417,\"member_id\":11414692299496871861,\"revision\":6,\"raft_term\":2},\"succeeded\":true,\"responses\":[{\"Response\":{\"ResponsePut\":{\"header\":{\"revision\":6}}}}]}"]
[2022/08/04 16:16:01.445 +00:00] [INFO] [leadership.go:131] ["write leaderData to leaderPath ok"] [leaderPath=/pd/7128055549888603271/leader] [purpose="pd leader election"]
[2022/08/04 16:16:01.445 +00:00] [INFO] [server.go:1252] ["campaign pd leader ok"] [campaign-pd-leader-name=basic-pd-0]
[2022/08/04 16:16:01.445 +00:00] [INFO] [server.go:1259] ["initializing the global TSO allocator"]
[2022/08/04 16:16:01.445 +00:00] [INFO] [lease.go:129] ["start lease keep alive worker"] [interval=1s] [purpose="pd leader election"]
[2022/08/04 16:16:01.447 +00:00] [INFO] [tso.go:218] ["sync and save timestamp"] [last=0001/01/01 00:00:00.000 +00:00] [save=2022/08/04 16:16:04.447 +00:00] [next=2022/08/04 16:16:01.447 +00:00]
[2022/08/04 16:16:01.448 +00:00] [INFO] [server.go:1352] ["server enable region storage"]
[2022/08/04 16:16:01.456 +00:00] [INFO] [id.go:122] ["idAllocator allocates a new id"] [alloc-id=1000]
[2022/08/04 16:16:01.456 +00:00] [INFO] [util.go:78] ["load cluster version"] [cluster-version=0.0.0]
[2022/08/04 16:16:01.456 +00:00] [INFO] [server.go:1303] ["PD cluster leader is ready to serve"] [pd-leader-name=basic-pd-0]
[2022/08/04 16:16:02.444 +00:00] [INFO] [server.go:962] ["PD server config is updated"] [new="{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"http://basic-pd-0.basic-pd-peer.myapp.svc:2379\",\"trace-region-flow\":\"true\",\"flow-round-by-digit\":3}"] [old="{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"auto\",\"trace-region-flow\":\"true\",\"flow-round-by-digit\":3}"]
[2022/08/04 16:16:03.446 +00:00] [INFO] [dbstore.go:33] ["Dashboard initializing local storage file"] [path=/var/lib/pd/dashboard.sqlite.db]
[2022/08/04 16:16:03.630 +00:00] [INFO] [version.go:33] ["TiDB Dashboard started"] [internal-version=2022.01.17.1] [standalone=No] [pd-version=v5.4.1] [build-time="2022-04-28 02:07:55"] [build-git-hash=e8076b5c79ba]
[2022/08/04 16:16:03.630 +00:00] [INFO] [proxy.go:209] ["start serve requests to remotes"] [endpoint=127.0.0.1:37185] [remotes="[]"]
[2022/08/04 16:16:03.630 +00:00] [INFO] [manager.go:205] ["Dashboard server is started"]
[2022/08/04 16:16:03.630 +00:00] [INFO] [proxy.go:209] ["start serve requests to remotes"] [endpoint=127.0.0.1:43231] [remotes="[]"]
[2022/08/04 16:16:03.630 +00:00] [WARN] [dynamic_config_manager.go:164] ["Dynamic config does not exist in etcd"]
[2022/08/04 16:16:03.631 +00:00] [INFO] [dynamic_config_manager.go:188] ["Save dynamic config to etcd"] [json="{\"keyvisual\":{\"auto_collection_disabled\":false,\"policy\":\"db\",\"policy_kv_separator\":\"\"},\"profiling\":{\"auto_collection_targets\":null,\"auto_collection_duration_secs\":0,\"auto_collection_interval_secs\":0},\"sso\":{\"core_config\":{\"enabled\":false,\"client_id\":\"\",\"discovery_url\":\"\",\"is_read_only\":false},\"auth_url\":\"\",\"token_url\":\"\",\"user_info_url\":\"\",\"sign_out_url\":\"\"}}"]
[2022/08/04 16:16:03.690 +00:00] [INFO] [manager.go:74] ["Key visual service is started"]

$ k logs basic-discovery-6dff9bd7bf-6495n
I0804 16:15:51.537648       1 version.go:38] Welcome to TiDB Operator.
I0804 16:15:51.537687       1 version.go:39] TiDB Operator Version: version.Info{GitVersion:"v1.3.6", GitCommit:"acf57346c962a0bdb9d5c1de8870c332c5adc185", GitTreeState:"clean", BuildDate:"2022-07-05T02:20:15Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
I0804 16:15:51.840059       1 main.go:109] starting TiDB Discovery server, listening on 0.0.0.0:10261
I0804 16:15:51.840510       1 main.go:116] starting TiDB Proxy server, listening on 0.0.0.0:10262
I0804 16:15:52.528027       1 discovery.go:78] advertisePeerUrl is: basic-pd-0.basic-pd-peer.myapp.svc:2380
I0804 16:15:52.542638       1 server.go:98] generated args for basic-pd-0.basic-pd-peer.myapp.svc:2380
: --initial-cluster=basic-pd-0=http://basic-pd-0.basic-pd-peer.myapp.svc:2380
, register-type: pd

KanShiori commented 2 years ago

The operator deploys TiKV and TiDB only after PD is ready. Is there any error in the PD log?

davidp1404 commented 2 years ago

Hello, I've found the root causes of my problem, just for your information. I had a NetworkPolicy that limits connections to pods to traffic from inside the namespace, as best practice dictates. I discovered that after deleting it, the missing StatefulSets appear, so my conclusion is that your controller checks reachability once the discovery and PD components are ready, before launching the TiKV and TiDB StatefulSets. Unfortunately, your controller doesn't report any clear error, so this is quite difficult to track down (an opportunity to improve?). On the other hand, if you have ResourceQuotas and a LimitRange, as best practice dictates, your sample doesn't work, and the TiKV component reports this error:

$ k logs basic-tikv-0
starting tikv-server ...
/tikv-server --pd=http://basic-pd:2379 --advertise-addr=basic-tikv-0.basic-tikv-peer.csw-int-mvp.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=basic-tikv-0.basic-tikv-peer.csw-int-mvp.svc:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml

[2022/08/09 13:32:48.078 +00:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Release Version:   5.4.1"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Edition:           Community"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Git Commit Hash:   91fe561f0af87cc47359cdf61d6e6838471cb644"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Git Commit Branch: heads/refs/tags/v5.4.1"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["UTC Build Time:    Unknown (env var does not exist when building)"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Rust Version:      rustc 1.56.0-nightly (2faabf579 2021-07-27)"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Enable Features:   jemalloc mem-profiling portable sse test-engines-rocksdb cloud-aws cloud-gcp cloud-azure"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [lib.rs:86] ["Profile:           dist_release"]
[2022/08/09 13:32:48.079 +00:00] [INFO] [mod.rs:73] ["cgroup quota: memory=524288000, cpu=Some(0.5), cores={9, 29, 19, 30, 8, 15, 18, 12, 10, 2, 16, 5, 20, 23, 31, 25, 27, 26, 13, 11, 7, 4, 3, 6, 14, 0, 22, 24, 1, 28, 17, 21}"]
[2022/08/09 13:32:48.096 +00:00] [INFO] [mod.rs:80] ["memory limit in bytes: 524288000, cpu cores quota: 0.5"]
[2022/08/09 13:32:48.096 +00:00] [WARN] [server.rs:1426] ["check: kernel"] [err="kernel parameters net.core.somaxconn got 4096, expect 32768"]
[2022/08/09 13:32:48.096 +00:00] [WARN] [server.rs:1426] ["check: kernel"] [err="kernel parameters net.ipv4.tcp_syncookies got 1, expect 0"]
[2022/08/09 13:32:48.096 +00:00] [WARN] [server.rs:1426] ["check: kernel"] [err="kernel parameters vm.swappiness got 60, expect 0"]
[2022/08/09 13:32:48.152 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379]
[2022/08/09 13:32:48.154 +00:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because socket() failed."]
[2022/08/09 13:32:48.155 +00:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2022/08/09 13:32:48.156 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f8567818330 for subchannel 0x7f856782d1c0"]
[2022/08/09 13:32:48.157 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://basic-pd-0.basic-pd-peer.csw-int-mvp.svc:2379]
[2022/08/09 13:32:48.160 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f856703c390 for subchannel 0x7f856702a1c0"]
[2022/08/09 13:32:48.161 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://basic-pd-0.basic-pd-peer.csw-int-mvp.svc:2379]
[2022/08/09 13:32:48.162 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f856663c390 for subchannel 0x7f856662a1c0"]
[2022/08/09 13:32:48.163 +00:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://basic-pd-0.basic-pd-peer.csw-int-mvp.svc:2379]
[2022/08/09 13:32:48.163 +00:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"http://basic-pd:2379\"]"]
[2022/08/09 13:32:48.164 +00:00] [INFO] [server.rs:347] ["connect to PD cluster"] [cluster_id=7129868439341324402]
[2022/08/09 13:32:48.164 +00:00] [INFO] [config.rs:1982] ["readpool.storage.use-unified-pool is not set, set to true by default"]
[2022/08/09 13:32:48.164 +00:00] [INFO] [config.rs:2005] ["readpool.coprocessor.use-unified-pool is not set, set to true by default"]
[2022/08/09 13:32:48.164 +00:00] [FATAL] [setup.rs:302] ["invalid configuration: max_background_jobs should be greater than 0 and less than or equal to 0"]

At least here we have an error to search for, but it is definitely not easy to map it to resources that are too low as a result of the default LimitRange policy being applied. In the end, I can say that TiDB works in a Kubernetes cluster with common best practices applied. I hope this helps you make your great product even better and, ideally, include in your documentation some guidance on writing NetworkPolicies and on minimum values for LimitRanges. Thanks!
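For anyone hitting the same FATAL line, here is a minimal sketch of one possible workaround: give TiKV explicit requests and limits in the TidbCluster so the LimitRange defaults are not applied to it. The function name is hypothetical, the 1 CPU / 2Gi values in the usage note are illustrative assumptions (the thread states no minimums), and it assumes the failure comes from the 0.5 CPU quota shown in the log above.

```shell
#!/bin/bash
# Hypothetical helper: prints a merge patch that sets explicit CPU and memory
# requests/limits for TiKV. The values passed in are assumptions, not
# documented minimums. A merge patch leaves other keys, such as the existing
# storage request, untouched.
render_tikv_resources_patch() {
  local cpu="$1" memory="$2"
  cat <<EOF
spec:
  tikv:
    requests:
      cpu: "${cpu}"
      memory: ${memory}
    limits:
      cpu: "${cpu}"
      memory: ${memory}
EOF
}
```

It could be applied with something like `kubectl -n myapp patch tidbclusters.pingcap.com basic --type merge -p "$(render_tikv_resources_patch 1 2Gi)"`. If the LimitRange also defines a max below these values, the pods would still be rejected, so the LimitRange itself may need adjusting as well.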

yiduoyunQ commented 2 years ago

The operator controller did report this error:

E0804 16:26:05.774418       1 tidb_cluster_controller.go:126] TidbCluster: myapp/basic, sync failed TidbCluster: myapp/basic .Status.PD.Synced = false, can't failover, requeuing
E0804 16:26:35.762262       1 pd_member_manager.go:190] failed to sync TidbCluster: [myapp/basic]'s status, error: Get http://basic-pd.myapp:2379/pd/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

It is reported from pd_member_manager.go#L335-L343. The operator syncs PD first and retries continually if that sync fails, which blocks the TiKV and TiDB sync logic.
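Since the only symptom is that repeated log line, a small filter can make it easier to spot. This is a rough sketch, not part of the operator: the function name is hypothetical, and it only matches the message format of the two log lines quoted above.

```shell
#!/bin/bash
# Hypothetical helper: reads tidb-controller-manager logs on stdin and prints
# each distinct PD URL that the operator timed out on. It relies only on the
# log format quoted in this thread.
pd_unreachable_urls() {
  grep 'pd_member_manager.go' \
    | grep -F 'Client.Timeout exceeded' \
    | grep -o 'http://[^: ]*:[0-9]*/pd/[a-z/0-9]*' \
    | sort -u
}
```

A possible use, assuming the Deployment is named after the pod shown earlier: `kubectl -n tidb-admin logs deploy/tidb-controller-manager | pd_unreachable_urls`. Any URL it prints is an endpoint worth testing from a pod inside the operator namespace.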

The operator acts as a PD client and fetches PD health information from the /pd/api/v1/health API, which fails with a Client.Timeout exceeded error when there is a network problem.
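If the namespace has a restrictive NetworkPolicy, that channel has to be allowed explicitly. Below is a minimal sketch of such a rule, printed by a hypothetical helper function. The policy name, the app.kubernetes.io/component=pd pod label, and the kubernetes.io/metadata.name namespace label (set automatically on Kubernetes 1.21 and later) are assumptions; check them with `kubectl get pod --show-labels` and `kubectl get ns --show-labels` before relying on them.

```shell
#!/bin/bash
# Hypothetical helper: prints a NetworkPolicy that allows pods in the operator
# namespace to reach the PD client port (2379) in the cluster namespace.
# Assumed, not taken from the thread: the policy name, the component=pd pod
# label, and the kubernetes.io/metadata.name namespace label.
render_pd_allow_policy() {
  local cluster_ns="$1" operator_ns="$2"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tidb-operator-to-pd
  namespace: ${cluster_ns}
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/managed-by: tidb-operator
      app.kubernetes.io/component: pd
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ${operator_ns}
      ports:
        - protocol: TCP
          port: 2379
EOF
}
```

For the setup in this issue it would be applied as `render_pd_allow_policy myapp tidb-admin | kubectl apply -f -`. NetworkPolicies are additive, so this rule can sit next to an existing same-namespace-only policy. The operator may also need to reach other component ports, so treat this as a starting point only.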

davidp1404 commented 1 year ago

Thanks @yiduoyunQ, realizing that this communication channel is required solved my issue.