pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

PD server started failed when using pvc with cephfs #3515

Closed deadjoker closed 3 years ago

deadjoker commented 3 years ago

Question

I'm using a cephfs storageclass for PVCs in k8s. The PD server fails to start when a cephfs PVC is attached to the pod. Here is the log:

Name:      basic-pd-0.basic-pd-peer.tidb-cluster.svc
Address 1: 100.64.165.227 basic-pd-0.basic-pd-peer.tidb-cluster.svc.cluster.local
nslookup domain basic-pd-0.basic-pd-peer.tidb-cluster.svc.svc success
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=basic-pd-0 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379 --config=/etc/pd/pd.toml --initial-cluster=basic-pd-0=http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:42] ["Welcome to Placement Driver (PD)"]
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:43] [PD] [release-version=v4.0.8]
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:44] [PD] [edition=Community]
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:45] [PD] [git-hash=775b6a5ef517f8ab2f43fef6418bbfc7d6c9c9dc]
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:46] [PD] [git-branch=heads/refs/tags/v4.0.8]
[2020/11/24 06:58:36.885 +00:00] [INFO] [util.go:47] [PD] [utc-build-time="2020-10-30 08:15:09"]
[2020/11/24 06:58:36.885 +00:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2020/11/24 06:58:36.885 +00:00] [INFO] [server.go:216] ["PD Config"] [config="{\"client-urls\":\"http://0.0.0.0:2379\",\"peer-urls\":\"http://0.0.0.0:2380\",\"advertise-client-urls\":\"http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379\",\"advertise-peer-urls\":\"http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380\",\"name\":\"basic-pd-0\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"enable-grpc-gateway\":true,\"initial-cluster\":\"basic-pd-0=http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380\",\"initial-cluster-state\":\"new\",\"initial-cluster-token\":\"pd-cluster\",\"join\":\"\",\"lease\":3,\"log\":{\"level\":\"\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null},\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"basic-pd-0\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":20,\"max-merge-region-keys\":200000,\"split-merge-interval\":\"1h0m0s\",\"enable-one-way-merge\":\"false\",\"enable-cross-table-merge\":\"false\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"leader-schedule-policy\":\"count\",\"region-schedule-limit\":2048,\"replica-schedule-limit\":64,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":4,\"hot-region-cache-hits-threshold\":3,\"store-limit\":null,\"tolerant-size-ratio\":0,\"low-space-ratio\":0.8,\"high-space-ratio\":0.7,\"scheduler-max-waiting-operator\":5,\"enable-remove-down-replica\":\"true\",\"enable-replace-offline-replica\":\"true\",\"enable-make-up-replica\":\"true\",\"enable-remove-extra-replica\":\"true\",\"enable-location-replacement\":\"true\",\"enable-debug-metrics\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false,\"args-payloa
d\":\"\"},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"hot-region\",\"args\":null,\"disable\":false,\"args-payload\":\"\"},{\"type\":\"label\",\"args\":null,\"disable\":false,\"args-payload\":\"\"}],\"schedulers-payload\":null,\"store-limit-mode\":\"manual\"},\"replication\":{\"max-replicas\":3,\"location-labels\":\"\",\"strictly-match-label\":\"false\",\"enable-placement-rules\":\"false\"},\"pd-server\":{\"use-region-storage\":\"true\",\"max-gap-reset-ts\":\"24h0m0s\",\"key-type\":\"table\",\"runtime-services\":\"\",\"metric-storage\":\"\",\"dashboard-address\":\"auto\",\"trace-region-flow\":\"true\"},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"8GiB\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null},\"label-property\":null,\"WarningMsgs\":null,\"DisableStrictReconfigCheck\":false,\"HeartbeatStreamBindInterval\":\"1m0s\",\"LeaderPriorityCheckInterval\":\"1m0s\",\"dashboard\":{\"tidb-cacert-path\":\"\",\"tidb-cert-path\":\"\",\"tidb-key-path\":\"\",\"public-path-prefix\":\"\",\"internal-proxy\":false,\"enable-telemetry\":true,\"enable-experimental\":false},\"replication-mode\":{\"replication-mode\":\"majority\",\"dr-auto-sync\":{\"label-key\":\"\",\"primary\":\"\",\"dr\":\"\",\"primary-replicas\":0,\"dr-replicas\":0,\"wait-store-timeout\":\"1m0s\",\"wait-sync-timeout\":\"1m0s\"}}}"]
[2020/11/24 06:58:36.888 +00:00] [INFO] [server.go:189] ["register REST path"] [path=/pd/api/v1]
[2020/11/24 06:58:36.889 +00:00] [INFO] [server.go:189] ["register REST path"] [path=/swagger/]
[2020/11/24 06:58:36.890 +00:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/api/]
[2020/11/24 06:58:36.890 +00:00] [INFO] [server.go:189] ["register REST path"] [path=/dashboard/]
[2020/11/24 06:58:36.891 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://0.0.0.0:2380]"]
[2020/11/24 06:58:36.891 +00:00] [INFO] [systime_mon.go:27] ["start system time monitor"]
[2020/11/24 06:58:36.891 +00:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://0.0.0.0:2379]"]
[2020/11/24 06:58:36.891 +00:00] [INFO] [etcd.go:602] ["pprof is enabled"] [path=/debug/pprof]
[2020/11/24 06:58:36.892 +00:00] [INFO] [etcd.go:299] ["starting an etcd server"] [etcd-version=3.4.3] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.13] [go-os=linux] [go-arch=amd64] [max-cpu-set=12] [max-cpu-available=12] [member-initialized=false] [name=basic-pd-0] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster="basic-pd-0=http://basic-pd-0.basic-pd-peer.tidb-cluster.svc:2380"] [initial-cluster-state=new] [initial-cluster-token=pd-cluster] [quota-size-bytes=8589934592] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2020/11/24 06:58:36.918 +00:00] [PANIC] [backend.go:157] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="value too large for defined data type"]
panic: failed to open database

goroutine 184 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc001aa2000, 0xc001aa6380, 0x2, 0x2)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.uber.org/zap@v1.15.0/zapcore/entry.go:230 +0x546
go.uber.org/zap.(*Logger).Panic(0xc0003a81e0, 0x2428153, 0x17, 0xc001aa6380, 0x2, 0x2)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.uber.org/zap@v1.15.0/logger.go:225 +0x7f
go.etcd.io/etcd/mvcc/backend.newBackend(0xc0017d6480, 0x1a, 0x5f5e100, 0x2710, 0x2408f73, 0x5, 0x233333333, 0xc0003a81e0, 0x0)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/mvcc/backend/backend.go:157 +0x3de
go.etcd.io/etcd/mvcc/backend.New(...)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/mvcc/backend/backend.go:137
go.etcd.io/etcd/etcdserver.newBackend(0x7ffcd529b931, 0xa, 0x0, 0x0, 0x0, 0x0, 0xc0017dc900, 0x1, 0x1, 0xc0017dc700, ...)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/backend.go:52 +0x19e
go.etcd.io/etcd/etcdserver.openBackend.func1(0xc001a98120, 0xc001aca000)
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/backend.go:73 +0x95
created by go.etcd.io/etcd/etcdserver.openBackend
    /home/jenkins/agent/workspace/build_pd_multi_branch_v4.0.8/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/etcdserver/backend.go:72 +0x110

The storageclass works fine for other service pods. What's the issue?

DanielZhangQD commented 3 years ago

@deadjoker It may be caused by broken data. You can try to delete the tc and reinstall it. Please make sure that the PVs do not contain leftover data.

deadjoker commented 3 years ago

@DanielZhangQD The PVs were created when the pod started, so they were clean directories. After the pod failed, I checked the directories on Ceph and found that 'pd/member/snap/db' had been created, but according to the log the file could not be opened because of the error "value too large for defined data type".

deadjoker commented 3 years ago

Besides, I tried reinstalling it, but that did not help either.

deadjoker commented 3 years ago

@DanielZhangQD What puzzles me is that I set the replicas to 3 and 2 of the pods are running well. Only the 3rd pod cannot start.

DanielZhangQD commented 3 years ago

@deadjoker OK, understood. cc @dragonly Please help follow up on this issue with the PD team, thanks!

deadjoker commented 3 years ago

@DanielZhangQD @dragonly Our test environment is tidb-operator 1.1.7, k8s 1.18.9, docker 19.03.5, ceph 15.2.5, ceph-csi 3.1.0.

dragonly commented 3 years ago

@deadjoker Hi, could you please do the following to check whether the issue is reproducible:

  1. delete the TidbCluster and all related PVCs, for example kubectl delete tc --all -n ${tidb-cluster-ns} && kubectl delete pvc --all -n ${tidb-cluster-ns}
  2. wait for all pods to terminate, and make sure that all PVCs are deleted
  3. create a new TidbCluster using the original yaml
  4. see if the issue occurs again
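
The steps above could be scripted roughly as follows. This is only a sketch in Python that shells out to kubectl: the function names, the polling interval, and the manifest path are illustrative assumptions, and it assumes the namespace holds nothing but this TidbCluster (the `--all` flags in step 1 already imply that). Step 4, checking whether PD panics again, is left to the reader.

```python
import subprocess
import time


def delete_commands(namespace):
    """Step 1: the two kubectl commands quoted in the list above."""
    return [
        ["kubectl", "delete", "tc", "--all", "-n", namespace],
        ["kubectl", "delete", "pvc", "--all", "-n", namespace],
    ]


def remaining_resources(kubectl_output):
    """Parse `kubectl get ... -o name` output into a list of resource names."""
    return [line.strip() for line in kubectl_output.splitlines() if line.strip()]


def reset_cluster(namespace, manifest, poll_seconds=5, timeout_seconds=600):
    """Run steps 1 to 3: delete, wait for cleanup, recreate from the manifest."""
    for cmd in delete_commands(namespace):
        subprocess.run(cmd, check=True)

    # Step 2: poll until no pods or PVCs are left in the namespace.
    deadline = time.monotonic() + timeout_seconds
    while True:
        out = subprocess.run(
            ["kubectl", "get", "pod,pvc", "-n", namespace, "-o", "name"],
            check=True, capture_output=True, text=True,
        ).stdout
        left = remaining_resources(out)
        if not left:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"resources still present: {left}")
        time.sleep(poll_seconds)

    # Step 3: recreate the cluster from the original yaml (path is hypothetical).
    subprocess.run(["kubectl", "apply", "-f", manifest, "-n", namespace], check=True)
```

A call such as `reset_cluster("tidb-cluster", "tidb-cluster.yaml")` would match the namespace seen in the log; the file name is a placeholder for whatever yaml was used originally.
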

deadjoker commented 3 years ago

@dragonly I tried creating a new cluster using the original yaml, and it succeeded. I then created the cluster with a localstorage storageclass, which succeeded as well.

dragonly commented 3 years ago

@deadjoker :+1: Feel free to report here if anything goes wrong again.

@Yisaer PTAL at the potential PD issue here, as reported in the log: [2020/11/24 06:58:36.918 +00:00] [PANIC] [backend.go:157] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="value too large for defined data type"]
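
For reference, the error text in that log line appears to be the description Go's syscall package uses on Linux for the POSIX errno EOVERFLOW, which would suggest that a system call on the db file returned that errno. The thread does not establish which call failed or why. A minimal Python sketch for looking up which errno a given message corresponds to on the local system (the helper name is made up for illustration, and the comparison ignores case because the C library capitalizes the message differently from Go):

```python
import errno
import os

# Error text copied from the PD panic line above.
LOGGED_ERROR = "value too large for defined data type"


def errno_names_for(message):
    """Return symbolic errno names whose OS description matches `message`."""
    wanted = message.lower()
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code).lower() == wanted
    )


if __name__ == "__main__":
    print(errno_names_for(LOGGED_ERROR))
```

On Linux with glibc this should list EOVERFLOW; other C libraries word the message differently, in which case the list comes back empty.
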