tikv / pd

Placement driver for TiKV
Apache License 2.0

PD is corrupting `etcd` database on restart #8547

Open faelau opened 3 months ago

faelau commented 3 months ago

Bug Report

If you restart a PD pod, you receive the following panic:

[2024/08/19 15:16:25.624 +00:00] [WARN] [server.go:297] ["exceeded recommended request limit"] [max-request-bytes=157286400] [max-request-size="157 MB"] [recommended-request-bytes=10485760] [recommended-request-size="10 MB"]
2024-08-19 15:16:25.624904 W | pkg/fileutil: check file permission: directory "/var/lib/pd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
[2024/08/19 15:16:25.636 +00:00] [PANIC] [backend.go:173] ["failed to open database"] [path=/var/lib/pd/member/snap/db] [error="invalid database"]
panic: failed to open database
goroutine 251 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0x2?, {0x0?, 0x0?, 0xc0001364a0?})
    /root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0006f52b0, {0xc0012b9980, 0x2, 0x2})
    /root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc001299f80?, {0x304f490?, 0x16?}, {0xc0012b9980, 0x2, 0x2})
    /root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285 +0x51
go.etcd.io/etcd/mvcc/backend.newBackend({{0xc001299f80, 0x1a}, 0x5f5e100, 0x2710, {0x30191e2, 0x5}, 0x233333333, 0xc000053980, 0x0})
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20240320135013-950cd5fbe6ca/mvcc/backend/backend.go:173 +0x35c
go.etcd.io/etcd/mvcc/backend.New(...)
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20240320135013-950cd5fbe6ca/mvcc/backend/backend.go:151
go.etcd.io/etcd/etcdserver.newBackend({{0x7ffd58f397d8, 0xe}, {0x0, 0x0}, {0x0, 0x0}, {0xc0003086c0, 0x1, 0x1}, {0xc000308480, ...}, ...})
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20240320135013-950cd5fbe6ca/etcdserver/backend.go:53 +0x3b0
go.etcd.io/etcd/etcdserver.openBackend.func1()
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20240320135013-950cd5fbe6ca/etcdserver/backend.go:74 +0x45
created by go.etcd.io/etcd/etcdserver.openBackend in goroutine 1
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20240320135013-950cd5fbe6ca/etcdserver/backend.go:73 +0x106

The PVC uses the cephfs.csi.ceph.com provisioner. The cluster is running on microk8s.

Digging a bit deeper and checking the etcd database with bbolt yields the following error:

$ ./go/bin/bbolt page --all --format-value=redacted db
cannot read number of pages: the Meta Page has wrong (unexpected) magic
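The "wrong magic" message means the first meta page no longer starts with bbolt's magic number (0xED0CDAED). Assuming a 64-bit bbolt build, a 16-byte page header precedes the meta block, so an intact database carries the little-endian magic bytes `ad da 0c ed` at offset 16. A quick raw check, as a sketch (the db path is the one from the panic):

```shell
# Dump 4 bytes at offset 16 of the bolt file: the meta-page magic.
# An intact database prints "ad da 0c ed" (0xED0CDAED, little-endian);
# anything else means the meta page was overwritten or truncated.
od -An -tx1 -j16 -N4 /var/lib/pd/member/snap/db
```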

What did you do?

  1. Create a new TidbCluster:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: surrealdb
spec:
  version: v8.2.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    replicas: 1
    maxFailoverCount: 0
    mountClusterClientSecret: true
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    evictLeaderTimeout: 1m
    replicas: 3
    storageClassName: csi-cephfs-sc
    requests:
      storage: "16Gi"
    config:
      storage:
        reserve-space: "0MB"
      rocksdb:
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 5
    service:
      type: ClusterIP
    config: {}
  2. Restart the PD pod (e.g. when draining a node during a Kubernetes update)
  3. Observe the panic

What did you expect to see?

PD not corrupting the etcd database.

What did you see instead?

A panic of the PD container because the etcd database is corrupted.

What version of PD are you using (pd-server -V)?

[root@surrealdb-pd-0 /]# ./pd-server -V
Release Version: v8.2.0
Edition: Community
Git Commit Hash: c0ee2cd6c2eea7ad9372cc5bd00f6774abad6834
Git Branch: HEAD
UTC Build Time:  2024-07-04 09:39:38
faelau commented 3 weeks ago

Having some news on this.

The panic only happens when the PVCs are mounted with the Ceph kernel driver. If the PVCs are mounted with the FUSE driver, the panic does not occur.
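For anyone who wants to try the FUSE mount as a workaround: the ceph-csi CephFS driver lets you select the mounter per StorageClass via the `mounter` parameter. A minimal sketch (names and IDs are placeholders, and it assumes a ceph-csi based StorageClass; secrets and other parameters are omitted):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cephfs-sc-fuse   # hypothetical name
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <cluster-id>    # placeholder
  fsName: <cephfs-name>      # placeholder
  # ceph-csi: mount with the ceph-fuse client instead of the kernel driver
  mounter: fuse
reclaimPolicy: Retain
```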

This also happened a few days ago with another piece of software using BoltDB, so it seems to be some kind of upstream issue in boltdb/etcd?

Maybe an issue should also be opened there?