rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Snapshot creation via Rancher on a managed RKE2 cluster with an etcd database larger than 1 GB causes issues. #6385

Closed riuvshyn closed 4 months ago

riuvshyn commented 4 months ago

Environmental Info: RKE2 Version: v1.28.11+rke2r1

Rancher: 2.8.3

Node(s) CPU architecture, OS, and Version:

Linux ip-172-23-72-173 5.15.0-1064-aws #70~20.04.1-Ubuntu SMP Fri Jun 14 15:42:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 server 6 workers

Describe the bug: Creating a snapshot for a managed RKE2 cluster causes issues, and sometimes kills a server node, when the etcd database is larger than 1 GB.

Steps To Reproduce: Create an etcd snapshot via Rancher (or apply a cluster configuration change) on a managed RKE2 cluster whose etcd database is larger than 1 GB.

Expected behavior: Snapshot creation or a cluster configuration change completes seamlessly, without affecting cluster availability.

Actual behavior: Snapshot creation or a cluster configuration change degrades cluster API availability, and server nodes fail.

Additional context / logs: Snapshot creation is not the only trigger; cluster configuration changes seem to have a similar effect. It looks like applying a configuration change runs through similar processes, which cause the same issues.

Database size used to reproduce issue: 1.68 GiB

We also have a bunch of smaller clusters where this issue doesn't happen.

Sometimes the node recovers on its own after 10-15 minutes of crashlooping, but usually it doesn't and we have to rotate the server node.

The rke2-server service crashloops, panicking with the following errors:

1.

Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: time="2024-07-22T14:12:48Z" level=info msg="Applying Pod Security Admission Configuration"
Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: time="2024-07-22T14:12:48Z" level=info msg="ETCD snapshots are disabled"
Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: time="2024-07-22T14:12:48Z" level=info msg="Starting rke2 v1.28.11+rke2r1 (6b12d7a783238b72da8450fa1b6ec587cebb79ed)"
Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: time="2024-07-22T14:12:48Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: time="2024-07-22T14:12:48Z" level=info msg="Starting temporary etcd to reconcile with datastore"
Jul 22 14:12:48 ip-172-23-74-105 rke2[122535]: W0722 14:12:48.920427  122535 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2399", ServerName: "127.0.0.1:2399", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2399: connect: connection refused"
Jul 22 14:12:49 ip-172-23-74-105 rke2[122535]: {"level":"info","ts":"2024-07-22T14:12:49.605349Z","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
Jul 22 14:12:49 ip-172-23-74-105 rke2[122535]: {"level":"info","ts":"2024-07-22T14:12:49.605579Z","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
Jul 22 14:12:49 ip-172-23-74-105 rke2[122535]: {"level":"info","ts":"2024-07-22T14:12:49.605727Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.21.11 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":true,"name":"ip-172-23-74-105-809dfacc","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":0,"max-snapshots":0,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":8589934592,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Jul 22 14:12:49 ip-172-23-74-105 rke2[122535]: {"level":"info","ts":"2024-07-22T14:12:49.679233Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"73.280551ms"}
Jul 22 14:12:49 ip-172-23-74-105 rke2[122535]: {"level":"info","ts":"2024-07-22T14:12:49.953734Z","caller":"etcdserver/server.go:511","msg":"recovered v2 store from snapshot","snapshot-index":190023,"snapshot-size":"14 kB"}
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: {"level":"warn","ts":"2024-07-22T14:12:52.840769Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":190023,"snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/000000000002e647.snap.db","error":"snap: snapshot file doesn't exist"}
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: {"level":"panic","ts":"2024-07-22T14:12:52.840849Z","caller":"etcdserver/server.go:518","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/etcdserver/server.go:518\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/embed/etcd.go:245\ngithub.com/k3s-io/k3s/pkg/daemons/executor.(*Embedded).ETCD\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/executor/etcd.go:30\ngithub.com/k3s-io/k3s/pkg/etcd.(*ETCD).StartEmbeddedTemporary\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/etcd/etcd.go:1010\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).reconcileEtcd\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cluster/bootstrap.go:501\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).Bootstrap\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cluster/bootstrap.go:64\ngithub.com/k3s-io/k3s/pkg/daemons/control.prepare\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/control/server.go:253\ngithub.com/k3s-io/k3s/pkg/daemons/control.Server\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/control/server.go:35\ngithub.com/k3s-io/k3s/pkg/server.StartServer\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/server/server.go:56\ngithub.com/k3s-io/k3s/pkg/cli/server.run\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cli/server/server.go:503\ngithub.com/k3s-io/k3s/pkg/cli/server.RunWithControllers\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cli/server/server.go:50\ngithub.com/rancher/rke2/pkg/rke2.Server\n\t/source/pkg/rke2/rke2.go:125\ngithub.com/rancher/rke2/pkg/cli/cmds.ServerRun\n\t/source/pkg/cli/cmds/server.go:170\ngithub.com/urfave/cli.HandleAction\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524\ngithub.com/urfave/cli.Command.Run\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175\ngithub.com/urfave/cli.(*App).Run\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277\nmain.main\n\t/source/main.go:23\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"}
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: panic: failed to recover v3 backend from snapshot
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: goroutine 1 [running]:
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x80?, {0x0?, 0x0?, 0xc000d2e220?})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196 +0x54
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000e6f450, {0xc00075bac0, 0x1, 0x1})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262 +0x3ec
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: go.uber.org/zap.(*Logger).Panic(0xc000f91900?, {0x3e711ff?, 0x0?}, {0xc00075bac0, 0x1, 0x1})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285 +0x51
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: go.etcd.io/etcd/server/v3/etcdserver.NewServer({{0xc000f91900, 0x19}, {0x0, 0x0}, {0x0, 0x0}, {0xc000f96d80, 0x1, 0x1}, {0xc000f96c60, ...}, ...})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/etcdserver/server.go:518 +0x1d6b
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000778300)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/embed/etcd.go:245 +0x10d8
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/daemons/executor.(*Embedded).ETCD(_, {_, _}, {{{0xc0006ac8d0, 0x15}, {0x0, 0x0}, {0x0, 0x0}}, {0xc0000ad4e0, ...}, ...}, ...)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/executor/etcd.go:30 +0x12e
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).StartEmbeddedTemporary(0xc00110d1d0, {0x464bd38, 0xc00110d180})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/etcd/etcd.go:1010 +0x6d8
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/cluster.(*Cluster).reconcileEtcd(0xc00069d5f0, {0x464bd38, 0xc000b6f3b0})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cluster/bootstrap.go:501 +0x285
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Bootstrap(0xc00069d5f0, {0x464bd38, 0xc000b6f3b0}, 0x0)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cluster/bootstrap.go:64 +0x1cb
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/daemons/control.prepare({0x464bd38, 0xc000b6f3b0}, 0xc000cce008)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/control/server.go:253 +0x2c7
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/daemons/control.Server({0x464bd38?, 0xc000b6f3b0}, 0xc000cce008)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/daemons/control/server.go:35 +0xf7
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/server.StartServer({0x464bd38?, 0xc000b6f3b0}, 0xc000cce000, 0x1?)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/server/server.go:56 +0x85
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/cli/server.run(0xc000cd4dc0, 0x68af260, {0xc001651090, 0x1, 0x36be440?}, {0xc00133e908, 0x0, 0x8?})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cli/server/server.go:503 +0x35b1
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/k3s-io/k3s/pkg/cli/server.RunWithControllers(...)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/k3s-io/k3s@v1.28.11-rc2.0.20240619164820-617b0e84f419/pkg/cli/server/server.go:50
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/rancher/rke2/pkg/rke2.Server(_, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0xc001172436, 0x8}, {{0x0, ...}, ...}, ...})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /source/pkg/rke2/rke2.go:125 +0x685
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/rancher/rke2/pkg/cli/cmds.ServerRun(0xc000cd4dc0?)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /source/pkg/cli/cmds/server.go:170 +0x89
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/urfave/cli.HandleAction({0x35f9980?, 0x41e53d0?}, 0x6?)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524 +0x50
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/urfave/cli.Command.Run({{0x3dd8de3, 0x6}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x3e0cbb5, 0x15}, {0xc0011715a8, ...}, ...}, ...)
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175 +0x63e
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: github.com/urfave/cli.(*App).Run(0xc000dd6c40, {0xc001192a00, 0x64, 0x70})
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277 +0xb27
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]: main.main()
Jul 22 14:12:52 ip-172-23-74-105 rke2[122535]:         /source/main.go:23 +0x97b
Jul 22 14:12:52 ip-172-23-74-105 systemd[1]: rke2-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

2.

Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: time="2024-07-22T11:57:02Z" level=info msg="Applying Pod Security Admission Configuration"
Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: time="2024-07-22T11:57:02Z" level=info msg="ETCD snapshots are disabled"
Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: time="2024-07-22T11:57:02Z" level=info msg="Starting rke2 v1.28.11+rke2r1 (6b12d7a783238b72da8450fa1b6ec587cebb79ed)"
Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: time="2024-07-22T11:57:02Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: time="2024-07-22T11:57:02Z" level=info msg="Starting temporary etcd to reconcile with datastore"
Jul 22 11:57:02 ip-172-23-71-220 rke2[879329]: W0722 11:57:02.203920  879329 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2399", ServerName: "127.0.0.1:2399", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2399: connect: connection refused"
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: W0722 11:57:03.204324  879329 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2399", ServerName: "127.0.0.1:2399", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2399: connect: connection refused"
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: {"level":"info","ts":"2024-07-22T11:57:03.858429Z","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: {"level":"info","ts":"2024-07-22T11:57:03.858704Z","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: {"level":"info","ts":"2024-07-22T11:57:03.858847Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.21.11 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":32,"max-cpu-available":32,"member-initialized":true,"name":"ip-172-23-71-220.eu-central-1.compute.internal-a5c8494c","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":0,"max-snapshots":0,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: panic: assertion failed: Page expected to be: 427240, but self identifies as 2482454681943172642
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: goroutine 277 [running]:
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt._assert(...)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:1387
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*page).fastCheck(0x7f758e474000, 0x684e8)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/page.go:57 +0x1d9
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*Tx).page(0x0?, 0x0?)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/tx.go:534 +0x7b
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f77ed4123a8?, {0xc001148820?, 0x1, 0xa}, 0xc0013f15a8)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/tx.go:546 +0x5d
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*Tx).forEachPage(...)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/tx.go:542
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc000fe09a0, 0xc000fe09b8, 0xc0013f1770, 0xc0013f1740, {0x462a198?, 0x68e6fa0}, 0xc00004ab40)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/tx_check.go:83 +0x111
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*DB).freepages(0x3ddd43e?)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:1205 +0x225
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:417 +0xc5
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: sync.(*Once).doSlow(0x412f85?, 0x219ba60?)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /usr/local/go/src/sync/once.go:74 +0xbf
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: sync.(*Once).Do(...)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /usr/local/go/src/sync/once.go:65
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc000d5f200?)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:413 +0x45
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/bbolt.Open({0xc000d49a80, 0x37}, 0x0?, 0xc000037c20)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:295 +0x430
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000d49a80, 0x37}, 0x5f5e100, 0x2710, {0x3ddd43e, 0x7}, 0x280000000, 0xc000f93300, 0x0, 0x0, ...})
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/mvcc/backend/backend.go:187 +0x235
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/mvcc/backend/backend.go:163
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0xc000d48ac0, 0x37}, {0x0, 0x0}, {0x0, 0x0}, {0xc001296a20, 0x1, 0x1}, {0xc001296900, ...}, ...}, ...)
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/etcdserver/backend.go:55 +0x398
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/etcdserver/backend.go:76 +0x59
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]: created by go.etcd.io/etcd/server/v3/etcdserver.openBackend in goroutine 1
Jul 22 11:57:03 ip-172-23-71-220 rke2[879329]:         /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.13-k3s1/etcdserver/backend.go:75 +0x16b
Jul 22 11:57:04 ip-172-23-71-220 systemd[1]: rke2-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

3.

Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: time="2024-07-22T11:39:05Z" level=info msg="Applying Pod Security Admission Configuration"
Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: time="2024-07-22T11:39:05Z" level=info msg="ETCD snapshots are disabled"
Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: time="2024-07-22T11:39:05Z" level=info msg="Starting rke2 v1.28.11+rke2r1 (6b12d7a783238b72da8450fa1b6ec587cebb79ed)"
Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: time="2024-07-22T11:39:05Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: time="2024-07-22T11:39:05Z" level=info msg="Starting temporary etcd to reconcile with datastore"
Jul 22 11:39:05 ip-172-23-71-220 rke2[863464]: W0722 11:39:05.802104  863464 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2399", ServerName: "127.0.0.1:2399", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2399: connect: connection refused"
Jul 22 11:39:06 ip-172-23-71-220 rke2[863464]: W0722 11:39:06.802539  863464 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2399", ServerName: "127.0.0.1:2399", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2399: connect: connection refused"
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: {"level":"info","ts":"2024-07-22T11:39:07.431053Z","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: {"level":"info","ts":"2024-07-22T11:39:07.431356Z","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: {"level":"info","ts":"2024-07-22T11:39:07.431498Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.21.11 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":32,"max-cpu-available":32,"member-initialized":true,"name":"ip-172-23-71-220.eu-central-1.compute.internal-a5c8494c","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":0,"max-snapshots":0,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: panic: freepages: failed to get all reachable pages (page 3452: multiple references (stack: [427454 426752 3452]))
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: goroutine 263 [running]:
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: go.etcd.io/bbolt.(*DB).freepages.func2()
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:1202 +0x8d
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]: created by go.etcd.io/bbolt.(*DB).freepages in goroutine 262
Jul 22 11:39:07 ip-172-23-71-220 rke2[863464]:         /go/pkg/mod/go.etcd.io/bbolt@v1.3.9/db.go:1200 +0x1e5
Jul 22 11:39:07 ip-172-23-71-220 systemd[1]: rke2-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

full rke2-server log: https://gist.github.com/riuvshyn/05755e51a72694d2abb503aa0915b1bb

riuvshyn commented 4 months ago

UPD: These errors on snapshot creation start appearing once the DB grows to around 500 MB; with a 441 MB database this does not happen and snapshot creation goes through without issues.

Also, creating a snapshot of the 1.6 GB DB with the rke2 CLI works fine, without any issues:

root@ip-172-23-74-105:~# rke2 etcd-snapshot save
INFO[0032] Snapshot on-demand-ip-172-23-74-105-1721661041 saved.
INFO[0032] Snapshot on-demand-ip-172-23-74-105-1721661041 saved.

brandond commented 4 months ago

It really sounds like you're running out of disk space, and the database files on disk are getting corrupted. The errors you're showing are regarding the internal checkpoint snapshots within the database, NOT the snapshots that are taken for backups. I'm also confused why RKE2 is being restarted when you try to save a snapshot. All of these things point to rke2 crashing due to issues with the node, and then etcd crashing AGAIN on startup due to the files being corrupt.

Can you confirm how much free space you have on the node in question? What kind of storage are you using here?

riuvshyn commented 4 months ago

Thanks for looking in to this @brandond

It really sounds like you're running out of disk space, and the database files on disk are getting corrupted. The errors you're showing are regarding the internal checkpoint snapshots within the database, NOT the snapshots that are taken for backups. I'm also confused why RKE2 is being restarted when you try to save a snapshot. All of these things point to rke2 crashing due to issues with the node, and then etcd crashing AGAIN on startup due to the files being corrupt.

This indeed sounds about right. I had assumed an rke2-server restart was expected during snapshot creation; if it is not, then I clearly have an issue going on here... By the way, I think rke2-server does not restart when I create a snapshot with a database smaller than 400 MB, and everything works fine.

There is plenty of space available:

Filesystem       Size  Used Avail Use% Mounted on
/dev/root        146G   16G  131G  11% /

I have default configuration for etcd, so wal files are stored in: /var/lib/rancher/rke2/server/db/etcd/member/wal

I have also noticed in the rke2-server logs that it runs a defrag, which sometimes fails. I tried running defrag with etcdctl and see a similar picture: with a DB size under 400 MB, defrag works fine:

root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw defrag --cluster true
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------

Finished defragmenting etcd member[https://172.23.72.77:2379]
Finished defragmenting etcd member[https://172.23.71.127:2379]
Finished defragmenting etcd member[https://172.23.70.32:2379]
Finished defragmenting etcd member[https://172.23.72.109:2379]
Finished defragmenting etcd member[https://172.23.70.159:2379]
Finished defragmenting etcd member[https://172.23.71.196:2379]
root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw endpoint-status
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  https://172.23.72.77:2379 | 3090a09999c2130b |  3.5.13 |  387 MB |      true |      false |         2 |     213186 |             213186 |        |
| https://172.23.71.127:2379 | 54ba414cb1ac9001 |  3.5.13 |  387 MB |     false |      false |         2 |     213187 |             213187 |        |
|  https://172.23.70.32:2379 | 68c823547be4c89c |  3.5.13 |  388 MB |     false |      false |         2 |     213187 |             213187 |        |
| https://172.23.72.109:2379 | c64eed6dfe68e21d |  3.5.13 |  370 MB |     false |      false |         2 |     213188 |             213188 |        |
| https://172.23.70.159:2379 | d8ed66ceb9e2f7af |  3.5.13 |  333 MB |     false |      false |         2 |     213188 |             213188 |        |
| https://172.23.71.196:2379 | f0e7829c9749c886 |  3.5.13 |  333 MB |     false |      false |         2 |     213189 |             213189 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
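For reference, since etcdctl-tw is a site-specific wrapper: the equivalent invocation with plain etcdctl on an RKE2 server node would look roughly like the sketch below, using RKE2's default etcd client certificate paths (not taken from this cluster's wrapper).

# Query the status of every member in the cluster, rendered as a table.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table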

But when the DB grows larger, it starts timing out:

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  https://172.23.72.77:2379 | 3090a09999c2130b |  3.5.13 |  610 MB |      true |      false |         2 |     381990 |             381990 |        |
| https://172.23.71.127:2379 | 54ba414cb1ac9001 |  3.5.13 |  615 MB |     false |      false |         2 |     381990 |             381990 |        |
|  https://172.23.70.32:2379 | 68c823547be4c89c |  3.5.13 |  598 MB |     false |      false |         2 |     381990 |             381990 |        |
| https://172.23.72.109:2379 | c64eed6dfe68e21d |  3.5.13 |  607 MB |     false |      false |         2 |     381990 |             381990 |        |
| https://172.23.70.159:2379 | d8ed66ceb9e2f7af |  3.5.13 |  600 MB |     false |      false |         2 |     381990 |             381990 |        |
| https://172.23.71.196:2379 | f0e7829c9749c886 |  3.5.13 |  596 MB |     false |      false |         2 |     381990 |             381990 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw defrag --cluster true
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------

{"level":"warn","ts":"2024-07-22T18:29:24.112434Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.72.77:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:29.11329Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.71.127:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:34.117321Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.70.32:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:39.121301Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.72.109:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:44.1253Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.70.159:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:49.129294Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.71.196:2379] (context deadline exceeded)
FATA[0030] execing command in container: command terminated with exit code 1

So in this case it is nothing to do with RKE2 or Rancher; I simply can't run defrag on etcd once it exceeds roughly 400 MB.

I am running out of something somewhere, but I can't figure out what it is.
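One detail worth noting here (an observation, not from the thread): etcdctl's per-request deadline defaults to 5 seconds, and defragmenting a multi-hundred-MB database on slow storage can easily take longer than that, which would produce exactly these DeadlineExceeded errors. On a test cluster, raising the timeout helps distinguish a slow defrag from a stuck one:

# etcdctl's --command-timeout defaults to 5s; give a large defrag more headroom.
etcdctl defrag --cluster --command-timeout=300s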

riuvshyn commented 4 months ago

In the etcd logs, by the way:

{"level":"warn","ts":"2024-07-22T19:13:56.869278Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.000422535s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
brandond commented 4 months ago

Yeah, your disk is too slow. I asked earlier what sort of disks these are and didn't see a response. Etcd really needs fast SSD or other non-rotational storage to perform well.

Also, don't manually defrag the datastore. This pauses all IO while the files are rewritten, and is VERY disruptive to the cluster. RKE2 automatically defrags each cluster member when they are starting, before the apiserver is using them to serve traffic.
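A common way to quantify "too slow" for etcd is the fio fsync benchmark from the etcd documentation: etcd's WAL wants 99th-percentile fdatasync latency under roughly 10 ms. A sketch, assuming the etcd data lives on the volume backing /var/lib/rancher/rke2/server/db:

# Write ~22 MB in 2300-byte blocks with an fdatasync after every write,
# mimicking etcd's WAL pattern; inspect the reported fsync percentiles.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/rke2/server/db \
    --size=22m --bs=2300 --name=etcd-disk-check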

riuvshyn commented 4 months ago

I was looking at this: https://github.com/k3s-io/k3s/discussions/9207 and this looks somewhat similar

riuvshyn commented 4 months ago

Yeah, your disk is too slow. I asked earlier what sort of disks these are and didn't see a response. Etcd really needs fast SSD or other non-rotational storage to perform well.

Oh, I missed that. I am using a GP3 volume:

(screenshot: GP3 volume configuration)

Also, don't manually defrag the datastore. This pauses all IO while the files are rewritten, and is VERY disruptive to the cluster. RKE2 automatically defrags each cluster member when they are starting, before the apiserver is using them to serve traffic.

Yeah, thanks, I was doing this on test clusters just to validate.

brandond commented 4 months ago

That should generally be OK. Do you have other IO intensive tasks using these same disks? What do the cloudwatch metrics for these disks show when you get IO latency warnings from etcd?

riuvshyn commented 4 months ago

I think I have managed to improve things by using a separate volume for etcd with more IOPS and throughput. So storage performance was indeed part of the problem.

Also, I have noticed the following errors on clusters with a database size over 3-4 GB:

msg="Failed to process config: failed to process /var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml: failed to list /v1, Kind=ConfigMap for  kube-system/rke2-etcd-snapshot-extra-metadata: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (4063225176 vs. 2147483647)"

@brandond do you have any idea if this limit is configurable anywhere?

I have also noticed that etcd snapshots are created only on the init node, while I was under the impression that an etcd snapshot is supposed to be created on every node with the etcd role. I also noticed your comment at https://github.com/rancher/rke2/issues/6283#issuecomment-2214818000 mentioning the same thing.

But in my case a single node creates 3 snapshots instead of each node creating its own:

(screenshot: snapshot list showing three snapshots taken by a single node)

Not sure whether that might be a misconfiguration or a behaviour change in Rancher.

riuvshyn commented 4 months ago

Oh, ignore this:

msg="Failed to process config: failed to process /var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml: failed to list /v1, Kind=ConfigMap for  kube-system/rke2-etcd-snapshot-extra-metadata: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (4063225176 vs. 2147483647)"

I believe this is happening because I am using large ConfigMaps to inflate the DB size, and once it grows past 2 GB the grpc client hits its 2 GB limit.

It would still be nice to know whether that is configurable somewhere. I understand this is not a common usage pattern, but it is definitely not unrealistic.

brandond commented 4 months ago

I have also noticed that etcd snapshots are created only on the init node

This is https://github.com/rancher/rke2/issues/6325

This is happening i believe because I am actually using large config maps to inflate DB size and when it gets to more than 2GB grpc client hitting 2GB limit.

Yeah, I can't say I've seen that before; usually you get a "Request entity too large" error from the apiserver when the ConfigMap is >1MB, before you run into problems with grpc message size.

Etcd also has a maximum database size that defaults to 2GB. You can raise that up as far as 8GB with etcd args. If you exceed that the datastore will go read-only though, instead of giving you RPC errors.
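In RKE2 that etcd flag is passed through via etcd-arg in the server config; a minimal sketch (the reporter confirms below that this cluster already runs with an 8 GB quota):

# /etc/rancher/rke2/config.yaml
etcd-arg:
  - "quota-backend-bytes=8589934592"  # 8 GiB, the practical upper bound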

riuvshyn commented 4 months ago

oh, that's good to know about https://github.com/rancher/rke2/issues/6325 thanks for sharing it.

Yeah, I can't say I've seen that before; usually you get a "Request entity too large" error from the apiserver when the ConfigMap is >1MB, before you run into problems with grpc message size. Etcd also has a maximum database size that defaults to 2GB. You can raise that up as far as 8GB with etcd args. If you exceed that the datastore will go read-only though, instead of giving you RPC errors.

I also initially thought I was hitting either the max resource size or the etcd DB size limit, but apparently it is neither: I have quota-backend-bytes set to 8 GB for etcd.

... failed to list /v1, Kind=ConfigMap ...

I believe it is trying to get/list all ConfigMaps, or something similar, and their total size is more than 2 GB.

trying to send message larger than max (4063225176 vs. 2147483647)

In this case I indeed had 4GB of config maps.

This looks like some grpc client has a 2 GB message size limit, which causes rke2-server to fail to process the config. (2147483647 bytes is 2^31 − 1, i.e. math.MaxInt32, so the ~4 GB response exceeds the largest message gRPC will handle.)

brandond commented 4 months ago

I'm not sure why it would be trying to list all the configmaps just to patch that one; that is pretty odd. This appears to be in the deploy controller (wrangler desiredset/apply) that manages deployment of manifests from the /var/lib/rancher/rke2/server/manifests dir though, not the snapshot controller itself.

Did you put all 4GB of configmaps in that one file!?

riuvshyn commented 4 months ago

Did you put all 4GB of configmaps in that one file!?

No, no. I just filled the cluster with lots of ~500 KB ConfigMaps to inflate the DB size, so I could reproduce the issues I originally had once the etcd DB grew past 400 MB. Using a separate volume for etcd seems to have fixed my original issues, so I then wanted to see what happens if I inflate the DB even further; once I reached 3 GB+ (mostly by creating ConfigMaps) I started noticing that error.

Also, I was creating the ConfigMaps with plain kubectl create -f ..., not via the /var/lib/rancher/rke2/server/manifests dir — roughly the kind of loop sketched below.
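A hypothetical version of that inflation loop (names, namespace, and counts are illustrative, not from the thread):

# Create many ~500 KB ConfigMaps to grow the etcd database;
# 2000 of these adds roughly 1 GB of keyspace.
for i in $(seq 1 2000); do
  kubectl create configmap "inflate-${i}" -n default \
    --from-literal=payload="$(head -c 375000 /dev/urandom | base64 -w0)"
done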

riuvshyn commented 4 months ago

Just sharing here, as it may help someone struggling with the same issues: we got the best etcd performance using an AWS instance type with NVMe-based SSDs, dedicated exclusively to etcd data. This got rid of the rke2-server panics we had initially.
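For anyone replicating this, a sketch of dedicating a local NVMe device to the RKE2 db path (the device name and filesystem are assumptions, and instance-store contents do not survive an instance stop):

# Format the local NVMe device and mount it over RKE2's db directory
# before rke2-server starts for the first time.
mkfs.ext4 -L etcd-data /dev/nvme1n1
mkdir -p /var/lib/rancher/rke2/server/db
mount /dev/nvme1n1 /var/lib/rancher/rke2/server/db
echo 'LABEL=etcd-data /var/lib/rancher/rke2/server/db ext4 defaults,noatime 0 2' >> /etc/fstab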

I am also looking forward to a fix for https://github.com/rancher/rke2/issues/6325, as I believe it caused rke2-server failures when snapshot creation was initiated:

Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=error msg="etcd-snapshot error ID 48259: snapshot save already in progress"
Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=error msg="Sending HTTP 500 response to 127.0.0.1:58810: etcd-snapshot error ID 48259"
Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=debug msg="Wrote ping"
Jul 25 14:10:46 ip-172-23-72-134 systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Proxy done" err="context canceled" url="wss://172.23.70.107:9345/v1-rke2/connect"
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Shutting down k3s.cattle.io/v1, Kind=Addon workers"
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Shutting down /v1, Kind=ServiceAccount workers"
Jul 25 14:10:46 ip-172-23-72-134 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE

Especially when the DB is quite large and a snapshot takes about a minute to create, rke2-server keeps trying to initiate a new snapshot operation while the previous one is still running...

As a side note, we wanted to migrate to Rancher-managed control-plane nodes with machine_pools, but apparently there is no way to configure a custom volume mount with the current version of the Terraform provider's rancher2_machine_config_v2 resource. It is not even possible to tune IOPS / throughput for the root volume, so we had to stick with custom control-plane nodes in an AWS ASG.

I am going to close the ticket, as the main issue is now addressed.

Thanks @brandond for helping with this one 🙇🏽 🍻

brandond commented 4 months ago

We got the best etcd performance using an AWS instance type with NVMe-based SSDs, dedicated exclusively to etcd data.

This has always been a best practice. etcd should basically never be run on rotational storage, and if your workload is IO intensive you definitely need to put the etcd datastore on a separate disk. etcd calls fsync to flush all outstanding disk IO before acknowledging any datastore write; if your workload is also writing to disk then the workload writes will also need to be synced before the etcd transaction can be completed.

Especially when the DB is quite large and a snapshot takes about a minute to create, rke2-server keeps trying to initiate a new snapshot operation while the previous one is still running...

rke2 doesn't retry the snapshot operations. Additionally, your logs show that something external is actively stopping the rke2-server service. I suspect that both of these are being done by rancher-system-agent. You might check the logs over there to see why it's doing that.

riuvshyn commented 4 months ago

@brandond I have tried to use custom etcd data/wal dirs on an rke2 standalone cluster and hit an issue that blocks snapshot creation:

rke2 etcd-snapshot save
...
FATA[0000] etcd database not found in /var/lib/rancher/rke2/server

On the Rancher-managed cluster it works fine. The Rancher config is almost identical, so I am a bit confused why it does not work with rke2 standalone.

It doesn't look like there is a flag to point the rke2 CLI at a custom etcd data path either.

Also, I did check /var/lib/rancher/rke2/server/db/etcd/config, and it is actually pointing at the right data-dir (/opt/kubernetes/etcd/data) and wal-dir (/opt/kubernetes/etcd/wal).
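For context, a hypothetical reconstruction of the config that produces that layout, assuming the directories were set via RKE2's etcd-arg passthrough:

# /etc/rancher/rke2/config.yaml
etcd-arg:
  - "data-dir=/opt/kubernetes/etcd/data"
  - "wal-dir=/opt/kubernetes/etcd/wal"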

Do you have any idea how I can configure the standalone rke2 CLI to create a snapshot with a custom etcd data path? 🙏🏽

riuvshyn commented 4 months ago

Oh, I think it might be happening because I am using rke2 standalone 1.27.6; going to try with 1.28.x.

Update: it worked fine with 1.28.11.