rancher / rke2

https://docs.rke2.io/
Apache License 2.0

RKE2 can't restore snapshot, fail with "snap: snapshot file doesn't exist" #3719

Status: Closed (ldominguezBold closed this issue 1 year ago)

ldominguezBold commented 1 year ago

Environmental Info: RKE2 Version:

rke2 version v1.24.8+rke2r1 (b061438ce1f04518d980aac842b62f92095d793d)
go version go1.18.1b7

Node(s) CPU architecture, OS, and Version:

Linux myhost 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

I tried to restore a snapshot with the following commands:

rke2-killall.sh
rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-******
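As a pre-restore sanity check, it can help to confirm that the snapshot file passed to `--cluster-reset-restore-path` actually exists and is non-empty. A minimal sketch (the actual snapshot name is redacted as `******` in the report, so the path below is a placeholder):

```python
import os

# Placeholder path: the real snapshot name is redacted in this report.
SNAPSHOT = "/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-<name>"

def snapshot_usable(path: str) -> bool:
    """A snapshot file must exist and be non-empty to be restorable."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

if __name__ == "__main__":
    status = "ok" if snapshot_usable(SNAPSHOT) else "missing or empty"
    print(f"{SNAPSHOT}: {status}")
```

Note this only rules out a missing or truncated snapshot file; as the logs below show, the failure here happens while reconciling the existing datastore, not while reading the snapshot itself.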

Steps To Reproduce:

Expected behavior:

Expected the restore to complete successfully, as described in the docs.

Actual behavior:

The process ends with an error:

{"level":"warn","ts":"2022-12-20T18:51:02.995Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000029-000000000034ead1.wal.broken"}
{"level":"info","ts":"2022-12-20T18:51:03.819Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":5900644,"snapshot-size":"8.1 kB"}
{"level":"warn","ts":"2022-12-20T18:51:03.821Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":5900644,"snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/00000000005a0964.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-12-20T18:51:03.821Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.4-k3s1/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.4-k3s1/embed/etcd.go:245\ngithub.com/k3s-io/k3s/pkg/daemons/executor.(*Embedded).ETCD\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/executor/etcd.go:30\ngithub.com/k3s-io/k3s/pkg/etcd.(*ETCD).StartEmbeddedTemporary\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/etcd/etcd.go:862\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).reconcileEtcd\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cluster/bootstrap.go:523\ngithub.com/k3s-io/k3s/pkg/cluster.(*Cluster).Bootstrap\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cluster/bootstrap.go:63\ngithub.com/k3s-io/k3s/pkg/daemons/control.prepare\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/control/server.go:265\ngithub.com/k3s-io/k3s/pkg/daemons/control.Server\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/control/server.go:34\ngithub.com/k3s-io/k3s/pkg/server.StartServer\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/server/server.go:58\ngithub.com/k3s-io/k3s/pkg/cli/server.run\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:455\ngithub.com/k3s-io/k3s/pkg/cli/server.RunWithControllers\n\t/go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:46\ngithub.com/rancher/rke2/pkg/rke2.Server\n\t/source/pkg/rke2/rke2.go:122\ngithub.com/rancher/rke2/pkg/cli/cmds.ServerRun\n\t/source/pkg/cli/cmds/server.go:161\ngithub.com/urfave/cli.HandleAction\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:524\ngithub.com/urfave/cli.Command.Run\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.9/command.go:173\ngithub.com/urfave/cli.(*App).Run\n\t/go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:277\nmain.main\n\t/source/main.go:22\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc001172000, {0xc000fce480, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.19.0/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*Logger).Panic(0xc001321aa0?, {0x69001fd?, 0x0?}, {0xc000fce480, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.19.0/logger.go:230 +0x59
go.etcd.io/etcd/server/v3/etcdserver.NewServer({{0xc001321aa0, 0x25}, {0x0, 0x0}, {0x0, 0x0}, {0xc000a74990, 0x1, 0x1}, {0xc000ccde60, ...}, ...})
    /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.4-k3s1/etcdserver/server.go:515 +0x1d85
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0008ab100)
    /go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.4-k3s1/embed/etcd.go:245 +0x1158
github.com/k3s-io/k3s/pkg/daemons/executor.(*Embedded).ETCD(_, {_, _}, {{{0xc000677de8, 0x15}, {0x0, 0x0}, {0x0, 0x0}}, {0xc00047f410, ...}, ...}, ...)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/executor/etcd.go:30 +0x10e
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).StartEmbeddedTemporary(0xc000bba190, {0x745bdd0, 0xc000cd0e40})
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/etcd/etcd.go:862 +0x53c
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).reconcileEtcd(0xc00062c3f0, {0x745bdd0, 0xc000cd0d40})
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cluster/bootstrap.go:523 +0x2e5
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Bootstrap(0xc00062c3f0, {0x745bdd0, 0xc000cd0d40}, 0x0)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cluster/bootstrap.go:63 +0x1e5
github.com/k3s-io/k3s/pkg/daemons/control.prepare({0x745bdd0, 0xc000cd0d40}, 0xc00107ca08)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/control/server.go:265 +0x245
github.com/k3s-io/k3s/pkg/daemons/control.Server({0x745bdd0?, 0xc000cd0d40}, 0xc00107ca08)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/daemons/control/server.go:34 +0xff
github.com/k3s-io/k3s/pkg/server.StartServer({0x745bdd0?, 0xc000cd0d40}, 0xc00107ca00, 0x1?)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/server/server.go:58 +0x93
github.com/k3s-io/k3s/pkg/cli/server.run(0xc00116b1e0, 0xa7eb2a0, {0x0, 0x0, 0x9?}, {0xc0012be870, 0x0, 0x0?})
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:455 +0x29c5
github.com/k3s-io/k3s/pkg/cli/server.RunWithControllers(...)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:46
github.com/rancher/rke2/pkg/rke2.Server(_, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {{0x0, 0x0}, {0x0, ...}, ...}, ...})
    /source/pkg/rke2/rke2.go:122 +0x525
github.com/rancher/rke2/pkg/cli/cmds.ServerRun(0xc00116b1e0?)
    /source/pkg/cli/cmds/server.go:161 +0xa5
github.com/urfave/cli.HandleAction({0x57e8860?, 0x6adfed8?}, 0x6?)
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:524 +0xa8
github.com/urfave/cli.Command.Run({{0x67f8017, 0x6}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x6845f79, 0x15}, {0xc00054eac8, ...}, ...}, ...)
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/command.go:173 +0x652
github.com/urfave/cli.(*App).Run(0xc0011701c0, {0xc00111f180, 0x6, 0x8})
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:277 +0x8a7
main.main()
    /source/main.go:22 +0xb9b
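For context on the error itself: etcd looks up the v3 database snapshot by a filename derived from the v2 snapshot index, zero-padded to 16 hexadecimal digits, which is why the `snapshot-index` of 5900644 in the log maps to the missing `00000000005a0964.snap.db`. A quick check of that mapping:

```python
# etcd expects a v3 database snapshot file named after the v2 snapshot
# index, formatted as 16 zero-padded hex digits with a .snap.db suffix.
snapshot_index = 5900644  # "snapshot-index" from the log above

expected = f"{snapshot_index:016x}.snap.db"
print(expected)  # 00000000005a0964.snap.db — the file etcd failed to find
```

So the datastore's WAL references a snapshot whose companion `.snap.db` file is absent from `member/snap/`, and recovery panics.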

Additional context / logs:

WARN[0000] not running in CIS mode                      
INFO[0000] Running temporary containerd /var/lib/rancher/rke2/bin/containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd 
WARN[0000] containerd config version `1` has been deprecated and will be removed in containerd v2.0, please switch to version `2`, see https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#version-header 
INFO[2022-12-20T18:50:52.581714121Z] starting containerd                           revision=d55db719497a2cb86cfcfc15ceb60f0df683ec6d version=v1.6.8-k3s1
INFO[2022-12-20T18:50:52.612645775Z] loading plugin "io.containerd.content.v1.content"...  type=io.containerd.content.v1
INFO[2022-12-20T18:50:52.612713771Z] loading plugin "io.containerd.snapshotter.v1.aufs"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.614632151Z] loading plugin "io.containerd.snapshotter.v1.btrfs"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.614847280Z] skip loading plugin "io.containerd.snapshotter.v1.btrfs"...  error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.btrfs (ext4) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.614913294Z] loading plugin "io.containerd.snapshotter.v1.devmapper"...  type=io.containerd.snapshotter.v1
WARN[2022-12-20T18:50:52.614932586Z] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
INFO[2022-12-20T18:50:52.614946827Z] loading plugin "io.containerd.snapshotter.v1.native"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.614973023Z] loading plugin "io.containerd.snapshotter.v1.overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.615080466Z] loading plugin "io.containerd.snapshotter.v1.fuse-overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.615109034Z] loading plugin "io.containerd.snapshotter.v1.stargz"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.615511036Z] loading plugin "io.containerd.snapshotter.v1.zfs"...  type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.615669794Z] skip loading plugin "io.containerd.snapshotter.v1.zfs"...  error="path /var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2022-12-20T18:50:52.615693669Z] loading plugin "io.containerd.metadata.v1.bolt"...  type=io.containerd.metadata.v1
WARN[2022-12-20T18:50:52.615715065Z] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2022-12-20T18:50:52.615729517Z] metadata content store policy set             policy=shared
INFO[2022-12-20T18:50:52.615960496Z] loading plugin "io.containerd.differ.v1.walking"...  type=io.containerd.differ.v1
INFO[2022-12-20T18:50:52.615983243Z] loading plugin "io.containerd.event.v1.exchange"...  type=io.containerd.event.v1
INFO[2022-12-20T18:50:52.615999810Z] loading plugin "io.containerd.gc.v1.scheduler"...  type=io.containerd.gc.v1
INFO[2022-12-20T18:50:52.616030344Z] loading plugin "io.containerd.service.v1.introspection-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616051740Z] loading plugin "io.containerd.service.v1.containers-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616070531Z] loading plugin "io.containerd.service.v1.content-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616087027Z] loading plugin "io.containerd.service.v1.diff-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616105255Z] loading plugin "io.containerd.service.v1.images-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616122077Z] loading plugin "io.containerd.service.v1.leases-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616139284Z] loading plugin "io.containerd.service.v1.namespaces-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616157263Z] loading plugin "io.containerd.service.v1.snapshots-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616174654Z] loading plugin "io.containerd.runtime.v1.linux"...  type=io.containerd.runtime.v1
INFO[2022-12-20T18:50:52.616214823Z] loading plugin "io.containerd.runtime.v2.task"...  type=io.containerd.runtime.v2
INFO[2022-12-20T18:50:52.616265590Z] loading plugin "io.containerd.monitor.v1.cgroups"...  type=io.containerd.monitor.v1
INFO[2022-12-20T18:50:52.616616794Z] loading plugin "io.containerd.service.v1.tasks-service"...  type=io.containerd.service.v1
INFO[2022-12-20T18:50:52.616647574Z] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616665712Z] loading plugin "io.containerd.internal.v1.restart"...  type=io.containerd.internal.v1
INFO[2022-12-20T18:50:52.616746972Z] loading plugin "io.containerd.grpc.v1.containers"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616766032Z] loading plugin "io.containerd.grpc.v1.content"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616785039Z] loading plugin "io.containerd.grpc.v1.diff"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616801815Z] loading plugin "io.containerd.grpc.v1.events"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616816051Z] loading plugin "io.containerd.grpc.v1.healthcheck"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616834896Z] loading plugin "io.containerd.grpc.v1.images"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616851246Z] loading plugin "io.containerd.grpc.v1.leases"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616867402Z] loading plugin "io.containerd.grpc.v1.namespaces"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616884918Z] loading plugin "io.containerd.internal.v1.opt"...  type=io.containerd.internal.v1
INFO[2022-12-20T18:50:52.616941292Z] loading plugin "io.containerd.grpc.v1.snapshots"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616961512Z] loading plugin "io.containerd.grpc.v1.tasks"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616980152Z] loading plugin "io.containerd.grpc.v1.version"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.616996886Z] loading plugin "io.containerd.tracing.processor.v1.otlp"...  type=io.containerd.tracing.processor.v1
INFO[2022-12-20T18:50:52.617015762Z] skip loading plugin "io.containerd.tracing.processor.v1.otlp"...  error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
INFO[2022-12-20T18:50:52.617030423Z] loading plugin "io.containerd.internal.v1.tracing"...  type=io.containerd.internal.v1
ERRO[2022-12-20T18:50:52.617388146Z] failed to initialize a tracing processor "otlp"  error="no OpenTelemetry endpoint: skip plugin"
INFO[2022-12-20T18:50:52.617468809Z] loading plugin "io.containerd.grpc.v1.cri"...  type=io.containerd.grpc.v1
INFO[2022-12-20T18:50:52.618154298Z] Start cri plugin with config {PluginConfig:{ContainerdConfig:{Snapshotter:overlayfs DefaultRuntimeName:runc DefaultRuntime:{Type: Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[] PrivilegedWithoutHostDevices:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0} UntrustedWorkloadRuntime:{Type: Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[] PrivilegedWithoutHostDevices:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0} Runtimes:map[runc:{Type:io.containerd.runc.v2 Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[SystemdCgroup:true] PrivilegedWithoutHostDevices:false BaseRuntimeSpec: NetworkPluginConfDir: NetworkPluginMaxConfNum:0}] NoPivot:false DisableSnapshotAnnotations:true DiscardUnpackedLayers:false IgnoreRdtNotEnabledErrors:false} CniConfig:{NetworkPluginBinDir:/opt/cni/bin NetworkPluginConfDir:/etc/cni/net.d NetworkPluginMaxConfNum:1 NetworkPluginConfTemplate: IPPreference:} Registry:{ConfigPath: Mirrors:map[] Configs:map[] Auths:map[] Headers:map[]} ImageDecryption:{KeyModel:node} DisableTCPService:true StreamServerAddress:127.0.0.1 StreamServerPort:10010 StreamIdleTimeout:4h0m0s EnableSelinux:false SelinuxCategoryRange:1024 SandboxImage:index.docker.io/rancher/pause:3.6 StatsCollectPeriod:10 SystemdCgroup:false EnableTLSStreaming:false X509KeyPairStreaming:{TLSCertFile: TLSKeyFile:} MaxContainerLogLineSize:16384 DisableCgroup:false DisableApparmor:false RestrictOOMScoreAdj:false MaxConcurrentDownloads:3 DisableProcMount:false UnsetSeccompProfile: TolerateMissingHugetlbController:true DisableHugetlbController:true DeviceOwnershipFromSecurityContext:false IgnoreImageDefinedVolumes:false NetNSMountsUnderStateDir:false EnableUnprivilegedPorts:true EnableUnprivilegedICMP:true} ContainerdRootDir:/var/lib/rancher/rke2/agent/containerd ContainerdEndpoint:/run/k3s/containerd/containerd.sock 
RootDir:/var/lib/rancher/rke2/agent/containerd/io.containerd.grpc.v1.cri StateDir:/run/k3s/containerd/io.containerd.grpc.v1.cri} 
INFO[2022-12-20T18:50:52.618217396Z] Connect containerd service                   
INFO[2022-12-20T18:50:52.618262659Z] Get image filesystem path "/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs" 
INFO[2022-12-20T18:50:52.618831614Z] Start subscribing containerd event           
INFO[2022-12-20T18:50:52.618903881Z] Start recovering state                       
INFO[2022-12-20T18:50:52.618960483Z] serving...                                    address=/run/k3s/containerd/containerd.sock.ttrpc
INFO[2022-12-20T18:50:52.619020012Z] serving...                                    address=/run/k3s/containerd/containerd.sock
INFO[2022-12-20T18:50:52.619048144Z] containerd successfully booted in 0.037917s  
INFO[2022-12-20T18:50:52.860926630Z] Start event monitor                          
INFO[2022-12-20T18:50:52.860961705Z] Start snapshots syncer                       
INFO[2022-12-20T18:50:52.860975079Z] Start cni network conf syncer for default    
INFO[2022-12-20T18:50:52.860984788Z] Start streaming server                       
INFO[0001] Containerd is now running                    
INFO[0001] Running temporary kubelet /var/lib/rancher/rke2/bin/kubelet --cgroup-driver=cgroupfs --container-runtime=remote --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests 
Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --container-runtime has been deprecated, will be removed in 1.27 as the only valid value is 'remote'
Flag --eviction-hard has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --eviction-minimum-reclaim has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --fail-swap-on has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
I1220 18:50:53.664478   11077 server.go:399] "Kubelet version" kubeletVersion="v1.24.6+rke2r1"
I1220 18:50:53.664539   11077 server.go:401] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1220 18:50:53.664948   11077 server.go:562] "Standalone mode, no API client"
I1220 18:50:53.665091   11077 server.go:618] "Failed to get the kubelet's cgroup. Kubelet system container metrics may be missing." err="cpu and memory cgroup hierarchy not unified.  cpu: /user.slice, memory: /user.slice/user-1000.slice/session-6.scope"
I1220 18:50:53.665121   11077 server.go:624] "Failed to get the container runtime's cgroup. Runtime system container metrics may be missing." err="cpu and memory cgroup hierarchy not unified.  cpu: /user.slice, memory: /user.slice/user-1000.slice/session-6.scope"
I1220 18:50:53.671051   11077 server.go:450] "No api server defined - no events will be sent to API server"
I1220 18:50:53.671091   11077 server.go:648] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
I1220 18:50:53.671421   11077 container_manager_linux.go:262] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I1220 18:50:53.671530   11077 container_manager_linux.go:267] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
I1220 18:50:53.671568   11077 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
I1220 18:50:53.671584   11077 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
I1220 18:50:53.671621   11077 state_mem.go:36] "Initialized new in-memory state store"
I1220 18:50:53.681827   11077 kubelet.go:382] "Kubelet is running in standalone mode, will skip API server sync"
I1220 18:50:53.681884   11077 kubelet.go:267] "Adding static pod path" path="/var/lib/rancher/rke2/agent/pod-manifests"
I1220 18:50:53.683260   11077 kuberuntime_manager.go:239] "Container runtime initialized" containerRuntime="containerd" version="v1.6.8-k3s1" apiVersion="v1"
I1220 18:50:53.683447   11077 volume_host.go:75] "KubeClient is nil. Skip initialization of CSIDriverLister"
W1220 18:50:53.683618   11077 csi_plugin.go:189] kubernetes.io/csi: kubeclient not set, assuming standalone kubelet
W1220 18:50:53.683637   11077 csi_plugin.go:268] Skipping CSINode initialization, kubelet running in standalone mode
I1220 18:50:53.683846   11077 server.go:1181] "Started kubelet"
I1220 18:50:53.683909   11077 kubelet.go:1408] "No API server defined - no node status update will be sent"
I1220 18:50:53.684024   11077 server.go:150] "Starting to listen" address="0.0.0.0" port=10250
I1220 18:50:53.684026   11077 server.go:177] "Starting to listen read-only" address="0.0.0.0" port=10255
E1220 18:50:53.684215   11077 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/rancher/rke2/agent/containerd/io.containerd.snapshotter.v1.overlayfs"
E1220 18:50:53.684246   11077 kubelet.go:1298] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I1220 18:50:53.685272   11077 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I1220 18:50:53.685349   11077 server.go:410] "Adding debug handlers to kubelet server"
I1220 18:50:53.685403   11077 volume_manager.go:289] "Starting Kubelet Volume Manager"
I1220 18:50:53.685450   11077 desired_state_of_world_populator.go:145] "Desired state populator starts to run"
I1220 18:50:53.718964   11077 kubelet_network_linux.go:76] "Initialized protocol iptables rules." protocol=IPv4
I1220 18:50:53.753341   11077 cpu_manager.go:213] "Starting CPU manager" policy="none"
I1220 18:50:53.753362   11077 cpu_manager.go:214] "Reconciling" reconcilePeriod="10s"
I1220 18:50:53.753376   11077 state_mem.go:36] "Initialized new in-memory state store"
I1220 18:50:53.753545   11077 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I1220 18:50:53.753560   11077 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
I1220 18:50:53.753570   11077 policy_none.go:49] "None policy: Start"
I1220 18:50:53.754663   11077 memory_manager.go:168] "Starting memorymanager" policy="None"
I1220 18:50:53.754726   11077 state_mem.go:35] "Initializing new in-memory state store"
I1220 18:50:53.755228   11077 state_mem.go:75] "Updated machine memory state"
I1220 18:50:53.758071   11077 manager.go:610] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
E1220 18:50:53.758265   11077 container_manager_linux.go:510] "Failed to find cgroups of kubelet" err="cpu and memory cgroup hierarchy not unified.  cpu: /user.slice, memory: /user.slice/user-1000.slice/session-6.scope"
I1220 18:50:53.758634   11077 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
I1220 18:50:53.848624   11077 kubelet_network_linux.go:76] "Initialized protocol iptables rules." protocol=IPv6
I1220 18:50:53.848656   11077 status_manager.go:157] "Kubernetes client is nil, not starting status manager"
I1220 18:50:53.848671   11077 kubelet.go:1986] "Starting kubelet main sync loop"
E1220 18:50:53.848729   11077 kubelet.go:2010] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
I1220 18:50:53.949834   11077 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="38d693b24c553c6f60c8f7d4882ee163f5a054ce3f19e6399cc4e1c0c8c4f392"
I1220 18:50:53.949893   11077 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="5ecbe7548f63db043ca51fd6e34f8c06819b506a18444f1b6b69ce6014a779d4"
I1220 18:50:53.949922   11077 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="23d223c7eb5da3220f4d418917f76097d965ee4abc1b1a058a60a89b36e49cfd"
I1220 18:50:53.988356   11077 reconciler.go:159] "Reconciler: start to sync state"
INFO[0010] Static pod cleanup completed successfully    
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json: no such file or directory 
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json: no such file or directory 
INFO[0010] Starting rke2 v1.24.8+rke2r1 (b061438ce1f04518d980aac842b62f92095d793d) 
INFO[0010] Managed etcd cluster bootstrap already complete and initialized 
INFO[0010] Starting temporary etcd to reconcile with datastore 
{"level":"info","ts":"2022-12-20T18:51:02.992Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
{"level":"info","ts":"2022-12-20T18:51:02.993Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
{"level":"info","ts":"2022-12-20T18:51:02.993Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.18.1b7","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"myhost-8fada3bc","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp/member","force-new-cluster":true,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2022-12-20T18:51:02.995Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"2.034712ms"}
github.com/k3s-io/k3s/pkg/cli/server.run(0xc00116b1e0, 0xa7eb2a0, {0x0, 0x0, 0x9?}, {0xc0012be870, 0x0, 0x0?})
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:455 +0x29c5
github.com/k3s-io/k3s/pkg/cli/server.RunWithControllers(...)
    /go/pkg/mod/github.com/k3s-io/k3s@v1.24.8-0.20221104203806-c113444dacbb/pkg/cli/server/server.go:46
github.com/rancher/rke2/pkg/rke2.Server(_, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {{0x0, 0x0}, {0x0, ...}, ...}, ...})
    /source/pkg/rke2/rke2.go:122 +0x525
github.com/rancher/rke2/pkg/cli/cmds.ServerRun(0xc00116b1e0?)
    /source/pkg/cli/cmds/server.go:161 +0xa5
github.com/urfave/cli.HandleAction({0x57e8860?, 0x6adfed8?}, 0x6?)
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:524 +0xa8
github.com/urfave/cli.Command.Run({{0x67f8017, 0x6}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x6845f79, 0x15}, {0xc00054eac8, ...}, ...}, ...)
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/command.go:173 +0x652
github.com/urfave/cli.(*App).Run(0xc0011701c0, {0xc00111f180, 0x6, 0x8})
    /go/pkg/mod/github.com/urfave/cli@v1.22.9/app.go:277 +0x8a7
main.main()
    /source/main.go:22 +0xb9b
brandond commented 1 year ago

your command shows: --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-******

but the error says: "snapshot-file-path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/00000000005a0964.snap.db

Can you confirm that you are providing the path to an actual scheduled or on-demand etcd snapshot, and not one of the temporary WAL checkpoint snapshot files?
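To illustrate the distinction brandond is drawing, here is a small helper (not part of rke2, purely illustrative) that classifies a path using the default RKE2 directory layout: restorable scheduled/on-demand snapshots live under `server/db/snapshots/`, while files under `member/snap/` are etcd's internal WAL checkpoint artifacts and cannot be passed to `--cluster-reset-restore-path`:

```shell
# Illustrative helper: classify a path as a restorable etcd snapshot vs an
# internal WAL checkpoint file, based on where each lives in a default
# RKE2 install. Not an rke2 command, just a sanity check.
looks_restorable() {
  case "$1" in
    /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-*) echo yes ;;
    */member/snap/*) echo no ;;   # internal etcd checkpoint, not restorable
    *) echo unknown ;;
  esac
}

# looks_restorable /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-on-demand-123  -> yes
# looks_restorable /var/lib/rancher/rke2/server/db/etcd/member/snap/00000000005a0964.snap.db -> no
```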

ldominguezBold commented 1 year ago


Yes, I can confirm it is one of the scheduled snapshots, not a WAL checkpoint.

ldominguezBold commented 1 year ago

That file (without the .db extension) even exists in that directory.

brandond commented 1 year ago

Can you try updating to the most recent v1.24 release and running the same process? If it continues to fail, you may want to run rke2-killall.sh and remove the old database files (/var/lib/rancher/rke2/server/db/etcd) prior to running the restore. It seems like the current etcd datastore may be corrupt, and that is preventing the restore from completing.

Can you also confirm that rke2 IS NOT currently running when you are doing this?
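The recovery sequence suggested above can be sketched as follows. This is a dry run that only prints the commands rather than executing them, since they stop the node and move the datastore; the snapshot file name is a placeholder, not taken from this issue, and the paths assume a default RKE2 install:

```shell
# Dry-run sketch of the suggested recovery sequence (prints only).
# Run the printed commands manually after verifying paths on your node.
DB_DIR="/var/lib/rancher/rke2/server/db"
SNAPSHOT="$DB_DIR/snapshots/etcd-snapshot-PLACEHOLDER"  # substitute your real snapshot

printf '%s\n' \
  "rke2-killall.sh" \
  "mv $DB_DIR/etcd $DB_DIR/etcd.old" \
  "rke2 server --cluster-reset --cluster-reset-restore-path=$SNAPSHOT"
```

Moving the old etcd directory aside (rather than deleting it) keeps a copy to fall back on if the restore still fails.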

ldominguezBold commented 1 year ago

I will try removing the etcd data dir; I had already tried the other actions.

ldominguezBold commented 1 year ago

The solution was to remove the old etcd data directory, thanks.