rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Unable to cluster-reset-restore on node previously deleted from cluster with annotation #5033

Open · brandond opened 7 months ago

brandond commented 7 months ago

RKE2 tracking issue for

VestigeJ commented 5 months ago

@brandond and I chatted about this one going back to the working pile for a future milestone last week.

brandond commented 5 months ago

@VestigeJ can you confirm what you're seeing in the most recent release?

VestigeJ commented 3 months ago

It is probably to be expected that the restore behavior has changed since the original issue, but despite the systemd service being up and running, the node remains unstable/unusable(?): calls to the API server are refused/ignored, and the etcd server also refuses a direct connection.

● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2024-03-19 18:51:09 UTC; 10s ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 3417 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 3419 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3420 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3421 (rke2)
      Tasks: 88
     CGroup: /system.slice/rke2-server.service
             ├─ 3421 "/usr/local/bin/rke2 server"
             ├─ 3441 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
             ├─ 3451 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock "--eviction-hard=imagefs.available<5%,nodefs.available<5%" --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --healthz-bind-address=127.0.0.1 --hostname-override=ip-1-1-1-164 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --node-ip=1.1.1.164 --node-labels= --pod-infra-container-image=index.docker.io/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key
             ├─ 3484 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id e691b96434855838b31f9bbe24e711de7a11657a8f59f9f05dcce34db993573e -address /run/k3s/containerd/containerd.sock
             ├─ 3630 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 55992be7f13d8acb699e18f0ea8bda8e9ebf0c591851580aa91bb3bbe778b8ac -address /run/k3s/containerd/containerd.sock
             ├─ 3737 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 4e0331dfdd736d69c691a6e3379ba498b316690b6597314d5c5c3d77ce17b56b -address /run/k3s/containerd/containerd.sock
             ├─ 3783 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1e6c1b49c189a99ed7cd7d3033268a7879e639c7925db390868addafb62169a0 -address /run/k3s/containerd/containerd.sock
             └─ 4032 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id bc4774fc8a3549e71ec5de10e2496becc7a4a4779cf5cf6b8df9fde56f9f2597 -address /run/k3s/containerd/containerd.sock

level=debug msg="Node ip-1-1-1-74 is changing etcd status condition"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-204 IP 1.1.1.204/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-204 IP 1.1.1.60/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-74  IP 1.1.1.74/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-74  IP 1.1.1.233/32"
level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
level=info msg="Stopped tunnel to 1.1.1.233:9345"
level=info msg="Stopped tunnel to 1.1.1.60:9345"
level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
level=debug msg="Wrote ping"

$ get_etcd

{"level":"warn","ts":"2024-03-19T18:52:17.064Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000360a80/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded

$ kgn -v9

I0319 19:00:04.670783   19405 loader.go:395] Config loaded from file:  /etc/rancher/rke2/rke2.yaml
I0319 19:00:04.678590   19405 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.29.2+rke2r1 (linux/amd64) kubernetes/4b8e819" 'https://127.0.0.1:6443/api/v1/nodes?limit=500'
I0319 19:00:04.678933   19405 round_trippers.go:508] HTTP Trace: Dial to tcp:127.0.0.1:6443 failed: dial tcp 127.0.0.1:6443: connect: connection refused
I0319 19:00:04.678977   19405 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/nodes?limit=500  in 0 milliseconds
I0319 19:00:04.678996   19405 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 0 ms Duration 0 ms
I0319 19:00:04.679005   19405 round_trippers.go:577] Response Headers:
I0319 19:00:04.679084   19405 helpers.go:264] Connection error: Get https://127.0.0.1:6443/api/v1/nodes?limit=500: dial tcp 127.0.0.1:6443: connect: connection refused
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
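For reference, the two refused endpoints above can also be probed with standard tooling. A minimal sketch, assuming the default RKE2 file layout (the kubeconfig path matches the one kubectl loaded above; the etcd client certificate paths are assumptions based on RKE2's default data-dir layout):

```shell
# Probe the apiserver on 6443 with the kubeconfig logged above
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes

# Probe etcd directly on 2379 using the server's etcd client certs
# (cert/key paths are assumed from the default RKE2 layout)
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/client.key \
  endpoint health
```

Both commands failing with connection refused, as in the traces above, would point at the apiserver and etcd processes themselves rather than at client configuration.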

brandond commented 3 months ago

Can you grab the etcd pod logs?
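On an RKE2 server, etcd runs as a static pod under the bundled containerd, so its logs can be pulled either through crictl or straight from the pod log directory. A sketch, assuming the default paths (the crictl config location follows from the containerd socket shown in the service status above; the log glob is an assumption based on the standard kubelet pod-log layout):

```shell
# Point crictl at RKE2's containerd (config path assumed from default layout)
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml

# Find the etcd container and dump its logs
/var/lib/rancher/rke2/bin/crictl ps -a --name etcd
/var/lib/rancher/rke2/bin/crictl logs <container-id>

# Or read the static pod's log files directly from disk
cat /var/log/pods/kube-system_etcd-*/etcd/*.log
```

Reading the files from disk also works when the container is crash-looping faster than crictl can attach.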

brandond commented 2 months ago

Have not done any further investigation.