Open · brandond opened this issue 7 months ago
@brandond and I chatted last week about moving this one back to the working pile for a future milestone.
@VestigeJ can you confirm what you're seeing in the most recent release?
It's probably to be expected that the restore behavior has changed since the original issue, but despite the systemd service being up and running, the node remains unstable/unusable: calls to the API server are refused or ignored, and the etcd server also refuses a direct connection.
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2024-03-19 18:51:09 UTC; 10s ago
Docs: https://github.com/rancher/rke2#readme
Process: 3417 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 3419 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 3420 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 3421 (rke2)
Tasks: 88
CGroup: /system.slice/rke2-server.service
├─ 3421 "/usr/local/bin/rke2 server"
├─ 3441 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
├─ 3451 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock "--eviction-hard=imagefs.available<5%,nodefs.available<5%" --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --healthz-bind-address=127.0.0.1 --hostname-override=ip-1-1-1-164 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --node-ip=1.1.1.164 --node-labels= --pod-infra-container-image=index.docker.io/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key
├─ 3484 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id e691b96434855838b31f9bbe24e711de7a11657a8f59f9f05dcce34db993573e -address /run/k3s/containerd/containerd.sock
├─ 3630 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 55992be7f13d8acb699e18f0ea8bda8e9ebf0c591851580aa91bb3bbe778b8ac -address /run/k3s/containerd/containerd.sock
├─ 3737 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 4e0331dfdd736d69c691a6e3379ba498b316690b6597314d5c5c3d77ce17b56b -address /run/k3s/containerd/containerd.sock
├─ 3783 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1e6c1b49c189a99ed7cd7d3033268a7879e639c7925db390868addafb62169a0 -address /run/k3s/containerd/containerd.sock
└─ 4032 /var/lib/rancher/rke2/data/v1.29.2-rke2r1-8bfebc2d9089/bin/containerd-shim-runc-v2 -namespace k8s.io -id bc4774fc8a3549e71ec5de10e2496becc7a4a4779cf5cf6b8df9fde56f9f2597 -address /run/k3s/containerd/containerd.sock
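Note that active (running) above only means the rke2 supervisor process is up; a minimal sketch for checking whether the control-plane ports actually accept connections (assuming the default RKE2 ports 9345/6443/2379):

# Probe the supervisor, apiserver, and etcd client ports on localhost.
# Ports are the RKE2 defaults; adjust if overridden in config.yaml.
for port in 9345 6443 2379; do
  if timeout 2 bash -c "</dev/tcp/127.0.0.1/${port}" 2>/dev/null; then
    echo "port ${port}: open"
  else
    echo "port ${port}: refused/closed"
  fi
done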
level=debug msg="Node ip-1-1-1-74 is changing etcd status condition"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-204 IP 1.1.1.204/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-204 IP 1.1.1.60/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-74 IP 1.1.1.74/32"
level=debug msg="Tunnel server egress proxy updating Node ip-1-1-1-74 IP 1.1.1.233/32"
level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
level=info msg="Stopped tunnel to 1.1.1.233:9345"
level=info msg="Stopped tunnel to 1.1.1.60:9345"
level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
level=debug msg="Wrote ping"
$ get_etcd   # QA helper alias that queries the local etcd endpoint
{"level":"warn","ts":"2024-03-19T18:52:17.064Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000360a80/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
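One way to reproduce that direct etcd check without the helper is to hit etcd's health endpoint with RKE2's etcd client certs — a sketch assuming the default RKE2 TLS paths:

# Query etcd's /health endpoint directly over TLS.
# Cert paths assume a default RKE2 server install.
ETCD_TLS=/var/lib/rancher/rke2/server/tls/etcd
curl -s --cacert "${ETCD_TLS}/server-ca.crt" \
  --cert "${ETCD_TLS}/server-client.crt" \
  --key "${ETCD_TLS}/server-client.key" \
  https://127.0.0.1:2379/health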
$ kgn -v9   # alias for: kubectl get nodes -v9
I0319 19:00:04.670783 19405 loader.go:395] Config loaded from file: /etc/rancher/rke2/rke2.yaml
I0319 19:00:04.678590 19405 round_trippers.go:466] curl -v -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.29.2+rke2r1 (linux/amd64) kubernetes/4b8e819" 'https://127.0.0.1:6443/api/v1/nodes?limit=500'
I0319 19:00:04.678933 19405 round_trippers.go:508] HTTP Trace: Dial to tcp:127.0.0.1:6443 failed: dial tcp 127.0.0.1:6443: connect: connection refused
I0319 19:00:04.678977 19405 round_trippers.go:553] GET https://127.0.0.1:6443/api/v1/nodes?limit=500 in 0 milliseconds
I0319 19:00:04.678996 19405 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 0 ms Duration 0 ms
I0319 19:00:04.679005 19405 round_trippers.go:577] Response Headers:
I0319 19:00:04.679084 19405 helpers.go:264] Connection error: Get https://127.0.0.1:6443/api/v1/nodes?limit=500: dial tcp 127.0.0.1:6443: connect: connection refused
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
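Since kubectl can't reach the apiserver, the control-plane static pods can still be inspected through containerd directly — a sketch assuming the default RKE2 crictl config and binary paths:

# List control-plane containers without going through the apiserver.
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps -a | grep -E 'etcd|kube-apiserver'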
Can you grab the etcd pod logs?
I have not done any further investigation yet.
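For reference, since etcd runs as a static pod its logs land on disk even with the apiserver down — a sketch assuming the default kubelet pod-log layout:

# Read the etcd static pod logs directly from disk.
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log

# Or fetch them through the container runtime.
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl logs "$(/var/lib/rancher/rke2/bin/crictl ps -a --name etcd -q | head -n1)"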