rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

[BUG] Rancher-provisioned vSphere node stuck at "Waiting to register with Kubernetes" #45846

Open grzleadams opened 1 month ago

grzleadams commented 1 month ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug I'm trying to set up vSphere provisioning of nodes for a downstream cluster. Although the Rancher nodes and the app (downstream) cluster are in different subnets, the traffic between them is wide open. The app cluster currently has 10 nodes (3 etcd/cp, 7 worker-only). I used the Terraform provider to create a node template, node pool, etc., using the same VM template as all the other vSphere-based nodes. Everything is created correctly in Rancher: the VM gets provisioned, Docker gets installed, the rancher-agent spins up, and then... nothing. There are no errors server-side or agent-side, and the node just sits at "Waiting to register with Kubernetes" forever.
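
For what it's worth, here is roughly what I mean by "wide open"; spot-checks like these from a freshly provisioned VM succeed (hypothetical commands; the agent log below independently confirms /ping is reachable):

# Spot-checks from the new VM; rancher.<domain> is redacted the same
# way as in the agent log below.
curl -skf https://rancher.<domain>/ping && echo " (ping OK)"
nc -zv rancher.<domain> 443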

To Reproduce I can't say exactly, because I don't know what's going wrong yet.

Result

Expected Result The node should register with the cluster.

Screenshots I can provide whatever screenshots would be useful. I'm just not sure what to provide.

Additional context It might be related or not, but when I delete the node in the node pool, Rancher cleans up on its side but the VM is never shut down or deleted in vSphere. I thought this was fixed in 2.5.x.
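
One thing I can check is whether the v3 Node object lingers with finalizers after the delete, since stuck finalizers would explain the skipped vSphere cleanup. A hypothetical check against Rancher's local (management) cluster, where c-p8tjx is the downstream cluster's ID:

# The v3 Node objects for a downstream cluster live in a namespace
# named after the cluster ID on the local cluster.
kubectl --namespace c-p8tjx get nodes.management.cattle.io \
  --output custom-columns=NAME:.metadata.name,FINALIZERS:.metadata.finalizers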

grzleadams commented 1 month ago

I guess the Rancher logs do occasionally contain:

2024/06/17 16:38:00 [DEBUG] Failed to get node for machine [m-lx5x4], preparing to delete

That's the machine ID of the node that I'm trying to get to join the cluster. There's nothing to indicate why there's not a node for that machine, though.
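
Assuming kubectl access to Rancher's local (management) cluster, the v3 Node object for that machine can be dumped to look for a failing condition (c-p8tjx is the downstream cluster's ID):

# m-lx5x4 is the machine from the DEBUG line above; its Node object
# sits in the cluster's namespace on the local cluster.
kubectl --namespace c-p8tjx get nodes.management.cattle.io m-lx5x4 --output yaml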

grzleadams commented 1 month ago

If it helps, here is the (redacted) rancher-agent log.

INFO: Arguments: --server https://rancher.<domain> --token REDACTED --ca-checksum e9bc407434efb1aaf9a3ddbb1155bc767718e444cdc2521a5311243102ec9798 -r -n m-92tbq
INFO: Environment: CATTLE_ADDRESS=10.1.17.139 CATTLE_AGENT_CONNECT=true CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=m-92tbq CATTLE_RANCHER_WEBHOOK_VERSION=103.0.1+up0.4.2 CATTLE_SERVER=https://rancher.<domain> CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0 trust-ad search crl.local
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://rancher.<domain>/ping is accessible
INFO: rancher.<domain> resolves to 172.16.4.100
INFO: Value from https://rancher.<domain>/v3/settings/cacerts is an x509 certificate
time="2024-06-17T18:35:18Z" level=info msg="Listening on /tmp/log.sock"
time="2024-06-17T18:35:18Z" level=info msg="Rancher agent version v2.8.2 is starting"
time="2024-06-17T18:35:18Z" level=info msg="Option worker=false"
time="2024-06-17T18:35:18Z" level=info msg="Option requestedHostname=m-92tbq"
time="2024-06-17T18:35:18Z" level=info msg="Option dockerInfo={d68df0f8-1d98-473f-a0c8-7c8e2fd6ff42 1 1 0 0 1 overlay2 [[Backing Filesystem extfs] [Supports d_type true] [Using metacopy false] [Native Overlay Diff true] [userxattr false]] [] {[local] [bridge host ipvlan macvlan null overlay] [] [awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} true false true true true true true true true true true true false 31 true 41 2024-06-17T18:35:18.283447585Z json-file cgroupfs 1 0 5.4.0-182-generic Ubuntu 20.04.6 LTS 20.04 linux x86_64 https://index.docker.io/v1/ 0xc00037b500 24 33705844736 [] /var/lib/docker    app-cluster1 [provider=vmwarevsphere] false 24.0.9   map[io.containerd.runc.v2:{runc [] <nil>} runc:{runc [] <nil>}] runc {  inactive false  [] 0 0 <nil> []} false  docker-init {d2d58213f83a351ca8f528a95fbd145f5654e957 d2d58213f83a351ca8f528a95fbd145f5654e957} {v1.1.12-0-g51d5e94 v1.1.12-0-g51d5e94} {de40ad0 de40ad0} [name=apparmor name=seccomp,profile=builtin]  [] [WARNING: No swap limit support]}"
time="2024-06-17T18:35:18Z" level=info msg="Option customConfig=map[address:10.1.17.139 internalAddress: label:map[] roles:[] taints:[]]"
time="2024-06-17T18:35:18Z" level=info msg="Option etcd=false"
time="2024-06-17T18:35:18Z" level=info msg="Option controlPlane=false"
time="2024-06-17T18:35:18Z" level=info msg="Connecting to wss://rancher.<domain>/v3/connect with token starting with <REDACTED>"
time="2024-06-17T18:35:18Z" level=info msg="Connecting to proxy" url="wss://rancher.<domain>/v3/connect"
time="2024-06-17T18:35:18Z" level=info msg="Starting plan monitor, checking every 120 seconds"

I'm not sure why all three of the etcd/controlPlane/worker options are set to false... the node spec sets worker to true:

apiVersion: management.cattle.io/v3
kind: Node
<snip>
spec:
  controlPlane: false
  customConfig: null
  desiredNodeTaints: null
  displayName: ''
  etcd: false
  imported: false
  internalNodeSpec: {}
  metadataUpdate:
    annotations: {}
    labels: {}
  nodePoolName: c-p8tjx:devops-worker
  nodeTemplateName: cattle-global-nt:nt-8s9mx
  requestedHostname: app-cluster1
  worker: true
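
For comparison, the arguments the agent container was actually started with can be dumped on the node; note the registration command at the top of the log only shows -r, with no role flags (a hypothetical check):

# Print the entrypoint and arguments the rancher-agent container was
# created with, to compare against worker: true in the node spec.
docker inspect --format '{{.Path}} {{.Args}}' \
  $(docker ps --all --quiet --filter name=rancher-agent)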

grzleadams commented 1 month ago

I've pretty much ruled out issues on the node and the network: if I delete the node pool but keep the VM around, adding that same VM to the cluster with rke up works without a problem (roughly as sketched below).
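
For reference, the manual workaround was roughly this; a sketch, not my exact files (the address matches the agent log above, the SSH user is a placeholder):

# Add the existing VM under the nodes: section of cluster.yml (shown
# as an append for brevity, assuming nodes: is the last section),
# then re-run RKE.
cat >> cluster.yml <<'EOF'
  - address: 10.1.17.139
    user: <ssh-user>
    role: [worker]
EOF
rke up --config cluster.yml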