rancher / rancher

Complete container management platform
http://rancher.com
Apache License 2.0

Error for new RKE cluster on Rancher 2.5.x Cluster health check failed: cluster agent is not ready #30739

Closed stevet284 closed 3 years ago

stevet284 commented 3 years ago

I am getting the error Cluster health check failed: cluster agent is not ready for any new cluster that I build. This is a bare-metal install on Hyper-V VMs (RHEL 8.2). I am testing with 2 nodes: one has all roles, the other has the control plane and etcd roles.

I have tried many things.

Each time the failure looks the same, on the node that has the worker role I see one container that has exited:

docker container ls -a | grep agent
8acb721ebd1c  263ad36fcb47  "run.sh"  2 minutes ago  Exited (1) 9 seconds ago  k8s_cluster-register_cattle-cluster-agent-77cf944646-ck95l_cattle-system_131509c0-ab06-4eb0-b47c-c91532ca9ba0_2

The container log looks like this:

INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local somedomain.com options ndots:5
ERROR: https://rancher.somedomain.com/ping is not accessible (Failed to connect to rancher.somedomain.com port 443: Connection timed out)

If I start the container again, exec into it, and test with curl to https:///ping, it hangs.

However if I try the same tests from other containers on the same node it works:

curl -k https://rancher.somedomain.com/ping
pong

So it does look like some kind of CNI issue. I have tried Flannel, Calico, and Canal; all have the same issue.

We are not running firewalld or SELinux.

Using Rancher v2.4.11, creating an RKE cluster on the same VMs works fine with no errors.

Can someone please advise what other logs to look into to investigate further? Perhaps someone from Rancher could reproduce this in their lab? It should be easy to reproduce.

Many Thanks in advance Steve


Useful Info
Versions: Rancher v2.5.3, UI v2.5.3
stevet284 commented 3 years ago

Could someone from Rancher take a look at this please? It is delaying our deployment. @superseb?

superseb commented 3 years ago

Please supply more info on the setup, the more info supplied the easier it is to reproduce and diagnose:

RHEL8 is only supported on Kubernetes 1.19 without firewalld. Please reproduce with a single node with all roles in a cluster and supply the output. RHEL8 support was validated in https://github.com/rancher/rancher/issues/23045, but maybe something changed in later releases.

stevet284 commented 3 years ago

Hi Seb, Thanks for picking up this issue. I tried a few more times today to build an RKE cluster from Rancher 2.5.3 on RHEL 8.2, but all have the same result:

Cluster health check failed: cluster agent is not ready

For some reason I am unable to attach files here (I get a "something went really wrong" message), so I have created a new public repo and put the files there: https://github.com/stevet284/rancher_logs

I'll test RHEL7 and let you know if that works.

Thanks again. Steve

superseb commented 3 years ago

Based on the IP output, you are running the nodes on the same subnet as the cluster network? (10.42.)

If so, please make sure you create a cluster with unique subnets for cluster network and service network. This is shown on https://rancher.com/docs/rke/latest/en/config-options/services/.
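(Editor's note: the overlap Seb describes can be sanity-checked mechanically. A minimal sketch using Python's standard ipaddress module, assuming the RKE defaults of 10.42.0.0/16 for the cluster network and 10.43.0.0/16 for the service network, and a hypothetical node address on the corporate 10.42.x LAN:)

```python
import ipaddress

# RKE default CIDRs (assumption: cluster network not yet customised)
cluster_cidr = ipaddress.ip_network("10.42.0.0/16")
service_cidr = ipaddress.ip_network("10.43.0.0/16")

# Hypothetical node address on the corporate 10.42.x network
node_ip = ipaddress.ip_address("10.42.5.20")

# If the node's own LAN falls inside the pod network, traffic from the
# agent pod to the Rancher server is routed into the overlay instead of
# out to the real network, which matches the curl timeout seen above.
print(node_ip in cluster_cidr)  # membership in the pod network -> conflict
print(node_ip in service_cidr)
```

A True result for the first check means the cluster CIDR must be moved off 10.42.0.0/16 (as was eventually done below).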

stevet284 commented 3 years ago

Yes Seb, you are correct; I had just figured that out too. Our internal network does indeed use 10.42.x. It took a while to get a working YAML, but it did work eventually.

Thanks for your help!

Here is the cluster.yml that finally worked (flannel) in case it helps anyone else in future:

answers: {}
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
fleet_workspace_name: fleet-default
local_cluster_auth_endpoint:
  enabled: true
name: flanneltest

rancher_kubernetes_engine_config:
  addon_job_timeout: 45
  authentication:
    strategy: x509|webhook
  authorization: {}
  bastion_host:
    ssh_agent_auth: false
  cloud_provider: {}
  dns:
    linear_autoscaler_params: {}
    node_selector: null
    nodelocal:
      ip_address: ''
      node_selector: null
      update_strategy:
        rolling_update: {}
    reversecidrs: null
    stubdomains: null
    update_strategy: {}
    upstreamnameservers: null
  ignore_docker_version: true

  ingress:
    http_port: 0
    https_port: 0
    provider: nginx
  kubernetes_version: v1.19.6-rancher1-1
  monitoring:
    provider: metrics-server
    replicas: 1

  network:
    mtu: 0
    options:
      flannel_backend_port: '4789'
      flannel_backend_type: vxlan
      flannel_backend_vni: '4096'
    plugin: flannel
  restore:
    restore: false

  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 28
        safe_timestamp: false
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      service_cluster_ip_range: 172.19.0.0/16
      service_node_port_range: 30000-32767
    kube-controller:
      cluster_cidr: 172.18.0.0/16
      service_cluster_ip_range: 172.19.0.0/16
    kubelet:
      cluster_dns_server: 172.19.0.10
      cluster_domain: cluster.local
      fail_swap_on: false
      generate_serving_certificate: false
    kubeproxy: {}
    scheduler: {}
  ssh_agent_auth: false
  upgrade_strategy:
    max_unavailable_controlplane: '1'
    max_unavailable_worker: 10%
    node_drain_input:
      delete_local_data: false
      force: false
      grace_period: -1
      ignore_daemon_sets: true
      timeout: 120
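(Editor's note: before applying a config like the one above, the chosen ranges can be verified up front. A sketch with Python's ipaddress module, using the cluster_cidr, service_cluster_ip_range, and cluster_dns_server values from the working cluster.yml, and assuming the node LAN is 10.42.0.0/16 as discussed earlier in the thread:)

```python
import ipaddress

# Values taken from the cluster.yml above
cluster_cidr = ipaddress.ip_network("172.18.0.0/16")  # pod network
service_cidr = ipaddress.ip_network("172.19.0.0/16")  # service network
dns_server   = ipaddress.ip_address("172.19.0.10")    # cluster_dns_server

# Assumed corporate LAN from the discussion above
node_network = ipaddress.ip_network("10.42.0.0/16")

# The three ranges must be pairwise disjoint, and the DNS service IP
# must sit inside the service range.
assert not cluster_cidr.overlaps(service_cidr)
assert not cluster_cidr.overlaps(node_network)
assert not service_cidr.overlaps(node_network)
assert dns_server in service_cidr
print("CIDR layout looks consistent")
```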