Closed: stevet284 closed this issue 3 years ago.
Could someone from Rancher take a look at this please? It is delaying our deployment. @superseb?
Please supply more info on the setup, the more info supplied the easier it is to reproduce and diagnose:
- docker info from the nodes
- docker ps -a from the nodes
- ip a s from the nodes
- kubelet container logs from the nodes

RHEL8 is only supported on Kubernetes 1.19 without firewalld. Please reproduce with a single node with all roles in a cluster and supply the output. RHEL8 support was validated in https://github.com/rancher/rancher/issues/23045 but maybe something changed in later releases.
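For reference, a minimal sketch of collecting that output on each node (the file names are just examples; RKE runs the kubelet as a Docker container named kubelet, and the last line checks the firewalld point above):

# run on each node; the redirect targets are illustrative
docker info          > docker-info.txt 2>&1
docker ps -a         > docker-ps.txt 2>&1
ip a s               > ip-a.txt 2>&1
docker logs kubelet  > kubelet.log 2>&1
systemctl is-active firewalld   # should not be "active" on RHEL 8 per the note above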
Hi Seb, Thanks for picking up this issue. I tried a few more times today to build an RKE cluster from Rancher 2.5.3 on RHEL 8.2, but all have the same result:
Cluster health check failed: cluster agent is not ready
For some reason I am unable to attach files here (I get a "something went really wrong" message), so I have created a new public repo and put the files there: https://github.com/stevet284/rancher_logs
I'll test RHEL7 and let you know if that works.
Thanks again. Steve
Based on the IP output, you are running the nodes on the same subnet as the cluster network (10.42.x)?
If so, please make sure you create a cluster with unique subnets for cluster network and service network. This is shown on https://rancher.com/docs/rke/latest/en/config-options/services/.
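A quick way to spot this kind of overlap before building the cluster (a rough sketch; 10.42.0.0/16 and 10.43.0.0/16 are the RKE defaults for the pod and service networks):

# list each node's IPv4 addresses; if any of them fall inside the
# cluster_cidr or service_cluster_ip_range you intend to use
# (defaults 10.42.0.0/16 and 10.43.0.0/16), pick different ranges in cluster.yml
ip -o -4 addr show | awk '{print $2, $4}'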
Yes Seb, you are correct, I had just figured that out too: our internal network does indeed use 10.42.x. It took a while to get a working YAML, but it did work eventually.
Thanks for your help!
Here is the cluster.yml that finally worked (flannel) in case it helps anyone else in future:
answers: {}
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
fleet_workspace_name: fleet-default
local_cluster_auth_endpoint:
  enabled: true
name: flanneltest
rancher_kubernetes_engine_config:
  addon_job_timeout: 45
  authentication:
    strategy: x509|webhook
  authorization: {}
  bastion_host:
    ssh_agent_auth: false
  cloud_provider: {}
  dns:
    linear_autoscaler_params: {}
    node_selector: null
    nodelocal:
      ip_address: ''
      node_selector: null
      update_strategy:
        rolling_update: {}
    reversecidrs: null
    stubdomains: null
    update_strategy: {}
    upstreamnameservers: null
  ignore_docker_version: true
  ingress:
    http_port: 0
    https_port: 0
    provider: nginx
  kubernetes_version: v1.19.6-rancher1-1
  monitoring:
    provider: metrics-server
    replicas: 1
  network:
    mtu: 0
    options:
      flannel_backend_port: '4789'
      flannel_backend_type: vxlan
      flannel_backend_vni: '4096'
    plugin: flannel
  restore:
    restore: false
  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 28
        safe_timestamp: false
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      service_cluster_ip_range: 172.19.0.0/16
      service_node_port_range: 30000-32767
    kube-controller:
      cluster_cidr: 172.18.0.0/16
      service_cluster_ip_range: 172.19.0.0/16
    kubelet:
      cluster_dns_server: 172.19.0.10
      cluster_domain: cluster.local
      fail_swap_on: false
      generate_serving_certificate: false
    kubeproxy: {}
    scheduler: {}
  ssh_agent_auth: false
  upgrade_strategy:
    max_unavailable_controlplane: '1'
    max_unavailable_worker: 10%
    node_drain_input:
      delete_local_data: false
      force: false
      grace_period: -1
      ignore_daemon_sets: true
      timeout: 120
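Once a cluster built from this YAML is up, a quick sanity check that the non-default ranges actually took effect could look like this (assumes kubectl access to the new cluster):

# node pod CIDRs should fall inside cluster_cidr (172.18.0.0/16)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# the DNS service IP should match cluster_dns_server (172.19.0.10)
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}{"\n"}'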
I am getting the error "Cluster health check failed: cluster agent is not ready" for any new clusters that I build. This is a bare-metal install on Hyper-V VMs (RHEL 8.2). Testing with 2 nodes: one has all roles, the other has the control plane and etcd roles.
I have tried many things. Each time the failure looks the same: on the node that has the worker role I see one container that has exited:
docker container ls -a | grep agent
8acb721ebd1c   263ad36fcb47   "run.sh"   2 minutes ago   Exited (1) 9 seconds ago   k8s_cluster-register_cattle-cluster-agent-77cf944646-ck95l_cattle-system_131509c0-ab06-4eb0-b47c-c91532ca9ba0_2
The container log looks like this:
INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local somedomain.com options ndots:5
ERROR: https://rancher.somedomain.com/ping is not accessible (Failed to connect to rancher.somedomain.com port 443: Connection timed out)
If I start the container again, exec into it, and test with curl to https://rancher.somedomain.com/ping, it hangs.
However, if I try the same tests from other containers on the same node, it works:
curl -k https://rancher.somedomain.com/ping
pong
So it does look like some kind of CNI issue. I have tried Flannel, Calico, and Canal; all have the same issue.
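This fits the subnet overlap identified above: with the default pod network (10.42.0.0/16), the CNI installs routes on each node that capture any 10.42.x destination, including a Rancher server that lives on that same subnet. A rough way to see those routes on a worker node (interface names are the flannel/canal defaults; Calico uses different ones):

# routes created by the CNI; any destination inside 10.42.0.0/16 is sent
# into the overlay instead of out of the node, so pods such as the
# cluster agent can no longer reach hosts that share that range
ip route | grep -E 'cni0|flannel\.1'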
We are not running firewalld or SELinux.
Using Rancher v2.4.11, creating an RKE cluster on the same VMs works fine with no errors.
Please can someone advise what other logs to look into to investigate further? Perhaps someone from Rancher could reproduce this in their labs? It should be easy to reproduce.
Many thanks in advance, Steve
Rancher version: v2.5.3
UI: v2.5.3