rancher / rke

Rancher Kubernetes Engine (RKE) is an extremely simple, lightning-fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

Cluster agent not ready for 24 hours, apiserver saying v1beta1.metrics.k8s.io has been modified #2327

Closed: Mythobeast closed this issue 3 years ago

Mythobeast commented 3 years ago

When the cluster tries to start up, the kube-apiserver instances repeatedly log the following lines.

Rancher is running in Docker on 10.90.48.101; the nodes are on ~.102, ~.103, and ~.111.

Any clue to get me going would be welcome.

I1111 22:50:03.392662       1 client.go:360] parsed scheme: "passthrough"
I1111 22:50:03.392720       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.90.48.102:2379  <nil> 0 <nil>}] <nil> <nil>}
I1111 22:50:03.392732       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
E1111 22:50:04.399597       1 available_controller.go:437] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
E1111 22:50:14.420179       1 available_controller.go:437] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.4.39:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.4.39:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
E1111 22:50:24.428815       1 available_controller.go:437] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
E1111 22:50:25.077096       1 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: dial tcp 10.43.4.39:443: i/o timeout
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I1111 22:50:25.077118       1 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
E1111 22:50:29.440834       1 available_controller.go:437] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.4.39:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.4.39:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
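The timeouts above all point at 10.43.4.39, the cluster-internal IP serving `v1beta1.metrics.k8s.io`, which the apiserver can only reach over the overlay network. A quick way to narrow this down is to check where the APIService points and probe that endpoint directly from a controlplane node. This is a hedged diagnostic sketch, assuming `kubectl` access and the upstream `k8s-app=metrics-server` label; a connection timeout here usually means overlay traffic is being dropped, not that metrics-server itself is broken:

```shell
# Where does the aggregated API point, and where is metrics-server running?
kubectl get apiservice v1beta1.metrics.k8s.io -o wide
kubectl -n kube-system get pods -o wide -l k8s-app=metrics-server

# From a controlplane node, probe the same endpoint the apiserver is
# failing to reach (IP taken from the log lines above). An immediate
# TLS/HTTP error is fine here; a hang until the timeout indicates the
# overlay network is not passing traffic between nodes.
curl -k --connect-timeout 5 https://10.43.4.39:443/apis/metrics.k8s.io/v1beta1
```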

RKE version: rancher/hyperkube:v1.19.3-rancher1

Docker version: 19.03.13, build 4484c46d9d

Operating system and kernel: 3.10.0-1127.19.1.el7.x86_64

Type/provider of hosts: Bare-metal

cluster.yml file:

# 
# Cluster Config
# 
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: false
enable_network_policy: false
fleet_workspace_name: fleet-default
local_cluster_auth_endpoint:
  enabled: true
name: dhdc-k8s-b
# 
# Rancher Config
# 
rancher_kubernetes_engine_config:
  addon_job_timeout: 30
  authentication:
    strategy: x509|webhook
  bastion_host:
    ssh_agent_auth: false
  dns:
    linear_autoscaler_params: {}
    node_selector: null
    nodelocal:
      ip_address: ''
      node_selector: null
      update_strategy: {}
    reversecidrs: null
    stubdomains: null
    update_strategy: {}
    upstreamnameservers: null
  ignore_docker_version: true
# 
# # Currently only nginx ingress provider is supported.
# # To disable ingress controller, set `provider: none`
# # To enable ingress on specific nodes, use the node_selector, eg:
#    provider: nginx
#    node_selector:
#      app: ingress
# 
  ingress:
    provider: nginx
  kubernetes_version: v1.19.3-rancher1-2
  monitoring:
    provider: metrics-server
    replicas: 1
# 
#   If you are using calico on AWS
# 
#    network:
#      plugin: calico
#      calico_network_provider:
#        cloud_provider: aws
# 
# # To specify flannel interface
# 
#    network:
#      plugin: flannel
#      flannel_network_provider:
#      iface: eth1
# 
# # To specify flannel interface for canal plugin
# 
#    network:
#      plugin: canal
#      canal_network_provider:
#        iface: eth1
# 
  network:
    mtu: 0
    options:
      flannel_backend_type: vxlan
    plugin: canal
  restore:
    restore: false
# 
#    services:
#      kube-api:
#        service_cluster_ip_range: 10.43.0.0/16
#      kube-controller:
#        cluster_cidr: 10.42.0.0/16
#        service_cluster_ip_range: 10.43.0.0/16
#      kubelet:
#        cluster_domain: cluster.local
#        cluster_dns_server: 10.43.0.10
# 
  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 6
        safe_timestamp: false
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      service_node_port_range: 30000-32767
    kubelet:
      fail_swap_on: false
      generate_serving_certificate: false
  ssh_agent_auth: false
  upgrade_strategy:
    drain: false
    max_unavailable_controlplane: '1'
    max_unavailable_worker: 10%
    node_drain_input:
      delete_local_data: false
      force: false
      grace_period: -1
      ignore_daemon_sets: true
      timeout: 120
scheduled_cluster_scan:
  enabled: false
  scan_config:
    cis_scan_config:
      override_benchmark_version: rke-cis-1.5
      profile: permissive
  schedule_config:
    cron_schedule: 0 0 * * *
    retention: 24
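The config above uses the Canal plugin with `flannel_backend_type: vxlan`, so inter-node pod traffic is encapsulated in UDP on port 8472; on bare-metal CentOS 7 hosts with firewalld enabled, that port (and the standard RKE control-plane ports) must be open between all nodes or symptoms like the ones in this issue appear. A sketch, assuming firewalld and the port list from the RKE port-requirements docs (verify against your own environment; this is not an exhaustive list):

```shell
# Run on every node; ports per the RKE/Canal requirements.
firewall-cmd --permanent --add-port=6443/tcp       # kube-apiserver
firewall-cmd --permanent --add-port=2379-2380/tcp  # etcd client/peer
firewall-cmd --permanent --add-port=10250/tcp      # kubelet
firewall-cmd --permanent --add-port=8472/udp       # flannel vxlan overlay
firewall-cmd --reload
```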

Steps to Reproduce:

1. Provision servers with CentOS 7
2. Create cluster with default values
3. Add nodes to cluster

Results: Nodes register, but the cluster stays in the state "Cluster health check failed: cluster agent is not ready".

superseb commented 3 years ago

The agents are part of Rancher; please file this using https://github.com/rancher/rancher/issues/new. Please search through existing issues first and use https://rancher.com/docs/rancher/v2.x/en/troubleshooting/. Based on what is provided, the overlay network is not being created successfully, which could be due to a (host) firewall, or possibly multi-homed hosts with wrong interface auto-detection.
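One way to confirm an overlay-network problem like the one described above is the pod-to-pod test from the Rancher troubleshooting docs: run a pod on every node and ping each pod's overlay IP from every other pod. This is a hedged sketch modeled on that approach; the name, labels, and image here are illustrative, not the exact manifest from the docs:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: overlaytest
spec:
  selector:
    matchLabels:
      app: overlaytest
  template:
    metadata:
      labels:
        app: overlaytest
    spec:
      containers:
      - name: overlaytest
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
```

After applying it, list the pods with `kubectl get pods -o wide -l app=overlaytest` and, from each pod, ping the overlay IPs of the pods on the other nodes (e.g. `kubectl exec <pod> -- ping -c 2 <other-pod-ip>`). Pings that only fail across node boundaries point at the firewall or interface-detection issues mentioned above.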