rancher / rke

Rancher Kubernetes Engine (RKE) is an extremely simple, lightning-fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

AWS cloud_provider configuration prevents kube-apiserver from starting #1805

Closed: DH-Rancher closed 4 years ago

DH-Rancher commented 4 years ago

RKE version:

rke -v                                       
rke version v1.0.0

Docker version: (docker version, docker info preferred)

Client:
 Debug Mode: false

Server:
 Containers: 33
  Running: 20
  Paused: 0
  Stopped: 13
 Images: 13
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-1051-aws
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.851GiB
 Name: ip-172-31-34-199
 ID: RZHB:AHJA:4XEB:F6IU:W7UB:CZYI:XJCU:VZH5:XNJE:L3TK:Q23D:5T46
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

AWS

cluster.yml file:

nodes:
  - address: OMITTED
    internal_address: OMITTED
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path:  OMITTED

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

cloud_provider:
    name: aws

Steps to Reproduce:

Run `rke up --config ./rancher-cluster.yml`, pointing at the manifest above.

Results:

The health check of the kube-apiserver service fails:

FATA[0088] [controlPlane] Failed to bring up Control Plane: [Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [18.130.246.46]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1128 14:55:39.185072       1 config.go:56] Error reading from cloud configuration file /etc/kubernetes/cloud-config: &os.PathError{Op:"open", Path:"/etc/kubernetes/cloud-config", Err:0x2}] 

The message indicates that kube-apiserver cannot read /etc/kubernetes/cloud-config; Err:0x2 is errno 2 (ENOENT, "no such file or directory"). On the host, the file indeed does not exist:

ls -la /etc/kubernetes/
total 12
drwxr-xr-x  3 root root 4096 Nov 28 14:57 .
drwxr-xr-x 95 root root 4096 Nov 28 12:36 ..
drwxr-xr-x  2 root root 4096 Nov 28 13:21 .tmp

Workaround

Monitor the Docker containers on the host with `watch docker ps`. When the kube-apiserver container is failing to start, run `sudo touch /etc/kubernetes/cloud-config && docker restart kube-apiserver`.
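For reference, a rough scripted version of this manual workaround (a sketch, to be run on the affected host; the polling interval is arbitrary):

# Wait for RKE to create the kube-apiserver container, then create the
# missing (empty) cloud-config and restart the container.
while ! docker ps -a --format '{{.Names}}' | grep -qx kube-apiserver; do
  sleep 2
done
sudo touch /etc/kubernetes/cloud-config
docker restart kube-apiserver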

Note: omitting

cloud_provider:
    name: aws

from the manifest yields a successfully created cluster.

DH-Rancher commented 4 years ago

Referencing for visibility @galal-hussein @superseb

excieve commented 4 years ago

Same issue with:

cloud_provider:
  name: external

Which shouldn't even use --cloud-config if I understand it correctly.

excieve commented 4 years ago

Worked around it in my RKE template like this:

kube-api:
  extra_args:
    cloud-config: ''
kubelet:
  extra_args:
    cloud-config: ''

Not sure if both are required or just kubelet.
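For context, in a standalone cluster.yml these overrides would sit under the services key (a sketch based on the RKE config layout; whether kube-api also needs the override is the open question above):

services:
  kube-api:
    extra_args:
      cloud-config: ''
  kubelet:
    extra_args:
      cloud-config: ''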

Rancheroo commented 4 years ago

Just hit this issue with RKE v1.0 as well. I can provide more info if it will help.

kube-apiserver logs

+ exec kube-apiserver --kubelet-client-certificate=/etc/kubernetes/ssl/kube-apiserver.pem --service-account-key-file=/etc/kubernetes/ssl/kube-service-account-token-key.pem --cloud-config=/etc/kubernetes/cloud-config --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --requestheader-allowed-names=kube-apiserver-proxy-client --tls-private-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --profiling=false --tls-cert-file=/etc/kubernetes/ssl/kube-apiserver.pem --bind-address=0.0.0.0 --advertise-address=172.31.4.11 --storage-backend=etcd3 --etcd-cafile=/etc/kubernetes/ssl/kube-ca.pem --kubelet-client-key=/etc/kubernetes/ssl/kube-apiserver-key.pem --proxy-client-cert-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem --service-node-port-range=30000-32767 --requestheader-username-headers=X-Remote-User --cloud-provider=aws --etcd-keyfile=/etc/kubernetes/ssl/kube-node-key.pem --etcd-servers=https://172.31.4.11:2379,https://172.31.4.14:2379 --allow-privileged=true --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize --etcd-certfile=/etc/kubernetes/ssl/kube-node.pem --etcd-prefix=/registry --secure-port=6443 --anonymous-auth=false --proxy-client-key-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client-key.pem --insecure-port=0 --requestheader-group-headers=X-Remote-Group --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --service-cluster-ip-range=10.43.0.0/16 --requestheader-extra-headers-prefix=X-Remote-Extra- --service-account-lookup=true --runtime-config=authorization.k8s.io/v1beta1=true --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I1202 03:24:11.270667       1 server.go:623] external host was not specified, using 172.31.4.11
I1202 03:24:11.271001       1 server.go:149] Version: v1.16.3
F1202 03:24:11.666224       1 config.go:56] Error reading from cloud configuration file /etc/kubernetes/cloud-config: &os.PathError{Op:"open", Path:"/etc/kubernetes/cloud-config", Err:0x2}

rke debug

DEBU[0085] [healthcheck] Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [13.211.57.225]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused)
DEBU[0090] using "/private/tmp/com.apple.launchd.whYo3UnOCi/Listeners" SSH_AUTH_SOCK
DEBU[0090] using "/private/tmp/com.apple.launchd.whYo3UnOCi/Listeners" SSH_AUTH_SOCK
DEBU[0091] [healthcheck] Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [13.211.132.166]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused)
DEBU[0091] [healthcheck] Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [13.211.57.225]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused)
DEBU[0096] Checking container logs
DEBU[0096] using "/private/tmp/com.apple.launchd.whYo3UnOCi/Listeners" SSH_AUTH_SOCK
DEBU[0096] Checking container logs
DEBU[0096] using "/private/tmp/com.apple.launchd.whYo3UnOCi/Listeners" SSH_AUTH_SOCK
FATA[0097] [controlPlane] Failed to bring up Control Plane: [Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [13.211.132.166]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1202 03:24:11.456913       1 config.go:56] Error reading from cloud configuration file /etc/kubernetes/cloud-config: &os.PathError{Op:"open", Path:"/etc/kubernetes/cloud-config", Err:0x2}]

bmdepesa commented 4 years ago

I was able to reproduce with RKE v1.0.0:

cloud_provider:
  name: aws

`rke up` fails with:

FATA[0120] [controlPlane] Failed to bring up Control Plane: [Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host []: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1203 16:58:56.421984 1 config.go:56] Error reading from cloud configuration file /etc/kubernetes/cloud-config: &os.PathError{Op:"open", Path:"/etc/kubernetes/cloud-config", Err:0x2}]



The same `cluster.yml` succeeds with RKE v0.3.2.

superseb commented 4 years ago

This was changed in https://github.com/rancher/rke/commit/372393ac1bbf0eaf70048b4061c6816df4018a01#diff-822461a71c0db81849eb077c6cf33d47R39. The file content is now checked for being non-empty, and if it is empty, the file is removed. As we deploy the cloud-config without any content for AWS (I guess this was a design choice, to avoid having a different code path for each cloud provider), the file won't be deployed, and RKE even runs a container with rm -f to remove it. This also breaks upgrades (moving from v0.3.2 to v1.0.0). The workaround is to specify a "default" cloud config for AWS; this is also why an upgrade in Rancher (v2.3.2 -> v2.3.3) doesn't hit this issue, as we configure that default when creating the cluster with the AWS cloud provider.

cloud_provider:
  name: aws
  awsCloudProvider:
    global:

The other workaround (mentioned in https://github.com/rancher/rke/issues/1805#issuecomment-559865752) is basically to not set the parameter at all, so the file won't be looked for.
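If you apply the workaround above, a quick way to confirm that the (now empty) cloud-config actually landed on the node is to check for the file directly (a sketch; the user and host are placeholders):

ssh ubuntu@<node> 'ls -la /etc/kubernetes/cloud-config'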

deniseschannon commented 4 years ago

Available in RKE v1.0.1-rc1 and RKE v1.1.0-rc1

bmdepesa commented 4 years ago

Following my steps from this comment: https://github.com/rancher/rke/issues/1805#issuecomment-561264252

The cluster is created successfully with RKE v1.0.1-rc1, and the empty cloud-config is deployed.

The cluster still fails to create with RKE v1.1.0-rc1 with the same error:

FATA[0078] [controlPlane] Failed to bring up Control Plane: [Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [<host>]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1206 18:17:46.603009       1 config.go:56] Error reading from cloud configuration file /etc/kubernetes/cloud-config: &os.PathError{Op:"open", Path:"/etc/kubernetes/cloud-config", Err:0x2}]

and no cloud-config is deployed to the node.

deniseschannon commented 4 years ago

Available in v1.1.0-rc2

izaac commented 4 years ago

I got past the error reported in v1.1.0-rc1 by Brandon here: https://github.com/rancher/rke/issues/1805#issuecomment-562682890

But now I'm getting a different error in v1.1.0-rc3:

FATA[0160] [workerPlane] Failed to bring up Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10248/healthz for service [kubelet] on host [x.x.x.x]: Get http://localhost:10248/healthz: Unable to access the service on localhost:10248. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: F1224 19:38:29.074157   11538 server.go:271] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-xxxxxxxxxxxxx: "error listing AWS instances: \"NoCredentialProviders: no valid providers in chain. Deprecated.\\n\\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors\""]

cluster.yml:

nodes:
  - address: OMITTED
    internal_address: OMITTED
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: OMITTED

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

cloud_provider:
  name: aws

sangeethah commented 4 years ago

@izaac Test it with AWS nodes that have the right IAM profile.

izaac commented 4 years ago

This works with the proper IAM profile and after following the docs to configure the tag requirements: https://rancher.com/docs/rke/latest/en/config-options/cloud-providers/aws/

Tested with v1.1.0-rc3
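For anyone else hitting the NoCredentialProviders error above: per the linked docs, the nodes need an IAM instance profile that lets the AWS cloud provider query and manage EC2 resources, and each node needs the cluster ID tag. A sketch of the tagging step (the instance ID and cluster name are placeholders):

aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/mycluster,Value=owned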

tokiwong commented 4 years ago

Same issue with v1.1.11

Tried setting

cloud_provider:
  name: aws
  awsCloudProvider:
    global: 

and

kube-api:
  extra_args:
    cloud-config: ''
kubelet:
  extra_args:
    cloud-config: ''

Having issues verifying the health check for both the worker and control planes, failing with Failed to verify healthcheck:...

IAM permissions and tags have been verified. Interestingly, the nodes come up as Ready with their internal hostnames, but show <none> for their roles before rke fails.
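A quick way to watch this from the workstation while rke up is still running (a sketch, using the kubeconfig that RKE writes next to the cluster file):

kubectl --kubeconfig kube_config_cluster.yml get nodes -o wide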