rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

Unable to add agent to cluster #409

Open fcioffi opened 4 years ago

fcioffi commented 4 years ago

Hi guys, I'm trying to install a k3os cluster in VBox with two virtual machines, one server and one agent. The first works great, but I can't add the agent. Both virtual machines have two network interfaces.

k3os version v0.9.1 5.0.0-37-generic #40~18.04.1 SMP Wed Jan 15 04:09:29 UTC 2020 x86_64

Steps:

  1. Boot the first VM from the ISO
  2. Run sudo k3os install, accept the default parameters, and set token: "myToken"
  3. Start the first VM from disk
  4. Boot the second VM from the ISO
  5. Run sudo k3os install, accept the default parameters, choose the agent option with URL "https://:6443" and token: "myToken"
  6. Start the second VM from disk

When the second VM starts, I get this in /var/log/k3s-service.log on the server: I0401 13:25:03.838029 2284 log.go:172] http: TLS handshake error from 192.168.56.107:57838: remote error: tls: bad certificate

From the host machine, kubectl works:

$ kubectl cluster-info
Kubernetes master is running at https://192.168.56.105:6443
CoreDNS is running at https://192.168.56.105:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://192.168.56.105:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

but, unfortunately:

$ kubectl get nodes
NAME        STATUS   ROLES    AGE   VERSION
k3os-9564   Ready    master   80m   v1.17.2+k3s1

shows only the master.

Can you help me? Thanks, Francesco

digitalism commented 4 years ago

Maybe have a look at https://github.com/digitalism/k3os-box, I just got networking of a 3 node cluster working.

k3os-server [~]$ kubectl get nodes
NAME          STATUS   ROLES    AGE     VERSION
k3os-server   Ready    master   8m37s   v1.17.2+k3s1
k3os-1        Ready    <none>   8m12s   v1.17.2+k3s1
k3os-2        Ready    <none>   8m12s   v1.17.2+k3s1
k3os-3        Ready    <none>   8m12s   v1.17.2+k3s1
evoncken commented 4 years ago

I have the exact same issue on the k3os 0.10.0 release.

I seem to remember that my previous install (using an older version of k3os) did NOT have this issue; I had a 3-node cluster up and running using the same method.

BibbyChung commented 4 years ago

Me too, I'm hitting the same issue.

The k3os version is v0.11.0-rc1.

The error message on the k3os server is:

I0710 10:18:04.777834 2458 log.go:172] http: TLS handshake error from 192.168.31.126:54474: remote error: tls: bad certificate time="2020-07-10T10:18:04.837090973Z" level=error msg="Node password validation failed for 'miwifi-r1cm-srv', using passwd file '/var/lib/rancher/k3s/server/cred/node-passwd'"

But if I kill the k3s agent and then join it manually, it works fine:

k3s agent --with-node-id --server https://192.168.31.211:6443 --token "K10a0d38146071578aa46b8e81b34048d03d0f377f8b7ace2e48bbec6c234b36e95::server:1234"

please help...

BibbyChung commented 4 years ago

I found the solution: set up NTP servers and make sure the two VMs have different hostnames. Then everything works... magic ^^||

hostname: test-master
ntp_servers:
- 0.us.pool.ntp.org
- 1.us.pool.ntp.org
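
Since the fix above came down to hostname uniqueness plus time sync, a quick pre-flight check on each node before joining can rule both out. This is a sketch using generic commands, not anything k3os-specific:

```shell
# Pre-flight check before joining an agent: duplicate hostnames and
# clock skew are two common causes of the "tls: bad certificate" and
# node password validation errors seen above.
hostname            # must be unique per node
date -u +%s         # compare this value across nodes; large skew breaks TLS validation
```

Run it on the server and on each agent and compare the outputs side by side.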
sb3rg commented 4 years ago

Any updates here? I'm having the same issue on release 0.11.0-rc1. I followed the same steps as @evoncken, but the agent doesn't connect. I've added the token from the server and the server IP. My agent YAML file is below; the server config is similar but without the --server and --token flags, and it works just fine. What am I missing?

k3os agent config

ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQ.....
hostname: z420_2
k3os:
  k3s_args:
    - agent
    - "--node-ip=192.168.1.45"
    - "--flannel-iface=eth0"
    - "--server=https://192.168.1.43:6443"
    - "--token=K105a96558bd927049955bc6a9060aaabff0b6dafd7e3fc286f0e21dfae57ac1b67::server:2f44c9b100d35cbd9c8c78caaa4cc0b4"
chadmayfield commented 3 years ago

Any updates? I'm seeing this on v0.11.1 as well.

chriscarpenter12 commented 3 years ago

Using v0.11.1 I'm able to join an agent, but after applying a post-install config to change the hostname, configure a static IP, and add a worker label, the agent fails to connect back to the master.

Server Config - /var/lib/rancher/k3os/config.yaml

hostname: k3s-master
write_files:
- path: /var/lib/connman/default.config
  content: |-
    [service_eth0]
    Type=ethernet
    IPv4=10.1.1.50/255.255.255.0/10.1.1.1
    IPv6=off
    Nameservers=10.1.1.1
k3os:
  dns_nameservers:
  - 10.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  - 1.us.pool.ntp.org

Agent Config - /var/lib/rancher/k3os/config.yaml

hostname: k3s-worker1
write_files:
- path: /var/lib/connman/default.config
  content: |-
    [service_eth0]
    Type=ethernet
    IPv4=10.1.1.51/255.255.255.0/10.1.1.1
    IPv6=off
    Nameservers=10.1.1.1
k3os:
  dns_nameservers:
  - 10.1.1.1
  labels:
    node-role.kubernetes.io/worker: ""
  ntp_servers:
  - 0.us.pool.ntp.org
  - 1.us.pool.ntp.org

(attachments: agent.log, server.log)

Edit: I have even tried upgrading the server to v0.19.4-dev.5 and pruning stale entries in /var/lib/rancher/k3s/server/cred/node-passwd, but no joy. Upon agent reboot the node is added back to the server's node-passwd file, but the log still shows a bad certificate.

time="2020-11-26T17:46:47.845635098Z" level=info msg="Handling backend connection request [k3s-worker1]"
time="2020-11-26T17:46:47.855621917Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-11-26T17:46:53.849317918Z" level=info msg="Cluster-Http-Server 2020/11/26 17:46:53 http: TLS handshake error from 10.1.1.51:53260: remote error: tls: bad certificate"
time="2020-11-26T17:46:53.879528915Z" level=info msg="Cluster-Http-Server 2020/11/26 17:46:53 http: TLS handshake error from 10.1.1.51:53272: remote error: tls: bad certificate"
time="2020-11-26T17:46:53.901297377Z" level=info msg="certificate CN=k3s-worker1 signed by CN=k3s-server-ca@1606402873: notBefore=2020-11-26 15:01:13 +0000 UTC notAfter=2021-11-26 17:46:53 +0000 UTC"
time="2020-11-26T17:46:53.906918046Z" level=info msg="certificate CN=system:node:k3s-worker1,O=system:nodes signed by CN=k3s-client-ca@1606402873: notBefore=2020-11-26 15:01:13 +0000 UTC notAfter=2021-11-26 17:46:53 +0000 UTC"
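
The pruning step mentioned above could be sketched like this. It assumes (my assumption, not verified against the k3s source) that each line of node-passwd contains the node's hostname, and that k3s is stopped while editing:

```shell
# Hypothetical cleanup on the server node. Assumes each node-passwd line
# contains the stale node's hostname (an assumption about the file format).
NODE="k3s-worker1"                                # node entry to prune
CRED=/var/lib/rancher/k3s/server/cred/node-passwd
cp "$CRED" "$CRED.bak"                            # keep a backup first
grep -v "$NODE" "$CRED.bak" > "$CRED"             # drop the stale entry
```

As the edit above shows, this alone may not help if the agent re-registers with a mismatched certificate.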

Edit 2: Removing the label from the agent config finally allowed it to join the master. I figured this out by killing the k3s process on the agent machine and trying different args; when I removed the label, it joined successfully.

EugenMayer commented 3 years ago

Same here with 0.11.1, using the standard default approach:

server

ssh_authorized_keys:
- <redacted>
hostname: k3os.<redacted>
k3os:
  modules:
  - kvm
  - nvme
  dns_nameservers:
  - 1.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  token: supersecret

agent

ssh_authorized_keys:
- ssh-rsa <redacted>
hostname: k3os-agent.<redacted>
k3os:
  modules:
  - kvm
  - nvme
  dns_nameservers:
  - 1.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  server_url: https://10.xx.xx.serverip:6443
  token: K1023d39969b1298dfb394bde1a93bcae9c5c7bc4dea29fa28c1b87a6344308613a::server:supersecret

I already use different hostnames and NTP servers. As others noted, running it manually on the CLI works fine on the agent node:

sudo k3s agent --with-node-id --server https://10.10.xx.serverip:6443 --token "K1023d39969b1298dfb394bde1a93bcae9c5c7bc4dea29fa28c1b87a6344308613a::server:supersecret"

As far as I understand the docs and the k3s installation script (https://github.com/k3s-io/k3s/blob/master/install.sh#L161), the mode is derived either from dedicated k3s_args or, if those are not set, from the following cases:

a) if server_url is set, the token is treated as the cluster secret to join, and the command will be agent.
b) if no server_url is provided, the token becomes the cluster secret, and the command will be server.
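
My reading of that selection logic, sketched in shell (an illustration of the rules above, not the actual install.sh code; the variable values are hypothetical):

```shell
# Sketch of the mode selection described above: the presence of a
# server URL decides agent vs. server mode (illustration only).
K3S_URL="https://10.0.0.1:6443"   # hypothetical; empty string means server mode
K3S_TOKEN="supersecret"           # hypothetical token

if [ -n "$K3S_URL" ]; then
    CMD_K3S="agent"    # token is used as the cluster secret to join
else
    CMD_K3S="server"   # token becomes the new cluster's secret
fi
echo "mode: $CMD_K3S"  # prints "mode: agent" for the values above
```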

That said, the YAML files above should cause the agent to connect at boot time, which is not the case.

In the case of @digitalism (https://github.com/digitalism/k3os-box/blob/master/scripts/configure_k3s_node.sh#L27), the entire k3s_args are constructed explicitly in the same way. I did not try this in my case, but @sb3rg did and seems to have failed (which I would not have expected).

UPDATE: I was able to add the agent without any CLI manipulation after cloud-init:

ssh_authorized_keys:
- ssh-rsa <redacted>
hostname: k3os-agent.<redacted>
k3os:
  modules:
  - kvm
  - nvme
  dns_nameservers:
  - 1.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  # we cannot use server_url nor token since we need to override k3s_args which would override / overrule those 2
  # server_url: https://10.xx.xx.serverip:6443
  # token: supersecret
  k3s_args:
    - agent
    - "--server=10.x.x.serverip:6443"
    - "--token=supersecret"
    - "--with-node-id"

Hope this helps anybody else. Beyond that, I'm not really sure k3os is a serious consideration given its current state in terms of documentation, drive, and support for getting even a minimal setup like this up and running. k3s seems to have a lot of momentum; k3os seems to fall behind. Having used RancherOS for years, I say this with a sad heart, but maybe there is just not enough reason for Rancher to push or invest in k3os - fair enough.

UPDATE2: Be aware that if you use k3s_args for the agent as given above, you will fail to reconfigure the agent later using k3os install or k3os config, e.g. when the server IP (or the token) changes. Those commands would only introduce the token: and server_url: keys in /var/lib/rancher/k3os/config.yaml, and since they are overridden by k3s_args, those values are ignored and take no effect. An ugly side effect.

You can work around this by editing /var/lib/rancher/k3os/config.yaml by hand and then running k3os config.
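
That manual workaround could look like the following sketch. The sed pattern and the new server address are my assumptions; the final k3os config step re-applies the file as described above:

```shell
# Hypothetical workaround: rewrite the --server= value baked into
# k3s_args in the k3os config file, then re-apply it with k3os config.
NEW_SERVER="https://10.0.0.99:6443"          # hypothetical new server address
CFG=/var/lib/rancher/k3os/config.yaml
sed -i "s|--server=[^\"]*|--server=$NEW_SERVER|" "$CFG"
k3os config                                  # re-apply the edited config
```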

UPDATE3: After a couple more installation experiments I could make the best of it. The assumption that k3s_args overrides server_url and token by default was wrong - those are still added to k3s_args in addition to whatever we place there. This also means the agent subcommand is not needed, since it is added automatically when server_url is set.

So the final configs, which can also be modified and reconfigured later using k3os install or k3os config, would be:

server

ssh_authorized_keys:
- <redacted>
hostname: k3os.<redacted>
k3os:
  modules:
  - kvm
  - nvme
  dns_nameservers:
  - 1.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  token: supersecret

agent

ssh_authorized_keys:
- ssh-rsa <redacted>
hostname: k3os-agent.<redacted>
k3os:
  modules:
  - kvm
  - nvme
  dns_nameservers:
  - 1.1.1.1
  ntp_servers:
  - 0.us.pool.ntp.org
  # server_url and token still work here - they are appended to k3s_args (see UPDATE3)
  server_url: https://10.xx.xx.serverip:6443
  token: supersecret
  k3s_args:
    # needed if we use the same hypervisor as the master node
    - "--with-node-id"