"k0s start" on worker node fails with "Error: service in failed state"

sebthom commented 2 years ago

Before creating an issue, make sure you've checked the following:

[X] You are running the latest released version of k0s
[X] Make sure you've searched for existing issues, both open and closed
[X] Make sure you've searched for PRs too, a fix might've been merged already
[X] You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Version

v1.23.5+k0s.0

Platform

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

What happened?

I followed the multi node setup described at https://docs.k0sproject.io/v1.23.5+k0s.0/k0s-multi-node/

Setup of the controller node worked as described. However when trying to start a worker node the following error message appears without any further information:

$ sudo k0s start
Error: service in failed state
Usage:
  k0s start [flags]

Flags:
  -h, --help   help for start

Global Flags:
      --data-dir string   Data Directory for k0s (default: /var/lib/k0s). DO NOT CHANGE for an existing setup, things will break!
      --debug             Debug logging (default: false)

Steps to reproduce

On controller node execute:

  sudo su - root
  curl -sSLf https://get.k0s.sh | sh
  k0s install controller
  k0s start
  sleep 5
  k0s status
  k0s token create --role=worker --expiry=1h > /tmp/worker-join-token

Copy /tmp/worker-join-token from controller to worker node

On worker node execute:

  sudo su - root
  curl -sSLf https://get.k0s.sh | sh 
  k0s install worker --token-file /tmp/worker-join-token
  k0s start

Expected behavior

The worker node gets federated into the controller and the k0s service on the worker node starts without failures.

Actual behavior

Starting k0s on the worker node fails with Error: service in failed state

Screenshots and logs

No response

Additional context

$ k0s sysinfo
KERNEL_VERSION: 5.4.0-107-generic
CONFIG_INET: enabled
CONFIG_NETFILTER_XT_TARGET_REDIRECT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled (as module)
CONFIG_NAMESPACES: enabled
CONFIG_UTS_NS: enabled
CONFIG_IPC_NS: enabled
CONFIG_PID_NS: enabled
CONFIG_NET_NS: enabled
CONFIG_CGROUPS: enabled
CONFIG_CGROUP_FREEZER: enabled
CONFIG_CGROUP_PIDS: enabled
CONFIG_CGROUP_DEVICE: enabled
CONFIG_CPUSETS: enabled
CONFIG_CGROUP_CPUACCT: enabled
CONFIG_MEMCG: enabled
CONFIG_CGROUP_SCHED: enabled
CONFIG_FAIR_GROUP_SCHED: enabled
CONFIG_EXT4_FS: enabled
CONFIG_PROC_FS: enabled
CONFIG_OVERLAY_FS: enabled (as module)
CONFIG_BLK_DEV_DM: enabled
CONFIG_CFS_BANDWIDTH: enabled
CONFIG_CGROUP_HUGETLB: enabled
CONFIG_SECCOMP: enabled
CONFIG_SECCOMP_FILTER: enabled
CONFIG_BRIDGE: enabled (as module)
CONFIG_IP6_NF_FILTER: enabled (as module)
CONFIG_IP6_NF_IPTABLES: enabled (as module)
CONFIG_IP6_NF_MANGLE: enabled (as module)
CONFIG_IP6_NF_NAT: enabled (as module)
CONFIG_IP_NF_FILTER: enabled (as module)
CONFIG_IP_NF_IPTABLES: enabled (as module)
CONFIG_IP_NF_MANGLE: enabled (as module)
CONFIG_IP_NF_NAT: enabled (as module)
CONFIG_IP_NF_TARGET_REJECT: enabled (as module)
CONFIG_IP_SET: enabled (as module)
CONFIG_IP_SET_HASH_IP: enabled (as module)
CONFIG_IP_SET_HASH_NET: enabled (as module)
CONFIG_IP_VS_NFCT: enabled
CONFIG_LLC: enabled (as module)
CONFIG_NETFILTER_NETLINK: enabled (as module)
CONFIG_NETFILTER_XTABLES: enabled (as module)
CONFIG_NETFILTER_XT_MARK: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_MULTIPORT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_RECENT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_STATISTIC: enabled (as module)
CONFIG_NETFILTER_XT_NAT: enabled (as module)
CONFIG_NETFILTER_XT_SET: enabled (as module)
CONFIG_NETFILTER_XT_TARGET_MASQUERADE: enabled (as module)
CONFIG_NF_CONNTRACK: enabled (as module)
CONFIG_NF_DEFRAG_IPV4: enabled (as module)
CONFIG_NF_DEFRAG_IPV6: enabled (as module)
CONFIG_NF_NAT: enabled (as module)
CONFIG_NF_REJECT_IPV4: enabled (as module)
CONFIG_STP: enabled (as module)
OS: Linux
CGROUPS_CPU: enabled
CGROUPS_CPUACCT: enabled
CGROUPS_CPUSET: enabled
CGROUPS_DEVICES: enabled
CGROUPS_FREEZER: enabled
CGROUPS_MEMORY: enabled
CGROUPS_PIDS: enabled
CGROUPS_HUGETLB: enabled

odidev commented 2 years ago

Hi Team,

I am also facing this issue while executing "k0s start" in the worker node. Even "k0s stop" and "k0s delete" commands are also showing the same error after executing "k0s install worker" in the worker node. So I am unable to stop and reset k0s service.

Can you please provide some pointers?

twz123 commented 2 years ago

The error message stems from the fact that the k0s systemd service is in a failed state. The k0s start command looks for the installed service and checks the status of the service to check if it's actually installed. That's where the "failed" error slips through. K0s should probably treat that case a bit differently in the start/stop/delete subcommands.

Can you try to start k0s manually via systemctl? Does it work then?

sebthom commented 2 years ago

Apparently in my case an extra new line char was added accidentally to the token file on the worker nodes which prevented them from joining the controller. After I fixed this it works now.

twz123 commented 2 years ago

Thanks for the feedback. Even if the root cause in your case was some configuration error, I'll be reopening this, since there's definitely something to be improved here on k0s side.

jnummelin commented 2 years ago

What if we check the service state in k0s start and if we detect it failing we could print that info out to user with some help context like Service failed to start, check the logs with journalctl ...?

odidev commented 2 years ago

@twz123 , thank you for reopening the ticket.

As @sebthom said, I checked my token-file and there was no extra character added. Yet, I am still getting the same issue. FYI, I am working on AWS x64 ubuntu instance as a controller node and another AWS Ubuntu x64 instance as a worker node.

Also, as mentioned above, I checked the systemctl list on the worker node, the k0sworker.service was found failing. I restarted the service manually using systemctl, but that did not resolve the issue.

Also, sudo k0s kubectl get nodes shows "no resources found" on both the nodes - controller and worker.

jnummelin commented 2 years ago

k0sworker.service was found failing

As the service is failing, the logs probably contain some hints why it fails. Check with journalctl -u k0sworker.service ...

odidev commented 2 years ago

Hi Team, I again followed the manual installation for k0s.

I successfully created the controller node on an AWS Ubuntu instance and created another worker node on another AWS instance, using the join token created in the controller node.

FYI: I needed to include ‘--enable-worker’ flag along with k0s install controller command.

While I execute k0s kubectl get nodes in the controller node, below output is received:

NAME                                         STATUS   ROLES           AGE     VERSION 
ip-172-31-5-166.us-east-2.compute.internal   Ready    control-plane   2m52s   v1.24.2+k0s

This is OK as per the documents.

But my understanding says that the k0s kubectl get nodes command should also display the information of the worker node, since worker node is successfully created using the join token.

May I know, am I correct in my understanding, or is the above output an expected behavior?

jnummelin commented 2 years ago

May I know, am I correct in my understanding, or is the above output an expected behavior?

if you have another instance running with k0s worker it should appear in the node list. So something is off on the pure worker node.

I would advise to check the logs on the worker node using something like sudo journalctl -u k0sworker. That should hopefully show some light as of why the worker is not able to connect with the controller.

As this is AWS infra, is there security groups configured in a way that allows the two nodes to properly connect with eachother? I'd start by enabling full allow within the SG where both nodes are in.

odidev commented 2 years ago

I even tried k0s installation on the local servers. The result is the same as on AWS.

‘journalctl -u k0sworker’ command showed as below:

Jul 27 11:39:58 ip-172-31-19-8 k0s[8785]: time="2022-07-27 11:39:58" level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Get \"https://172.31.46.24:6443/api/v1/namespaces/kube-system/configmaps/kubelet-config-default-1.24\": dial tcp 172.31.46.24:6443: i/o timeout"

This means, workers were able to get the controller’s IP, but could not connect to it, or could not read some config information.

I checked that my k0s.yml file already has the public IP of my controller, and now I am working on the local servers, so security group is no longer an issue.

Can you please guide me what next, I can check?

jnummelin commented 2 years ago

So clearly the worker node cannot connect to the controllers IP. Few things I'd check:

is the apiserver on controller listening properly? i.e. can you access the API (e.g. via curl) locally on the controller:
- curl -k https://localhost:6443
- curl -k https://<controller IP>:6443; controller IP being here the one you'd expect workers to connect to
is the token having the expected address? You can decode the token with cat k0s.token | base64 -d | gunzip
can you actually connect from worker to controller? test e.g. with netcat:
- on controller, run netcat -l 4444
- on worker, run curl <controller IP>:4444; you should see some http headers being received on controller

odidev commented 2 years ago

Thank you for the suggestions.

I curled the controller from worker and got the below results:

{ 
  "kind": "Status", 
  "apiVersion": "v1", 
  "metadata": {}, 
  "status": "Failure", 
  "message": "Unauthorized", 
  "reason": "Unauthorized", 
  "code": 401 
}

Connection failed. Also, decoding the join token showed the correct IP address of the controller.

And with the netcat command, again connection timed out while curl, since there was no response on the controller side.

I tried running nginx service on the controller machine on port 80 and curled controller on port 80 (curl http://IP:80) from the worker machine. The connection is successful. I’m not sure why it’s failing with k0s.

I am reading this document and found that we need to configure firewall to outbound access port 6443 and 8132. I did that as follows:

iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 6443 -j ACCEPT 
iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 8132 -j ACCEPT

But nothing really affected.

jnummelin commented 2 years ago

And with the netcat command, again connection timed out while curl, since there was no response on the controller side.

So, from worker, you can curl <controller IP>:6443 but it fails with netcat? Sounds truly bizarre curl works but nothing else works. Possible that there's something like SELinux preventing the connections etc?

till commented 11 months ago

I just ran into a similar problem:

I rebooted all my worker nodes at the same time (to see what would happen in case there is some kind of failure). Each worker is now stuck:

core@node-003 ~ $ sudo journalctl -u k0sworker --follow
Nov 08 14:01:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:01:08 node-003.prod k0s[1583]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Scheduled restart job, restart counter is at 3.
Nov 08 14:03:08 node-003.prod systemd[1]: Stopped k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod k0s[1701]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.

I also tried to run k0sctl apply again, but that complains about the unit being there but not started.

till commented 11 months ago

The join token is empty:

core@node-003 ~ $ sudo cat /etc/k0s/k0stoken 
# overwritten by k0sctl after join

But I am also not sure why it would need the join token after a (simple) reboot?

till commented 11 months ago

Managed to recover with k0sctl apply --force. Not entirely sure why that was necessary though.

k0sproject / k0s