sebthom opened 2 years ago
Hi Team,
I am also facing this issue while executing "k0s start" on the worker node. The "k0s stop" and "k0s delete" commands show the same error after executing "k0s install worker" on the worker node, so I am unable to stop or reset the k0s service.
Can you please provide some pointers?
The error message stems from the fact that the k0s systemd service is in a failed state. The k0s start command looks up the installed service and checks its status to verify that it's actually installed. That's where the "failed" error slips through. k0s should probably treat that case a bit differently in the start/stop/delete subcommands.
Can you try to start k0s manually via systemctl? Does it work then?
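For example, something along these lines (assuming the worker role, so the unit name is k0sworker):
sudo systemctl restart k0sworker
sudo systemctl status k0sworker --no-pager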
Apparently in my case an extra new line char was added accidentally to the token file on the worker nodes which prevented them from joining the controller. After I fixed this it works now.
Thanks for the feedback. Even if the root cause in your case was some configuration error, I'll be reopening this, since there's definitely something to be improved here on k0s side.
What if we check the service state in k0s start, and if we detect it failing, we print that info out to the user with some help context, like "Service failed to start, check the logs with journalctl ..."?
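Roughly the shell equivalent of what I mean, just as a sketch (not the actual k0s code, and assuming the worker unit name k0sworker):
if systemctl is-failed --quiet k0sworker; then
  echo "Service failed to start, check the logs with: journalctl -u k0sworker"
fi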
@twz123 , thank you for reopening the ticket.
As @sebthom said, I checked my token file and there was no extra character added. Yet, I am still getting the same issue. FYI, I am working on an AWS x64 Ubuntu instance as the controller node and another AWS x64 Ubuntu instance as the worker node.
Also, as mentioned above, I checked the systemctl list on the worker node, the k0sworker.service was found failing. I restarted the service manually using systemctl, but that did not resolve the issue.
Also, sudo k0s kubectl get nodes shows "no resources found" on both nodes - controller and worker.
k0sworker.service was found failing
As the service is failing, the logs probably contain some hints why it fails. Check with journalctl -u k0sworker.service ...
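For example, to see the most recent entries without paging:
sudo journalctl -u k0sworker.service --no-pager --since "30 min ago"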
Hi Team, I again followed the manual installation for k0s.
I successfully created the controller node on an AWS Ubuntu instance and created another worker node on another AWS instance, using the join token created in the controller node.
FYI: I needed to include the '--enable-worker' flag along with the k0s install controller command.
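For reference, the documented flow I followed looks roughly like this (the token file path is just an example):
sudo k0s install controller --enable-worker   # on the controller
sudo k0s start
sudo k0s token create --role=worker > join-token   # on the controller, then copy the token to the worker
sudo k0s install worker --token-file /path/to/join-token   # on the worker
sudo k0s start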
When I execute k0s kubectl get nodes on the controller node, the output below is received:
NAME STATUS ROLES AGE VERSION
ip-172-31-5-166.us-east-2.compute.internal Ready control-plane 2m52s v1.24.2+k0s
This is OK as per the documentation.
But my understanding is that the k0s kubectl get nodes command should also display information about the worker node, since the worker node was successfully created using the join token.
May I know, am I correct in my understanding, or is the above output the expected behavior?
May I know, am I correct in my understanding, or is the above output the expected behavior?
If you have another instance running with k0s worker, it should appear in the node list. So something is off on the pure worker node.
I would advise checking the logs on the worker node using something like sudo journalctl -u k0sworker. That should hopefully shed some light on why the worker is not able to connect with the controller.
As this is AWS infra, are the security groups configured in a way that allows the two nodes to properly connect with each other? I'd start by enabling full allow within the SG that both nodes are in.
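One quick way to check from the worker that the ports k0s needs are reachable (at least 6443 for the Kubernetes API and 8132 for konnectivity) is something like this, exact flags depending on which netcat variant is installed:
nc -zv -w 5 <controller IP> 6443
nc -zv -w 5 <controller IP> 8132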
I even tried k0s installation on the local servers. The result is the same as on AWS.
The journalctl -u k0sworker command showed the following:
Jul 27 11:39:58 ip-172-31-19-8 k0s[8785]: time="2022-07-27 11:39:58" level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Get \"https://172.31.46.24:6443/api/v1/namespaces/kube-system/configmaps/kubelet-config-default-1.24\": dial tcp 172.31.46.24:6443: i/o timeout"
This means the workers were able to get the controller's IP but could not connect to it, or could not read some config information.
I checked that my k0s.yml file already has the public IP of my controller, and since I am now working on local servers, the security group is no longer an issue.
Can you please guide me on what I can check next?
So clearly the worker node cannot connect to the controller's IP. A few things I'd check:
curl -k https://localhost:6443 (on the controller)
curl -k https://<controller IP>:6443 (from the worker); the controller IP being here the one you'd expect workers to connect to
cat k0s.token | base64 -d | gunzip (to see which controller address the join token actually points at)
netcat -l 4444 (on the controller), then curl <controller IP>:4444 (from the worker); you should see some HTTP headers being received on the controller
Thank you for the suggestions.
I curled the controller from the worker and got the results below:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
}
Connection failed. Also, decoding the join token showed the correct IP address of the controller.
And with the netcat command, the connection again timed out during the curl, since there was no response on the controller side.
I tried running the nginx service on the controller machine on port 80 and curled the controller on port 80 (curl http://IP:80) from the worker machine. The connection was successful. I'm not sure why it's failing with k0s.
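To rule out the API server not listening at all, I can also check directly on the controller, e.g.:
sudo ss -tlnp | grep 6443   # should show a process listening on port 6443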
I was reading this document and found that we need to configure the firewall to allow outbound access to ports 6443 and 8132. I did that as follows:
iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 6443 -j ACCEPT
iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 8132 -j ACCEPT
But that did not really change anything.
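Since these rules are on the worker's OUTPUT chain, I am also going to check the controller's INPUT chain, in case the timeout is caused by the controller's own firewall:
sudo iptables -L INPUT -n --line-numbers   # on the controller
sudo iptables -I INPUT -p tcp --dport 6443 -j ACCEPT   # on the controller, if 6443 is being dropped
sudo iptables -I INPUT -p tcp --dport 8132 -j ACCEPT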
And with the netcat command, the connection again timed out during the curl, since there was no response on the controller side.
So, from the worker, you can curl <controller IP>:6443 but it fails with netcat? Sounds truly bizarre that curl works but nothing else does. Possibly there's something like SELinux preventing the connections?
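For example, to quickly check whether SELinux could be interfering (if getenforce isn't even installed, SELinux is likely not the issue):
getenforce            # prints Enforcing / Permissive / Disabled when SELinux is present
sudo setenforce 0     # temporarily switch to permissive, then retry the curl/netcat test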
I just ran into a similar problem:
I rebooted all my worker nodes at the same time (to see what would happen in case there is some kind of failure). Each worker is now stuck:
core@node-003 ~ $ sudo journalctl -u k0sworker --follow
Nov 08 14:01:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:01:08 node-003.prod k0s[1583]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Scheduled restart job, restart counter is at 3.
Nov 08 14:03:08 node-003.prod systemd[1]: Stopped k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod k0s[1701]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.
I also tried to run k0sctl apply again, but that complains about the unit being there but not started.
The join token is empty:
core@node-003 ~ $ sudo cat /etc/k0s/k0stoken
# overwritten by k0sctl after join
But I am also not sure why it would need the join token after a (simple) reboot?
Managed to recover with k0sctl apply --force. Not entirely sure why that was necessary though.
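If this happens again, I'll first check whether the worker still has its state from the initial join before forcing a re-apply (assuming the default k0s data directory /var/lib/k0s; the exact file names are just a guess on my side):
sudo ls -la /var/lib/k0s/        # a previously joined worker should still have its kubelet config/kubeconfig here
sudo cat /etc/k0s/k0stoken       # the token file k0sctl wrote, as shown above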
Version
v1.23.5+k0s.0
Platform
What happened?
I followed the multi node setup described at https://docs.k0sproject.io/v1.23.5+k0s.0/k0s-multi-node/
Setup of the controller node worked as described. However, when trying to start a worker node, the following error message appears without any further information:
Steps to reproduce
Expected behavior
The worker node gets federated into the controller and the k0s service on the worker node starts without failures.
Actual behavior
Starting k0s on the worker node fails with Error: service in failed state
Screenshots and logs
No response
Additional context