I found part of the problem! I had .spec.additionalUserData set:
```yaml
additionalUserData:
  - name: ps_cloud_init.txt
    type: text/cloud-config
    content: |
      REDACTED
```
Without additionalUserData set, the instances boot but don't join the cluster.
Did you try that with a fresh cluster? What errors do you see during the boot sequence?
I tried it with a fresh cluster. The validate command fails with the following output (tailed):
KIND NAME MESSAGE
Machine 3d8b0b26-5470-42ca-9891-6feccd2a69aa machine "3d8b0b26-5470-42ca-9891-6feccd2a69aa" has not yet joined cluster
Machine 436828e0-2463-4e67-86d2-8f02d37402c9 machine "436828e0-2463-4e67-86d2-8f02d37402c9" has not yet joined cluster
Machine a2985f4d-9817-468d-b052-6d5addf58613 machine "a2985f4d-9817-468d-b052-6d5addf58613" has not yet joined cluster
Machine a7dc858f-1928-4bdb-8717-59067894f05f machine "a7dc858f-1928-4bdb-8717-59067894f05f" has not yet joined cluster
Machine ba9f7f63-b455-4e1a-9586-afca4a10a4e9 machine "ba9f7f63-b455-4e1a-9586-afca4a10a4e9" has not yet joined cluster
Machine d3528f13-474f-437c-bf3f-b7cae5113831 machine "d3528f13-474f-437c-bf3f-b7cae5113831" has not yet joined cluster
Pod kube-system/calico-kube-controllers-59d58646f4-pkbkc system-cluster-critical pod "calico-kube-controllers-59d58646f4-pkbkc" is pending
Pod kube-system/coredns-7cc468f8df-sj9xb system-cluster-critical pod "coredns-7cc468f8df-sj9xb" is pending
Pod kube-system/coredns-autoscaler-5fc98c7959-49754 system-cluster-critical pod "coredns-autoscaler-5fc98c7959-49754" is pending
Pod kube-system/csi-cinder-controllerplugin-56d6db9c57-zf4tc system-cluster-critical pod "csi-cinder-controllerplugin-56d6db9c57-zf4tc" is pending
Pod kube-system/dns-controller-74854cbb7f-qcm74 system-cluster-critical pod "dns-controller-74854cbb7f-qcm74" is pending
Validation Failed
W0509 11:04:25.234828 167 validate_cluster.go:232] (will retry): cluster not yet healthy
Error: validation failed: wait time exceeded during validation
I'm not quite sure what causes the error as the nodes seem fine:
master-az1-1-er3ovv ~ # systemctl --all --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed
master-az1-1-er3ovv ~ # ctr -n k8s.io c ls
CONTAINER IMAGE RUNTIME
07246fe3adda81a248699108803e116b5260b4c7c391679d4e343967f0e25831 registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
12b4eddacab946e874dec0675e80c3e7cd81a52755ada184d4d9b7f9d6bf8330 registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
12e60b40ad10e604b863ae684c41271af426982a03aa97786cd3dafce0b6a6a4 registry.k8s.io/kube-controller-manager@sha256:23a76a71f2b39189680def6edc30787e40a2fe66e29a7272a56b426d9b116229 io.containerd.runc.v2
4688c12cdfcf9366fc8523409115494823a24b4f1ba0ccdb026d1230cef67e27 registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
54a21defe868f2f86bb588b5adb69d673879b06fd33f906b2fa6b558e6a38477 registry.k8s.io/etcdadm/etcd-manager@sha256:5ffb3f7cade4ae1d8c952251abb0c8bdfa8d4d9acb2c364e763328bd6f3d06aa io.containerd.runc.v2
643563a3d3e4ab40fa49b632c144d918e9cad9d94e4bcd5d47e285923060024a registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
678db0d6c86b5b694707dca9d0300d8d2107be82abb4fa36604e5c7799c139dd registry.k8s.io/kube-controller-manager@sha256:23a76a71f2b39189680def6edc30787e40a2fe66e29a7272a56b426d9b116229 io.containerd.runc.v2
83da13e648f1d3b52dadfccb6f05c9cc9d7d28849aefd8797e0b70630daed1ca registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
8bf86f696e1f9cc556100df803fb425217c0216af702d03722b46be078a11b40 registry.k8s.io/kube-apiserver@sha256:c8518e64657ff2b04501099d4d8d9dd402237df86a12f7cc09bf72c080fd9608 io.containerd.runc.v2
8e41f4eaa58fce83da9d6cd8a421efef04df9176d98f9e8f85bc48623fbefccd registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db io.containerd.runc.v2
972810dc74091a0cb8bca9518e5cd401c5e2ba2595780e43cb3a9d9e78dc8fcd registry.k8s.io/etcdadm/etcd-manager@sha256:5ffb3f7cade4ae1d8c952251abb0c8bdfa8d4d9acb2c364e763328bd6f3d06aa io.containerd.runc.v2
af2af2a34bf1a442213495428cb00b35047512f115dec94dad92e776f8a75e06 registry.k8s.io/kube-proxy@sha256:42fe09174a5eb6b8bace3036fe253ed7f06be31d9106211dcc4a09f9fa99c79a io.containerd.runc.v2
c8feaf253772950062b921e4f59369aae6d988940b79fa32da14dc9977681bb0 registry.k8s.io/kops/kube-apiserver-healthcheck@sha256:547c6bf1edc798e64596aa712a5cfd5145df0f380e464437a9313c1f1ae29756 io.containerd.runc.v2
c9dfe8396146b76247b262085a7a701ac5ece72847fb72984d2778cb1d24b28d registry.k8s.io/kube-scheduler@sha256:19712fa46b8277aafd416b75a3a3d90e133f44b8a4dae08e425279085dc29f7e io.containerd.runc.v2
f6f69768c5571fe745d63c7ba0022ed91b010594363e3fb3d1a037ae358e02c5 registry.k8s.io/kube-apiserver@sha256:c8518e64657ff2b04501099d4d8d9dd402237df86a12f7cc09bf72c080fd9608 io.containerd.runc.v2
Kubelet constantly logs the following error:
"Error getting node" err="node \"master-az1-1-er3ovv.novalocal\" not found"
Please let me know if you need more logs or info.
This means that your control plane is up and running. Maybe ssh to a node and look for the kops-configuration.service and kubelet.service logs. Also, this may help: https://kops.sigs.k8s.io/operations/troubleshoot.
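For example, something along these lines once logged in (a sketch; the unit names are the ones mentioned above):
```sh
# on the affected node
journalctl -u kops-configuration.service --no-pager | tail -n 100
journalctl -u kubelet.service --no-pager | tail -n 100
```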
I looked into it and can't find the issue :( I uploaded the logs of the mentioned services: https://gist.github.com/Wieneo/47cddf4dca42e3f8e46b9925b3e37961
Could you try creating the cluster with --dns=none?
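Something like the following (a sketch only; the cluster name and zone mirror values seen elsewhere in this thread, and a real OpenStack deployment will need additional provider-specific flags):
```sh
kops create cluster \
  --cloud=openstack \
  --name=my-cluster.k8s.local \
  --zones=nova \
  --dns=none
```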
@zetaab Any idea what may be wrong here?
No idea, I have not used Flatcar (we are using Ubuntu). I can try it tomorrow.
Creating the cluster with --dns=none doesn't seem to fix the issue.
The goal is to understand why the failure happens. You are the only person with access to the logs. My guess is that if you connect to the control plane, you should see which pods are running and CCM or API server should have some errors with hints.
The issue seems to stem from the fact that Flatcar uses the node's FQDN as the hostname. The node's certificate (and therefore its identity) uses the short name, but the kubelet then registers itself and requests resources under the FQDN, which the node authorizer denies. This leads to errors in kube-apiserver.log like the following:
I0509 13:43:54.264397 11 node_authorizer.go:285] NODE DENY: 'nodes-nova-4fyhog' &authorizer.AttributesRecord{User:(*user.DefaultInfo)(0xc00887b7c0), Verb:"get", Namespace:"", APIGroup:"storage.k8s.io", APIVersion:"v1", Resource:"csinodes", Subresource:"", Name:"nodes-nova-4fyhog.novalocal", ResourceRequest:true, Path:"/apis/storage.k8s.io/v1/csinodes/nodes-nova-4fyhog.novalocal"}
I0509 13:43:55.264465 11 node_authorizer.go:285] NODE DENY: 'nodes-nova-4fyhog' &authorizer.AttributesRecord{User:(*user.DefaultInfo)(0xc008a04ec0), Verb:"get", Namespace:"", APIGroup:"storage.k8s.io", APIVersion:"v1", Resource:"csinodes", Subresource:"", Name:"nodes-nova-4fyhog.novalocal", ResourceRequest:true, Path:"/apis/storage.k8s.io/v1/csinodes/nodes-nova-4fyhog.novalocal"}
I0509 13:43:56.264000 11 node_authorizer.go:285] NODE DENY: 'nodes-nova-4fyhog' &authorizer.AttributesRecord{User:(*user.DefaultInfo)(0xc0088a83c0), Verb:"get", Namespace:"", APIGroup:"storage.k8s.io", APIVersion:"v1", Resource:"csinodes", Subresource:"", Name:"nodes-nova-4fyhog.novalocal", ResourceRequest:true, Path:"/apis/storage.k8s.io/v1/csinodes/nodes-nova-4fyhog.novalocal"}
I0509 13:43:57.265231 11 node_authorizer.go:285] NODE DENY: 'nodes-nova-4fyhog' &authorizer.AttributesRecord{User:(*user.DefaultInfo)(0xc008978700), Verb:"get", Namespace:"", APIGroup:"storage.k8s.io", APIVersion:"v1", Resource:"csinodes", Subresource:"", Name:"nodes-nova-4fyhog.novalocal", ResourceRequest:true, Path:"/apis/storage.k8s.io/v1/csinodes/nodes-nova-4fyhog.novalocal"}
I0509 13:43:58.265137 11 node_authorizer.go:285] NODE DENY: 'nodes-nova-4fyhog' &authorizer.AttributesRecord{User:(*user.DefaultInfo)(0xc008a35d80), Verb:"get", Namespace:"", APIGroup:"storage.k8s.io", APIVersion:"v1", Resource:"csinodes", Subresource:"", Name:"nodes-nova-4fyhog.novalocal", ResourceRequest:true, Path:"/apis/storage.k8s.io/v1/csinodes/nodes-nova-4fyhog.novalocal"}
I accessed the node and ran the following:
hostnamectl set-hostname "nodes-nova-x04eyu"
systemctl restart systemd-networkd
systemctl restart kubelet
After which the node joined:
root@openstack-antelope:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
control-plane-nova-rruwba.novalocal Ready control-plane 6m35s v1.26.3
nodes-nova-x04eyu Ready node 67s v1.26.3
This seems to be an older issue that manifested on AWS as well: https://github.com/flatcar/Flatcar/issues/707
Not sure if this is something that should be fixed in flatcar or kops.
I will open a PR to address this in Flatcar in the following days.
As a side note, kops validate cluster continues to fail with:
root@openstack-antelope:~# kops validate cluster
Using cluster from kubectl context: my-cluster.k8s.local
Validating cluster my-cluster.k8s.local
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
control-plane-nova ControlPlane m1.medium 1 1 nova
nodes-nova Node m1.medium 1 1 nova
NODE STATUS
NAME ROLE READY
nodes-nova-x04eyu node True
VALIDATION ERRORS
KIND NAME MESSAGE
Machine b776c5b9-85e2-423b-afb8-79c5b61883ef machine "b776c5b9-85e2-423b-afb8-79c5b61883ef" has not yet joined cluster
Validation Failed
Error: validation failed: cluster not yet healthy
Even though the control plane node is up and Ready.
Hi folks,
A short update.
A fix for this issue has merged in flatcar and is now available in the nightly builds of the next alpha release. If you want to test it out, you can download it here:
https://bincache.flatcar-linux.net/images/amd64/3602.0.0/flatcar_production_openstack_image.img.bz2
Keep in mind this is not a stable release.
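If it helps, a rough way to get it into OpenStack (a sketch; the Glance image name and properties are assumptions, adjust for your deployment):
```sh
wget https://bincache.flatcar-linux.net/images/amd64/3602.0.0/flatcar_production_openstack_image.img.bz2
bunzip2 flatcar_production_openstack_image.img.bz2
openstack image create \
  --disk-format qcow2 \
  --container-format bare \
  --file flatcar_production_openstack_image.img \
  flatcar-3602.0.0-alpha
```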
Thanks!
Thanks for the update @gabriel-samfira. Any thoughts / info about additional userdata for cloudinit not working?
Flatcar is normally configured using ignition during first boot. To maintain compatibility with cloud-init based environments, it also has its own agent, called coreos-cloudinit, that implements a subset of what cloud-init offers.
The additional userdata feature in kops uses the MIME multipart feature in cloud-init which allows it to add multiple files inside userdata. This particular feature of cloud-init is not implemented in coreos-cloudinit.
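For context, a multipart userdata payload looks roughly like this (an illustrative sketch, not the exact output kops generates):
```
Content-Type: multipart/mixed; boundary="MIMEBOUNDARY"
MIME-Version: 1.0

--MIMEBOUNDARY
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
# part 1: the cloud-config generated by kops itself

--MIMEBOUNDARY
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
# part 2: the file from .spec.additionalUserData

--MIMEBOUNDARY--
```
cloud-init splits the payload at the MIME boundaries and handles each part by Content-Type; without multipart support, coreos-cloudinit only sees a single document and cannot process the envelope above.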
There are two options to get this working: either we implement multipart in coreos-cloudinit, or we add ignition support in kops. Ignition is where most of the development is happening in Flatcar; it's the native way to configure it.
CC: @jepio @pothos
What do you think would be the best path forward?
So far the approach followed in similar efforts like CAPI support was to use Ignition (Fedora CoreOS and other Ignition users will also benefit from that).
At the moment, kOps doesn't have a way to know much about the distro image that is used before booting. It may be possible, but would require updating the implementation of all supported cloud providers. As things stand I see 3 possibilities:
1. Wait for MIME multipart support for Flatcar (eventually someone will contribute this feature if important enough for their use case)
2. Add a way to specify the userdata format (either cloudinit or ignition)
3. Implement MIME multipart for Flatcar (not quite sure how big the effort would be here)
Any thoughts about 2 & 3?
I think we can have both 2 & 3.
The short term solution would be to have MIME multipart support in coreos-cloudinit, but long term we will need to add ignition support to kops, as that is the idiomatic (and in some cases, the only) way to configure distros that use ignition.
I will open a separate issue for adding ignition support in kops.
The immediate issue reported here should be fixed (sans the additionalUserData option) once a stable release of Flatcar is cut with the above mentioned fix. @Wieneo could you test out the image I linked to and confirm it works for you?
A PR was created to add multipart support to coreos-cloudinit here:
Thanks @gabriel-samfira. I appreciate the update.
I tested the newest Flatcar alpha image and kOps bootstrapped the cluster successfully. 👍
Multipart MIME support has been merged in the main branch of Flatcar. This will probably be part of the next alpha release.
This means you'll be able to use additionalUserData when deploying with kops, as long as you only use the subset of cloud-config that coreos-cloudinit currently supports.
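For example, a minimal additionalUserData entry that should stay within that subset (a sketch; the file name and path are made up for illustration, and write_files is assumed to be among the directives coreos-cloudinit supports — check its documentation for the full list):
```yaml
additionalUserData:
  - name: extra.cfg
    type: text/cloud-config
    content: |
      #cloud-config
      write_files:
        - path: /etc/example/hello.conf
          permissions: "0644"
          content: |
            hello=world
```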
Excellent. Thanks a lot @gabriel-samfira!
I encountered a similar issue with Flatcar (5.15.119-flatcar) using kOps 1.27 on OpenStack. The static hostname assigned to the hosts has the .openstack.internal suffix, while the Kubernetes certificates that are created don't include it in the subject name.
So you get errors like this on the worker nodes:
Aug 03 10:16:16 nodes-es1-gaj3jq.openstack.internal kubelet[1475]: I0803 10:16:16.585379 1475 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "nodes-es1-gaj3jq.openstack.internal" is forbidden: User "system:node:nodes-es1-gaj3jq" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
After manually changing the hostname, the node connects to the Cluster without issue.
core@nodes-es1-gaj3jq ~ $ openssl x509 -in /srv/kubernetes/kubelet-server.crt -noout -text
Certificate:
Data:
Version: 3 (0x2)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN = kubernetes-ca
Subject: CN = nodes-es1-gaj3jq
core@nodes-es1-gaj3jq ~ $ hostnamectl
Static hostname: nodes-es1-gaj3jq.openstack.internal
Icon name: computer-vm
After fix:
core@nodes-es1-gaj3jq ~ $ hostnamectl
Static hostname: nodes-es1-gaj3jq
Icon name: computer-vm
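For reference, the mismatch can be spotted by comparing the static hostname with the CN in the kubelet serving certificate (a sketch using the same files shown above):
```sh
hostnamectl --static
openssl x509 -in /srv/kubernetes/kubelet-server.crt -noout -subject
```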
This issue is fixed with the beta flatcar release 3602.0.0
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
From my side this can be closed. The current flatcar stable (3760.2.0) release works.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/kind bug
1. What kops version are you running? The command kops version will display this information.
Client version: 1.26.3

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:11Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
OpenStack

4. What commands did you run? What is the simplest way to reproduce this issue?
-> Timeout

5. What happened after the commands executed?
Validation of the cluster never succeeds as the systemd bootup of the instances fails. A look at the console of the instances reveals that Flatcar's ignition-fetch.service fails to start:

6. What did you expect to happen?
Flatcar boots up normally.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Anything else do we need to know?
I compared the user data generated by kOps and other tools (Gardener) and they appear to be using a completely different format.
kOps:

Gardener: