siderolabs / talos-cloud-controller-manager

Generic cloud controller manager for hybrid deployments using Talos OS

Missing node labels #61

Closed: alexandrem closed this issue 2 weeks ago

alexandrem commented 1 year ago

Bug Report

Some nodes are missing their zone and instance-type labels.

Description

On an OpenStack cluster where I installed the CCM, one node gets properly labelled, while the others only receive the platform=openstack label.

k get nodes -o wide --show-labels                                       
NAME               STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME     LABELS
alex-cp-1          Ready    control-plane   2d16h   v1.26.3   10.173.75.153   <none>        Talos (v1.3.7)   5.15.106-talos   containerd://1.6.20   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=alex-cp-1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.cloudprovider.kubernetes.io/platform=openstack
alex-worker-os-1   Ready    <none>          2d16h   v1.26.3   10.173.75.141   <none>        Talos (v1.3.7)   5.15.106-talos   containerd://1.6.20   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=alex-worker-os-1,kubernetes.io/os=linux,node.cloudprovider.kubernetes.io/platform=openstack
alex-worker-os-2   Ready    <none>          2d16h   v1.26.3   10.173.75.140   <none>        Talos (v1.3.7)   5.15.106-talos   containerd://1.6.20   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=alex-worker-os-2,kubernetes.io/os=linux,node.cloudprovider.kubernetes.io/platform=openstack
alex-worker-os-3   Ready    <none>          2d14h   v1.26.3   10.173.75.143   <none>        Talos (v1.3.7)   5.15.106-talos   containerd://1.6.20   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=ug4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=z03,kubernetes.io/arch=amd64,kubernetes.io/hostname=alex-worker-os-3,kubernetes.io/os=linux,node.cloudprovider.kubernetes.io/platform=openstack,node.kubernetes.io/instance-type=ug4.large,topology.kubernetes.io/zone=z03

I tried recreating the os-1 node and restarting the kubelet and the CCM pod, but the labelling doesn't kick in.

The CCM logs look okay.

sergelogvinov commented 1 year ago

Hello, can you check the metadata resource on both nodes:

talosctl --nodes 10.173.75.141 get PlatformMetadatas.talos.dev -oyaml
talosctl --nodes 10.173.75.143 get PlatformMetadatas.talos.dev -oyaml

Thanks.

alexandrem commented 1 year ago

Here:

$ talosctl -n 10.173.75.141 get PlatformMetadatas.talos.dev -oyaml
node: 10.173.75.141
metadata:
    namespace: runtime
    type: PlatformMetadatas.talos.dev
    id: platformmetadata
    version: 1
    owner: network.PlatformConfigController
    phase: running
    created: 2023-06-12T12:50:31Z
    updated: 2023-06-12T12:50:31Z
spec:
    platform: openstack
    hostname: alex-worker-os-1.novalocal
    zone: z03
    instanceType: ug4.large
    instanceId: 8fa473df-fce8-4434-9f96-0a753c922782
    providerId: openstack:///8fa473df-fce8-4434-9f96-0a753c922782
$ talosctl -n 10.173.75.143 get PlatformMetadatas.talos.dev -oyaml
node: 10.173.75.143
metadata:
    namespace: runtime
    type: PlatformMetadatas.talos.dev
    id: platformmetadata
    version: 1
    owner: network.PlatformConfigController
    phase: running
    created: 2023-06-09T22:38:37Z
    updated: 2023-06-09T22:38:37Z
spec:
    platform: openstack
    hostname: alex-worker-os-3.novalocal
    zone: z03
    instanceType: ug4.large
    instanceId: b558ce7e-9725-49b4-a9ee-5db61e9690f4
    providerId: openstack:///b558ce7e-9725-49b4-a9ee-5db61e9690f4

sergelogvinov commented 1 year ago

Good. Do you have only one CCM in the cluster, or do you use Talos CCM together with the OpenStack CCM?
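
For reference, one quick way to list any cloud controller managers that are running (assuming, as is typical, that they are deployed in kube-system):

kubectl -n kube-system get pods -o wide | grep -i cloud-controller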

alexandrem commented 1 year ago

I only have the Talos CCM, nothing else.

sergelogvinov commented 1 year ago

OK, can you set the CCM log verbosity to -v=4 via the helm value:

logVerbosityLevel: 4

Then restart the CCM, delete the node resource, and reboot the deleted instance. It will register again and you will see more detail from Talos CCM; I think that will help us understand the issue.
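
For reference, that sequence might look roughly like this; the node name and IP come from the outputs above, and the CCM deployment name and namespace are assumptions that may differ in your install:

# restart the CCM so the new verbosity takes effect
kubectl -n kube-system rollout restart deployment/talos-cloud-controller-manager

# delete the stale node object
kubectl delete node alex-worker-os-1

# reboot the instance so the kubelet re-registers it
talosctl --nodes 10.173.75.141 reboot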

Thanks.

alexandrem commented 1 year ago

Deleting the k8s node objects and restarting the kubelet service on each Talos worker node solved the problem. The new node resources have all the labels.
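
For anyone hitting the same issue, a sketch of that workaround (node names and IPs are the ones from this cluster, so adjust to yours):

# remove the stale node objects
kubectl delete node alex-worker-os-1 alex-worker-os-2

# restart the kubelet on each affected Talos node so it re-registers
talosctl --nodes 10.173.75.141,10.173.75.140 service kubelet restart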

alexandrem commented 1 year ago

I'm not sure yet, but I seem to be hitting a race condition in the bootstrapping of the K8s cluster when it's configured with an external cloud provider and the Talos CCM.

In one recent case, I've observed that the coredns pods stay pending because they lack the toleration for the node.cloudprovider.kubernetes.io/uninitialized taint. That breaks DNS resolution for the Talos CCM (which is strange, because the metadata service is normally at 169.254.169.254, so I'm not sure what else it needs to resolve before that), and the bootstrap gets stuck.

This can be unblocked by first ensuring that internal DNS resolution via coredns works. In my case that means manually adding a toleration for the taint above to the coredns deployment so it can run on a control-plane node while the external cloud provider is configured, then possibly deleting all the registered node resources and restarting the kubelet on each node; a sketch of the patch follows.
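
A sketch of that toleration patch; the coredns deployment name and the kube-system namespace are the kubeadm defaults, so verify them in your cluster:

kubectl -n kube-system patch deployment coredns --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "node.cloudprovider.kubernetes.io/uninitialized",
             "operator": "Exists", "effect": "NoSchedule"}}
]'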

Once that initial chicken-and-egg problem is resolved, additional worker nodes that register to the cluster get labeled properly.

alexandrem commented 1 year ago

Looking at the code, I now understand what's going on.

It's Talos Linux itself that calls the OpenStack metadata service to query the machine information, which it then stores as a PlatformMetadata COSI resource. This CCM project implements the k8s cloud-provider InstanceMetadata interface: it queries the Talos API to fetch that resource and proceeds with the k8s node labeling.

The DNS resolution happens because of the COSI client call at https://github.com/siderolabs/talos-cloud-controller-manager/blob/v1.4.0/pkg/talos/client.go#L53, which by default attempts to resolve the talos.default k8s service injected into the default namespace by the kubernetesTalosAPIAccess machine config feature.
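
For context, that feature is enabled in the Talos machine config roughly like this (illustrative values; check the Talos docs for the exact schema of your version):

machine:
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader
      allowedKubernetesNamespaces:
        - kube-system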

sergelogvinov commented 1 year ago

> I'm not sure yet, but I seem to be hitting a race condition in the bootstrapping of the K8s cluster when it's configured with an external cloud provider and the Talos CCM. [...] Once that initial chicken-and-egg problem is resolved, additional worker nodes that register to the cluster get labeled properly.

Yep, it was fixed in https://github.com/siderolabs/talos/pull/6938. You can also use DaemonSet mode, where the CCM can work without DNS (check the helm chart): https://github.com/siderolabs/talos-cloud-controller-manager/blob/75a8e44b137d06c80651286568a275b102502048/charts/talos-cloud-controller-manager/values.yaml#L108
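
A sketch of enabling that mode through the helm values; the daemonSet.enabled key is taken from the linked values.yaml, but verify it against your chart version:

# values override for the talos-cloud-controller-manager chart
daemonSet:
  enabled: true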

alexandrem commented 1 year ago

OK great, so the fix was added in the v1.4.0 release, which is why I hit the problem on v1.3.7.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 14 days with no activity.