telekom / das-schiff-network-operator

Configure netlink interfaces, simple eBPF filters and FRR using Kubernetes resources.
Apache License 2.0

[Bug] Discovering the node via its HOSTNAME is not practical if the hostname differs from the k8s node name. #41

Closed Cellebyte closed 1 year ago

Cellebyte commented 1 year ago

Description

Error from the operator log:

github.com/telekom/das-schiff-network-operator/pkg/debounce.(*Debouncer).debounceRoutine
    /workspace/pkg/debounce/debounce.go:35
2023-08-28T09:27:48.495Z    ERROR   controller.vrfrouteconfiguration    error debouncing    {"reconciler group": "network.schiff.telekom.de", "reconciler kind": "VRFRouteConfiguration", "name": "bla", "namespace": "", "error": "Node \"customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9.site.location.example.com\" not found"}
$ kubectl get nodes

NAME                                       STATUS   ROLES           AGE     VERSION
customer-1-7jnmh                           Ready    control-plane   2d20h   v1.25.12
customer-1-md-md-0-bb44bbb9fxzlnbj-6vkrl   Ready    worker          2d20h   v1.25.12
customer-1-md-md-0-bb44bbb9fxzlnbj-dxc5f   Ready    worker          2d19h   v1.25.12
customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9   Ready    worker          2d20h   v1.25.12
customer-1-w7k6b                           Ready    control-plane   2d20h   v1.25.12
customer-1-wk5dt                           Ready    control-plane   2d20h   v1.25.12

# login on the node
root@customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9:/# hostname
customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9.site.location.example.com
root@customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9:/# hostname --fqdn
customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9.site.location.example.com

root@customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9:/# hostnamectl status
   Static hostname: customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9.site.location.example.com
   Pretty hostname: localhost.local
         Icon name: computer-server
           Chassis: server
        Machine ID: cd9f27f0e96c4d74a3e4f37404cbc802
           Boot ID: b6139e6f644c4fa4a7a21d41cdbcfa22
  Operating System: Ubuntu 20.04.6 LTS
            Kernel: Linux 5.15.0-79-generic
      Architecture: x86-64

Findings

Currently the operator discovers the node it is running on via the HOSTNAME environment variable. This becomes problematic when the hostname differs from the node name in Kubernetes.
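
To illustrate the failure mode, here is a minimal sketch (not the operator's actual code) of what a HOSTNAME-based lookup looks like with client-go. On the node above it fails because the API server only knows the short name:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// On the node above this resolves to the FQDN:
	// customer-1-md-md-0-bb44bbb9fxzlnbj-lvpd9.site.location.example.com
	hostname := os.Getenv("HOSTNAME")

	// The Node object is registered under the short name, so this lookup
	// fails with the NotFound error seen in the operator log.
	if _, err := clientset.CoreV1().Nodes().Get(context.TODO(), hostname, metav1.GetOptions{}); err != nil {
		fmt.Println(err)
	}
}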

Calico uses a fieldRef into the pod spec to populate an environment variable called NODENAME.

Suggestion

Use something like this in the network operator's DaemonSet manifest to ensure the environment variable is always set correctly for the reconciliation loops.

        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
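
On the operator side, reading the variable could then look like this hedged sketch (the helper name is ours, not the operator's actual code); falling back to os.Hostname() preserves the current behaviour when the variable is unset:

package nodename

import "os"

// NodeName returns the name of the Kubernetes node this pod runs on.
// It prefers the NODENAME variable injected via the downward API and
// falls back to the OS hostname, which is the pre-fix behaviour.
// (Hypothetical helper for illustration only.)
func NodeName() (string, error) {
	if name := os.Getenv("NODENAME"); name != "" {
		return name, nil
	}
	return os.Hostname()
}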
schrej commented 1 year ago

This will be added as part of #35. We'll just need to use it for this as well.

p-strusiewiczsurmacki-mobica commented 1 year ago

@schrej do we want to add this fix in #35 or should it be a separate PR?

schrej commented 1 year ago

Considering it should only be a small change, I'd say just add it to #35 to avoid more PRs that need to be rebased constantly.

p-strusiewiczsurmacki-mobica commented 1 year ago

Just pushed the changes. Changed that here and here

However, I need to ask as I am not sure: should the k8s node name also be used in the frr package instead of the os.Hostname() call?

Cellebyte commented 1 year ago

However, I need to ask as I am not sure: should the k8s node name also be used in the frr package instead of the os.Hostname() call?

I think using the nodename env variable here would make it consistent.

The FRR docs also state that it is not written to disk.

It should also do no harm from the perspective of our specific setup, as we already set it to the node name everywhere:

      joinConfiguration:
        discovery: {}
        nodeRegistration:
          criSocket: /var/run/containerd/containerd.sock
          kubeletExtraArgs:
            container-log-max-files: '5'
            container-log-max-size: 10Mi
            event-qps: '0'
            node-ip: '{{ ds.meta_data.cluster_ip_0 }}'
            node-labels: "metal3.io/uuid={{ ds.meta_data.uuid }},schiff.telekom.de/bmh={{ ds.meta_data.bmh_name }},schiff.telekom.de/metal3machine={{ ds.meta_data.metal3machine_name }}"
            protect-kernel-defaults: 'true'
            tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
          name: '{{ ds.meta_data.machine_name }}' # <- we set the nodename here
      files:
        # and also here.
        - owner: root:root
          path: /etc/frr/frr.conf
          permissions: '0644'
          content: |
            frr version 8.0.1
            frr defaults traditional
            hostname {{ ds.meta_data.machine_name }}
p-strusiewiczsurmacki-mobica commented 1 year ago

@Cellebyte I changed os.Hostname() to os.Getenv() to get the node name in pkg/frr/configure.go.
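
For reference, a minimal sketch of what that change amounts to (simplified, not the actual code in pkg/frr/configure.go; the function name is ours, and we assume the variable name from the suggestion above):

package frr

import "os"

// configHostname returns the hostname written into the generated FRR
// configuration. Previously this came from os.Hostname(); reading the
// environment variable instead keeps the FRR hostname consistent with
// the Kubernetes node name. (Sketch only; variable name assumed.)
func configHostname() string {
	return os.Getenv("NODENAME")
}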