siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

TalosAPI service endpoint #7196

Closed sergelogvinov closed 1 day ago

sergelogvinov commented 1 year ago

Bug Report

Description

We lost the Endpoints resource of talos.default.svc. This is my first try of Talos v1.4.x, so I do not know when we lost it...

Logs

# talosctl --talosconfig _cfgs/talosconfig --nodes IP dmesg | grep EndpointController 
host: user: warning: [2023-05-09T05:21:45.570901952Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \x5c"https://api.cluster.local:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\x5c": dial tcp 127.0.0.1:6443: connect: connection refused"}
host: user: warning: [2023-05-09T05:22:30.353156952Z]: [talos] updated Talos API endpoints in Kubernetes {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "endpoints": ["172.16.0.50"]}
# kube --kubeconfig=kubeconfig -n default get svc
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
kubernetes   ClusterIP   10.200.0.1     <none>        443/TCP     148m
talos        ClusterIP   10.200.3.131   <none>        50000/TCP   148m
# kube --kubeconfig=kubeconfig -n default get ep           
NAME         ENDPOINTS          AGE
kubernetes   172.16.0.50:6443   148m

Environment

utkuozdemir commented 1 year ago

We never delete the Endpoints resource once it has been created. Are you sure you are targeting the correct cluster?

Are you sure the Endpoints resource ever existed?

Can you please check the machine configs of your control plane nodes to verify that machine.features.kubernetesTalosAPIAccess.enabled is true in all of them?
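A minimal sketch of how that check could be scripted. The helper name `talos_api_access_enabled` is hypothetical, and the match is a crude text scan (it looks for the first `enabled:` line after `kubernetesTalosAPIAccess:`), not a real YAML parse. The commented usage assumes `talosctl get machineconfig -o yaml` is available in your Talos version.

```shell
# Hypothetical helper: reads a machine config (YAML) on stdin and succeeds
# only if the first "enabled:" line after "kubernetesTalosAPIAccess:" is true.
talos_api_access_enabled() {
  awk '
    /kubernetesTalosAPIAccess:/ { in_block = 1; next }
    in_block && /enabled:/      { if ($2 == "true") ok = 1; in_block = 0 }
    END                         { exit !ok }
  '
}

# Usage (run once per control plane node; IP is a placeholder):
#   talosctl --talosconfig _cfgs/talosconfig --nodes IP get machineconfig -o yaml \
#     | talos_api_access_enabled && echo "enabled" || echo "NOT enabled"
```

The function returns a non-zero exit status when the key is missing or set to anything other than `true`, so it can be used in a loop over all control plane nodes.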

Which CNI are you using?

sergelogvinov commented 1 year ago

I have a Terraform setup, and it works well with Talos v1.3.6. I just replaced the base image with v1.4.2 and reinstalled the cluster. After that, Talos CCM does not work: it cannot connect to the Talos API.

So the Kubernetes API has not changed; only the Talos image has changed.
Talos creates the service and updates the endpoint ([talos] updated Talos API endpoints), but something happens to it afterwards.

    features:
        rbac: true
        stableHostname: true
        kubernetesTalosAPIAccess:
            enabled: true
            allowedRoles:
                - os:reader
            allowedKubernetesNamespaces:
                - kube-system
        apidCheckExtKeyUsage: true

utkuozdemir commented 1 year ago

I'm not able to reproduce this. The last log line, [talos] updated Talos API endpoints, is only printed if the Endpoints resource is created, and the resource is never deleted afterwards, even if you disable the feature again.

Please share if you have any further findings. If you could share the exact steps to reproduce, that'd be great.

smira commented 1 year ago

I wonder if that's an inadvertent upgrade of Kubernetes as well, i.e. the Kubernetes version is not set in the machine config?

What is in the kubectl get ... -o yaml output? My understanding of the issue is that the endpoint is no longer attached to the service?
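As a rough sketch of how to tell "the Endpoints object is missing" apart from "it exists but is empty", the hypothetical helper `has_endpoints` below scans the table output of `kubectl get ep` for a row with a given name. It assumes the default table format with a header row; the commented usage reuses the kubeconfig path from the report.

```shell
# Hypothetical helper: reads `kubectl get ep` table output on stdin and
# succeeds if a row whose NAME column equals $1 exists (header row skipped).
has_endpoints() {
  awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Usage: dump the Service as YAML only when its Endpoints object is absent.
#   kubectl --kubeconfig=kubeconfig -n default get ep | has_endpoints talos \
#     || kubectl --kubeconfig=kubeconfig -n default get svc talos -o yaml
```

Against the output pasted in the report, this would succeed for `kubernetes` and fail for `talos`, which matches the described symptom.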

smira commented 1 year ago

@utkuozdemir I talked to Serge today, and it looks like Talos updates the Endpoints resource (or at least logs that it does), but the resource then gets lost. This happens in a cluster with a single control plane node, where the node should, in theory, do this exactly once.

github-actions[bot] commented 4 days ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

sergelogvinov commented 1 day ago

probably it was fixed...