siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Failing to bootstrap etcd on Proxmox VM due to networking issues. #7657

Closed erickaby closed 11 months ago

erickaby commented 11 months ago

Bug Report

Talos v1.5.1 failing to bootstrap etcd on Proxmox VM due to networking issues.

Description

Hi, I am setting up Talos in my homelab and recently changed my local network from 192.168.0.1/16 to 10.0.0.1/8. Since then I have had trouble getting Talos past the booting stage after installation. I am not convinced it is a Talos issue, but since I can get past maintenance mode and reach the VM, I want to rule out Talos before reverting and resetting my local network. So I have gone back to basics and followed the Proxmox guide. My setup isn't unique: a default Proxmox install on my local home network, about as basic as it gets. I have tried both with and without DHCP; both let me interact with the VM through talosctl before and during the booting state. The only problem is that I need to add the VM's IP address to machine.certSANs; if I don't, I get the error below. I've scoured the logs in support.zip for anything useful to keep debugging, but here I am.

I have attached the support.zip below; I also enabled debug: true.

Logs

# Error if IP address is not in machine.certSANs
$ talosctl -n 10.0.0.128 service
error listing services: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.128 because it doesn't contain any IP SANs"
$ talosctl support --nodes 10.0.0.128
   2s [>-----------------]  11% 10.0.0.128: getting system/dashboard service logs
   2s [==================] 100% 10.0.0.128: getting talos resource secrets/TrustdCertificates.secrets.talos.dev
Processed with errors:
   SOURCE    ERROR
   cluster   Get "https://10.0.0.128:6443/api/v1/namespaces/kube-system": dial tcp 10.0.0.128:6443: connect: connection refused
   cluster   failed to get kubernetesResources/nodes.yaml: Get "https://10.0.0.128:6443/api/v1/nodes": dial tcp 10.0.0.128:6443: connect: connection refused, skipped
   cluster   failed to get kubernetesResources/systemPods.yaml: Get "https://10.0.0.128:6443/api/v1/namespaces/kube-system/pods": dial tcp 10.0.0.128:6443: connect: connection refused, skipped
Support bundle is written to support.zip

Full project with the configuration files, secrets and support.zip inside (this is throwaway project) project.zip

Environment

frezbo commented 11 months ago

The fact that the support command worked means there's no error in talking to the Talos API. I guess Talos was still regenerating certs when the service command was run.

smira commented 11 months ago

Usually the reason might be that Kubernetes Pod/Service CIDRs overlap with the machine IPs. Talos API won't issue a cert for an address which is within pod/service CIDR range.

Just in case Talos defaults are:

        podSubnets:
            - 10.244.0.0/16
        # The service subnet CIDR.
        serviceSubnets:
            - 10.96.0.0/12

Your IP seems to be different though.

smira commented 11 months ago

Actually, this is a bug in Talos: your 10.0.0.128/8 address overlaps with the pod/service CIDRs if taken as a subnet (10.0.0.0/8), but as a single address it isn't contained in either of them.

As an interim fix, you can change the pod/service subnets to something from e.g. the 192.168.x.x space, but we'll get this fixed.

erickaby commented 11 months ago

Thanks for the feedback, that does make sense. I didn't think to try changing those CIDR ranges; I'll be testing your solution shortly.

erickaby commented 11 months ago

Usually the reason might be that Kubernetes Pod/Service CIDRs overlap with the machine IPs. Talos API won't issue a cert for an address which is within pod/service CIDR range.

Just in case Talos defaults are:

        podSubnets:
            - 10.244.0.0/16
        # The service subnet CIDR.
        serviceSubnets:
            - 10.96.0.0/12

Your IP seems to be different though.

That piece of information is very useful. Is it already in the docs and I missed it? It would be nice to have it under troubleshooting the control plane, since that is where I expected to find help with the cert issue.