siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Talos 0.14.2 - Cannot consistently successfully deploy more than a single control plane node #5031

Closed by danmanners 1 year ago

danmanners commented 2 years ago

Bug Report

While provisioning a cluster on Proxmox v7.X, the first control plane node deploys successfully, but additional control plane or worker nodes almost never connect. They eventually kernel panic and require a hard reset, after which the connection attempt fails again.

Description

Only a single master ever provisions successfully, and all other nodes refuse to connect.

I am deploying all of this on a Proxmox Cluster, and I have successfully deployed Talos nearly identically in the past (without KubeSpan) on version 0.13.X. I'm quite stumped at this point, and I'm trying to identify if I'm actually encountering an unexpected behavior, or if my configs are incorrect.

The behavior is inconsistent, though. On occasion, all three control plane nodes connect to each other and then kernel panic or crash anyway. I've re-deployed the VMs several times following the docs here, to no avail.

Proxmox Screenshot

image

Control Plane Node Crashing Screenshot

image

Logs - Working Control Plane

...
[  790.182302] [talos] boot sequence: done: 13m4.563542174s
[  995.332434] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[  995.338790] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "Crwl9hD0VYZcrU3t3kiXek2Bdz9+GDm7us7+OEP2bGo=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.10-10.200.0.10", "10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[  995.345317] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "mvemvwh7HlnBcqgZ4DTlA5azzz4k+Du9GUmvTpEKFGc=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1010.054517] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1010.058945] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "Crwl9hD0VYZcrU3t3kiXek2Bdz9+GDm7us7+OEP2bGo=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.10-10.200.0.10", "10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1010.064047] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "mvemvwh7HlnBcqgZ4DTlA5azzz4k+Du9GUmvTpEKFGc=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1120.318568] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1120.326671] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "Crwl9hD0VYZcrU3t3kiXek2Bdz9+GDm7us7+OEP2bGo=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.10-10.200.0.10", "10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1120.334032] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "mvemvwh7HlnBcqgZ4DTlA5azzz4k+Du9GUmvTpEKFGc=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1120.340243] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1120.344719] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "Crwl9hD0VYZcrU3t3kiXek2Bdz9+GDm7us7+OEP2bGo=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.10-10.200.0.10", "10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1120.349445] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "mvemvwh7HlnBcqgZ4DTlA5azzz4k+Du9GUmvTpEKFGc=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1120.353861] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 5}
[ 1180.317937] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1180.323549] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1180.329297] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 2}
[ 1782.127429] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1782.133832] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "dxFbh8dq8NHpZX9dwOLrsGvq/MWq8POeB7b+K1foiRw=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1782.138063] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[ 1782.140489] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 2}
[ 1797.843734] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1797.850747] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "dxFbh8dq8NHpZX9dwOLrsGvq/MWq8POeB7b+K1foiRw=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1904.006939] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1904.013925] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "dxFbh8dq8NHpZX9dwOLrsGvq/MWq8POeB7b+K1foiRw=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 1904.021010] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[ 1916.994621] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 1916.999011] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "dxFbh8dq8NHpZX9dwOLrsGvq/MWq8POeB7b+K1foiRw=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 2246.594590] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"], "other_ips": ["10.200.0.251-10.200.0.251", "fd52:68bf:5286:c102:582f:57ff:fe66:145b-fd52:68bf:5286:c102:582f:57ff:fe66:145b"]}
[ 2246.600274] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "dxFbh8dq8NHpZX9dwOLrsGvq/MWq8POeB7b+K1foiRw=", "ignored_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"], "other_ips": ["10.200.0.247-10.200.0.247", "fd52:68bf:5286:c102:448b:abff:feb4:e6ee-fd52:68bf:5286:c102:448b:abff:feb4:e6ee"]}
[ 2380.306102] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[ 7540.248453] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 0}
[42387.770057] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 1}
[42393.916274] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 2}

Logs - Broken Worker

...
[    8.457314] [talos] phase startEverything (16/18): done, 1.193476866s
[    8.458546] [talos] phase uncordon (17/18): 1 tasks(s)
[    8.459574] [talos] task uncordonNode (1/1): starting
[    8.460869] [talos] retrying error: invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-c]
[   10.301157] [talos] service[kubelet](Running): Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
[   15.298126] [talos] service[kubelet](Running): Health check successful
[   35.693552] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[   65.690449] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[   95.689310] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  125.689697] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  140.034088] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  155.688853] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  185.688489] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  215.688896] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  245.688700] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  275.688078] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  305.687882] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "network.LinkSpecController", "link": "kubespan", "peers": 3}
[  308.440588] [talos] task uncordonNode (1/1): failed: 2 error(s) occurred:
[  308.441626]  invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no s]
[  308.445772]  timeout
[  308.446182] [talos] phase uncordon (17/18): failed
[  308.446835] [talos] boot sequence: failed
[  308.447459] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[  308.448694] [talos] service[machined](Finished): Service finished successfully
[  308.449822] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 1233, container kubelet)
[  308.451301] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 1187, container apid)
[  308.461451] [talos] service[udevd](Finished): Service finished successfully
[  308.479099] [talos] service[apid](Finished): Service finished successfully
[  308.480242] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" )
[  308.484245] [talos] service[containerd](Finished): Service finished successfully
[  308.582427] [talos] service[kubelet](Finished): Service finished successfully
[  308.583400] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[  308.610578] [talos] service[cri](Finished): Service finished successfully
[  308.611682] [talos] error running phase 17 in boot sequence: task 1/1: failed, 2 error(s) occurred:
[  308.613008]  invalid configuration: [unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no s]
[  308.617969]  timeout
[  308.618414] [talos] controller runtime finished
[  308.630642] [talos] panic=0 kernel flag found, sleeping forever
[  308.633573] [talos] unmounted / (/dev/loop0)
[  308.634362] XFS (sda5): Unmounting Filesystem
[  308.637081] [talos] unmounted /system/state (/dev/sda5)
[  308.638062] [talos] unmounted /var (/dev/sda6)
[  308.638799] [talos] unmounted /etc/cri/containerd.toml (/dev/sda6)
[  308.639757] [talos] unmounted /system/libexec/apid/apid (/dev/loop0)
[  308.640736] [talos] waiting for sync...
[  308.641560] [talos] sync done

Environment

Sanitized Control Plane Config - FUNCTIONAL

version: v1alpha1
debug: false
persist: true
machine:
  type: controlplane
  token: Fake.ControlPlaneToken
  ca:
    crt: CertGoesHere
    key: CertPrivateKeyGoesHere
  network:
    hostname: talos-cp01
    kubespan:
      enabled: true
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 10.200.0.10
    nameservers:
      - 10.200.0.1
      - 10.45.0.2
  install:
    disk: /dev/sda
    image: ghcr.io/talos-systems/installer:v0.14.2
    bootloader: true
    wipe: false
  features:
    rbac: true
cluster:
  id: NotTheRealID
  secret: NotTheRealSecret
  controlPlane:
    endpoint: https://homelab.kube.danmanners.io:6443
  clusterName: homelab
  network:
    cni:
      name: none
    dnsDomain: cluster.local
    podSubnets:
      - 10.244.0.0/16
    serviceSubnets:
      - 10.96.0.0/12
  token: Fake.TokenGoesHere
  aescbcEncryptionSecret: EncryptedSecretHere
  ca:
    crt: CACERT
    key: CACERT_KEY
  aggregatorCA:
    crt: AggregatorCaCertGoesHere
    key: AggregatorCaCertPrivateKeyGoesHere
  serviceAccount:
    key: ServiceAccountKeyGoesHere
  apiServer:
    certSANs:
      - homelab.kube.danmanners.io
      - kube.danmanners.io
  discovery:
    enabled: true
  etcd:
    ca:
      crt: EtcdCertGoesHere
      key: EtcdCertPrivateKeyGoesHere
      subnet: 10.200.0.0/24

Sanitized Worker Node Config - NON-FUNCTIONAL

version: v1alpha1
debug: false
persist: true
machine:
  type: worker
  token: NotTheRealTokenOkay
  ca:
    crt: NotTheRealCert
    key: ""
  network:
    hostname: talos-worker01
    kubespan:
      enabled: true
    interfaces:
      - interface: eth0
        dhcp: true
    nameservers:
      - 10.200.0.1
      - 10.45.0.2
  install:
    disk: /dev/sda
    image: ghcr.io/talos-systems/installer:v0.14.2
    bootloader: true
    wipe: false
  features:
    rbac: true
cluster:
  id: NotTheRealID
  secret: FakeSecret
  controlPlane:
    endpoint: https://homelab.kube.danmanners.io:6443
  network:
    dnsDomain: cluster.local
    podSubnets:
      - 10.244.0.0/16
    serviceSubnets:
      - 10.96.0.0/12
  token: Fake.Token
  aescbcEncryptionSecret: ""
  ca:
    crt: CACert
    key: ""
  discovery:
    enabled: true

Additional Notes

Happy to provide any and all other requested information.

smira commented 2 years ago

@danmanners looking at the logs, if you're re-using machine configuration across different clusters, this confuses cluster discovery and KubeSpan. It is important to use fresh secrets (.cluster.id) for each cluster, as the Discovery Service might keep stale discovery data (from the previous cluster) for up to 30 minutes.

danmanners commented 2 years ago

Hey @smira, thanks for the reply. Unless I'm doing something incorrectly (which is entirely and highly possible), this is all for a single cluster in my homelab. The reason I'm attempting to use KubeSpan is because I will be spinning up nodes in AWS and Azure, and wanted to evaluate how well KubeSpan works for that use case. If that's the incorrect way to do things, I can try this again with KubeSpan disabled.

Thanks!

smira commented 2 years ago

@danmanners I'm talking about the KubeSpan warning in the logs about node IP overlap. So there are two possibilities:

If you actually have duplicate node IPs, which I doubt, that is not supported with KubeSpan.

For the stale discovery data, see my comment above. If you generate a machine config, spin up a cluster, destroy it, and then spin up another cluster from the same machine config, the Discovery Service will keep data from the previous cluster for up to 30 minutes, and that might lead to the error above. The workaround is to generate a fresh machine config for each cluster you deploy (even if the previous one has been destroyed).
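
One way to see which peers collide in the overlap warnings is to group them by address. The following is a minimal Python sketch, assuming the console log format shown in the bug report. The function name `summarize_overlaps` and the abbreviated `SAMPLE` lines (IPv6 ranges dropped for brevity) are illustrative and not part of Talos.

```python
import json
import re
from collections import defaultdict

# Matches a console line such as:
# [  995.332434] [talos] peer address overlap {...JSON fields...}
LINE_RE = re.compile(
    r"^\[\s*([\d.]+)\]\s+\[talos\]\s+peer address overlap\s+(\{.*\})\s*$"
)


def summarize_overlaps(lines):
    """Map each contested IP range to the set of public keys claiming it."""
    claims = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not an overlap warning
        fields = json.loads(match.group(2))
        # Only ranges present on both sides are actually contested.
        shared = set(fields["ignored_ips"]) & set(fields["other_ips"])
        for ip_range in shared:
            claims[ip_range].add(fields["ignored_peer"])
            claims[ip_range].add(fields["other_peer"])
    return dict(claims)


# Abbreviated copies of three warnings from the working control plane log.
SAMPLE = [
    '[  995.332434] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "SzlH0uFY/SNcZ2JUNERRb2Vvr4cRT4S2dR3CpTLulHk=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.251-10.200.0.251"], "other_ips": ["10.200.0.251-10.200.0.251"]}',
    '[  995.338790] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "Crwl9hD0VYZcrU3t3kiXek2Bdz9+GDm7us7+OEP2bGo=", "other_peer": "ufrBQQMvxC+LkkmumpkihheUo23CGZPXuy7BSGJSATU=", "ignored_ips": ["10.200.0.10-10.200.0.10", "10.200.0.251-10.200.0.251"], "other_ips": ["10.200.0.251-10.200.0.251"]}',
    '[  995.345317] [talos] peer address overlap {"component": "controller-runtime", "controller": "kubespan.PeerSpecController", "ignored_peer": "yY0OtNGgCiEYy1XQ6fSG5HxOj2DA8bxOPK2LgtKjvw4=", "other_peer": "mvemvwh7HlnBcqgZ4DTlA5azzz4k+Du9GUmvTpEKFGc=", "ignored_ips": ["10.200.0.247-10.200.0.247"], "other_ips": ["10.200.0.247-10.200.0.247"]}',
]

if __name__ == "__main__":
    for ip_range, keys in sorted(summarize_overlaps(SAMPLE).items()):
        print(f"{ip_range}: {len(keys)} public keys")
        for key in sorted(keys):
            print(f"  {key}")
```

For the abbreviated sample, three distinct public keys claim 10.200.0.251 and two claim 10.200.0.247. Several keys per address would be consistent with stale entries left over from earlier deployments, though the script alone cannot tell stale entries apart from genuinely duplicated node IPs.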

steverfrancis commented 1 year ago

Stale

danmanners commented 1 year ago

No longer necessary. Closing.