siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Controller Failing to Start on Fresh Install #4078

Closed by drcannoli 2 years ago

drcannoli commented 2 years ago

Bug Report

Description

A fresh install on a new Proxmox instance fails to come up because the k8s controller (k8s.KubeletStaticPodController) keeps failing to start. I followed the instructions at https://www.talos.dev/docs/v0.12/virtualized-platforms/proxmox/. I also tried version 0.11.5 and tried changing the Kubernetes image versions.

Commands to get cluster going were:

talosctl gen config talos-cluster https://192.168.0.107:6443 --output-dir _out
talosctl apply-config --insecure --nodes 192.168.0.107 --file _out/controlplane.yaml

The only change made to controlplane.yaml was setting debug: true.

Logs

Full logs: https://pastebin.com/kzn6G67P (the snippet below is the tail of the log, which repeats as the controller keeps restarting)

192.168.0.107: user: warning: [2021-08-15T06:11:37.504754831Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\x5c"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\x5c") has prevented the request from succeeding"}
192.168.0.107: user: warning: [2021-08-15T06:11:37.509862831Z]: [talos] restarting controller in 289.488593ms {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}   
192.168.0.107: user: warning: [2021-08-15T06:11:37.802523831Z]: [talos] controller starting {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
192.168.0.107: kern:    info: [2021-08-15T06:11:50.974341831Z]: nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
192.168.0.107: user: warning: [2021-08-15T06:11:52.803991831Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\x5c"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\x5c") has prevented the request from succeeding"}
192.168.0.107: user: warning: [2021-08-15T06:11:52.809389831Z]: [talos] restarting controller in 747.171849ms {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}   
192.168.0.107: user: warning: [2021-08-15T06:11:53.559251831Z]: [talos] controller starting {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}

Environment

drcannoli commented 2 years ago

Never mind, got it working. It seems I needed to switch the first node to type: init under machine. I must have missed a step in the docs; I was under the impression that the init type no longer needed to be used.

andrewrynhard commented 2 years ago

@drcannoli When the type was controlplane instead of init, did you run talosctl bootstrap?

drcannoli commented 2 years ago

No @andrewrynhard, same commands. I just changed the following:

version: v1alpha1 # Indicates the schema used to decode the contents.
debug: true # Enable verbose logging to the console.
persist: true # Indicates whether to pull the machine config upon every boot.
# Provides machine specific configuration options.
machine:
    type: controlplane # Defines the role of the machine within the cluster.

to

version: v1alpha1 # Indicates the schema used to decode the contents.
debug: true # Enable verbose logging to the console.
persist: true # Indicates whether to pull the machine config upon every boot.
# Provides machine specific configuration options.
machine:
    type: init # Defines the role of the machine within the cluster. <---- this line

The exact same commands as before were issued:

talosctl gen config talos-cluster https://192.168.0.107:6443 --output-dir _out
talosctl apply-config --insecure --nodes 192.168.0.107 --file _out/controlplane.yaml

Afterwards I successfully joined 2 more nodes, changing the type back to controlplane for those.

Ah, I misread your comment. No, I never ran talosctl bootstrap.

andrewrynhard commented 2 years ago

Bootstrapping is required when not using the init type.
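
For reference, here is a rough sketch of what the sequence might look like when the first node keeps type: controlplane. The node IP and the _out directory are the ones used earlier in this thread. The step and bring_up_first_node helpers, and the NODE, OUT and RUN variables, are hypothetical names added for illustration only. By default the script only prints the commands; setting RUN=1 is assumed to execute them.

```shell
#!/bin/sh
# Sketch only: first control plane node with type: controlplane (no init node).
# NODE and OUT mirror the values from this thread; adjust them for your setup.
NODE="${NODE:-192.168.0.107}"
OUT="${OUT:-_out}"

# Hypothetical helper: print each command, and run it only when RUN=1,
# so the sequence can be reviewed before anything touches the node.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Hypothetical wrapper around the two commands from the report plus bootstrap.
bring_up_first_node() {
    step talosctl gen config talos-cluster "https://${NODE}:6443" --output-dir "$OUT"
    step talosctl apply-config --insecure --nodes "$NODE" --file "$OUT/controlplane.yaml"
    # The step missing from the original report: bootstrap etcd once,
    # against a single control plane node.
    step talosctl --talosconfig "$OUT/talosconfig" --endpoints "$NODE" --nodes "$NODE" bootstrap
}

bring_up_first_node
```

The bootstrap call will probably only succeed once the node has finished installing and rebooted after apply-config, so when running this for real you may need to wait and retry that last step. Additional control plane nodes should not need a bootstrap of their own.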

drcannoli commented 2 years ago

Yup, I found where I missed the steps in the docs. Thanks!