Managed to get three nodes provisioning and they seem to have the Talos config applied (according to OpenTofu).
Managed to use Stephen's work for Talos with Terraform and brought in Equinix Metal machines.
Managed to get the Kubeconfig with tofu output -raw kubeconfig
but can't access the APIServer.
Unsure why the Talosconfig isn't being passed as an output, even though it's explicitly defined as one.
Going to need to rejig it, since the Talos provider is on 0.1.4 and latest is 0.4.0
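A minimal sketch of the version pin involved (assuming the provider is sourced from the siderolabs/talos registry namespace):
terraform {
  required_providers {
    talos = {
      source  = "siderolabs/talos"
      version = "~> 0.4.0"
    }
  }
}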
Managed to rejig it to use Talos provider version 0.4.0.
But now getting the error
Error: cluster_endpoint is invalid
although it's set to https://145.40.90.158:6443
update:
The issue was with var.cluster_virtual_ip; a fellow in the SideroLabs Slack quickly found it and I was able to correct it.
Would like to use either a remote state data source or a remote backend to keep the Terraform deployment state, so it's not only stored locally:
https://opentofu.org/docs/language/state/remote-state-data/ https://opentofu.org/docs/language/settings/backends/configuration/
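A minimal sketch of what a remote backend block could look like, assuming an S3-compatible bucket (the bucket, key, and region here are placeholders, not anything from this repo):
terraform {
  backend "s3" {
    bucket = "example-infra-tofu-state" # hypothetical bucket name
    key    = "equinix-metal-talos/terraform.tfstate"
    region = "us-east-1"
  }
}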
For some reason Cilium isn't coming up.
I'm seeing some logs in talosctl ... dmesg
regarding PodSecurity standards
user: warning: [2024-02-27T22:55:33.160923637Z]: [talos] Warning: would violate PodSecurity "restricted:latest": forbidden AppArmor profiles (container.apparmor.security.beta.kubernetes.io/cilium-agent="unconfined", container.apparmor.security.beta.kubernetes.io/clean-cilium-state="unconfined"), host namespaces (hostNetwork=true), privileged (container "mount-bpf-fs" must not set securityContext.privileged=true), seLinuxOptions (containers "clean-cilium-state", "install-cni-binaries", "cilium-agent" set forbidden securityContext.seLinuxOptions: type "spc_t"), allowPrivilegeEscalation != false (containers "config", "mount-bpf-fs", "clean-cilium-state", "install-cni-binaries", "cilium-agent" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "config", "mount-bpf-fs" must set securityContext.capabilities.drop=["ALL"]; containers "clean-cilium-state", "cilium-agent" must not include "CHOWN", "DAC_OVERRIDE", "FOWNER", "IPC_LOCK", "KILL", "NET_ADMIN", "NET_RAW", "SETG...
...
user: warning: [2024-02-27T22:55:34.193350637Z]: [talos] created apps/v1/DaemonSet/cilium {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
user: warning: [2024-02-27T22:55:34.342779637Z]: [talos] Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true), allowPrivilegeEscalation != false (container "cilium-operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "cilium-operator" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "cilium-operator" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "cilium-operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
...
user: warning: [2024-02-27T22:55:35.029651637Z]: [talos] created apps/v1/Deployment/cilium-operator {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
...
I know that the label pod-security.kubernetes.io/enforce: privileged
can be set on namespaces to resolve this, but I'm unsure why it's occurring here, since I can't find any other docs about Cilium and the Kubernetes Pod Security Standards.
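For reference, a sketch of setting that label via the kubernetes provider, should it ever be needed (in the end the fix below was simply rendering the chart into kube-system):
# hypothetical; assumes the hashicorp/kubernetes provider is configured
resource "kubernetes_labels" "kube_system_pod_security" {
  api_version = "v1"
  kind        = "Namespace"
  metadata {
    name = "kube-system"
  }
  labels = {
    "pod-security.kubernetes.io/enforce" = "privileged"
  }
}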
update:
I've also moved the floating IP to be created in the Terraform.
Managed to get Cilium up. Needed to set the namespace to kube-system
in the helm_template resource.
Also needed to enable KubePrism, according to the Talos Cilium docs: https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/#method-4-helm-manifests-inline-install
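Roughly the shape the helm_template data source ended up as; the chart repo and value names here follow the Talos Cilium guide and may not match the repo exactly:
data "helm_template" "cilium" {
  name       = "cilium"
  namespace  = "kube-system" # rendering into kube-system was the fix
  repository = "https://helm.cilium.io"
  chart      = "cilium"

  set {
    name  = "ipam.mode"
    value = "kubernetes"
  }
  set {
    name  = "k8sServiceHost"
    value = "localhost" # KubePrism endpoint
  }
  set {
    name  = "k8sServicePort"
    value = "7445"
  }
}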
Seeing the following errors in the cloud provider controller for Equinix Metal
2024-02-28T00:36:46.642052047Z stderr F E0228 00:36:46.641967 1 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
It can't access the expected internal IP for the kube-apiserver (the kubernetes Service at 10.96.0.1).
Using kubectl --server https://A_NODE_IP:6443 get pods -A
I was able to see the Pods.
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-c6hjl 0/1 Init:CrashLoopBackOff 4 (19s ago) 7m24s
kube-system cilium-operator-8d757bc86-9bfk8 0/1 CrashLoopBackOff 4 (16s ago) 7m27s
kube-system cilium-operator-8d757bc86-ggxzb 1/1 Running 5 (25s ago) 7m27s
kube-system cilium-rkqg2 0/1 Init:CrashLoopBackOff 4 (30s ago) 7m26s
kube-system cilium-wpsx7 0/1 Init:CrashLoopBackOff 3 (15s ago) 5m44s
kube-system cloud-provider-equinix-metal-fkz7v 1/1 Running 0 5m38s
kube-system cloud-provider-equinix-metal-pl8zl 1/1 Running 0 7m16s
kube-system cloud-provider-equinix-metal-vxn5n 1/1 Running 0 7m13s
kube-system coredns-85b955d87b-6cs8t 0/1 Pending 0 7m27s
kube-system coredns-85b955d87b-zrftw 0/1 Pending 0 7m27s
kube-system kube-apiserver-cloudnative-coop-ubozyhge 1/1 Running 0 5m21s
kube-system kube-apiserver-cloudnative-coop-wgzmaxpf 1/1 Running 0 6m2s
kube-system kube-apiserver-cloudnative-coop-xgepbcdy 1/1 Running 0 4m15s
kube-system kube-controller-manager-cloudnative-coop-ubozyhge 1/1 Running 2 (7m58s ago) 6m20s
kube-system kube-controller-manager-cloudnative-coop-wgzmaxpf 1/1 Running 2 (7m58s ago) 6m35s
kube-system kube-controller-manager-cloudnative-coop-xgepbcdy 1/1 Running 0 4m25s
kube-system kube-scheduler-cloudnative-coop-ubozyhge 1/1 Running 2 (7m58s ago) 6m30s
kube-system kube-scheduler-cloudnative-coop-wgzmaxpf 1/1 Running 2 (7m57s ago) 6m12s
kube-system kube-scheduler-cloudnative-coop-xgepbcdy 1/1 Running 0 4m27s
And getting the logs for a Cilium DaemonSet Pod
$ kubectl --server https://A_NODE_IP:6443 -n kube-system logs cilium-wpsx7 -p -c config
level=error msg="Unable to contact k8s api-server" ... dial tcp ... i/o timeout
...
Usage:
cilium build-config --node-name $K8S_NODE_NAME [flags]
Flags:
--allow-config-keys strings List of configuration keys that are allowed to be overridden (e.g. set from not the first source. Takes precedence over deny-config-keys
--deny-config-keys strings List of configuration keys that are not allowed to be overridden (e.g. set from not the first source. If allow-config-keys is set, this field is ignored
--dest string Destination directory to write the fully-resolved configuration. (default "/tmp/cilium/config-map")
--enable-k8s-api-discovery Enable discovery of Kubernetes API groups and resources with the discovery API
-h, --help help for build-config
--k8s-api-server string Kubernetes API server URL
--k8s-client-burst int Burst value allowed for the K8s client
--k8s-client-qps float32 Queries per second limit for the K8s client
--k8s-heartbeat-timeout duration Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
--k8s-kubeconfig-path string Absolute path of the kubernetes kubeconfig file
--node-name string The name of the node on which we are running. Also set via K8S_NODE_NAME environment. (default "cloudnative-coop-xgepbcdy")
--source strings Ordered list of configuration sources. Supported values: config-map:<namespace>/name - a ConfigMap with <name>, optionally in namespace <namespace>. cilium-node-config:<NAMESPACE> - any CiliumNodeConfigs in namespace <NAMESPACE>. node:<NODENAME> - Annotations on the node. Namespace and nodename are optional (default [config-map:cilium-config,cilium-node-config:kube-system])
Global Flags:
--config string Config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
...
Trying without Cilium now
Things looking good just using Flannel
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cloud-provider-equinix-metal-4r8cr 1/1 Running 0 3m27s
kube-system cloud-provider-equinix-metal-cf6rv 1/1 Running 0 3m48s
kube-system cloud-provider-equinix-metal-nn7qq 1/1 Running 0 3m35s
kube-system coredns-85b955d87b-jc2dg 1/1 Running 0 3m55s
kube-system coredns-85b955d87b-znl2n 1/1 Running 0 3m55s
kube-system kube-apiserver-cloudnative-coop-afewijmk 1/1 Running 0 3m19s
kube-system kube-apiserver-cloudnative-coop-aubvymdx 1/1 Running 0 2m25s
kube-system kube-apiserver-cloudnative-coop-jarhigss 1/1 Running 0 3m20s
kube-system kube-controller-manager-cloudnative-coop-afewijmk 1/1 Running 2 (4m12s ago) 2m46s
kube-system kube-controller-manager-cloudnative-coop-aubvymdx 1/1 Running 2 (4m27s ago) 2m38s
kube-system kube-controller-manager-cloudnative-coop-jarhigss 1/1 Running 2 (4m28s ago) 2m49s
kube-system kube-flannel-l6dlm 1/1 Running 0 3m33s
kube-system kube-flannel-qk5sw 1/1 Running 0 3m50s
kube-system kube-flannel-sgzht 1/1 Running 0 3m54s
kube-system kube-proxy-847hg 1/1 Running 0 3m50s
kube-system kube-proxy-brtsg 1/1 Running 0 3m33s
kube-system kube-proxy-rlms7 1/1 Running 0 3m54s
kube-system kube-scheduler-cloudnative-coop-afewijmk 1/1 Running 2 (4m12s ago) 2m26s
kube-system kube-scheduler-cloudnative-coop-aubvymdx 1/1 Running 2 (4m27s ago) 2m38s
kube-system kube-scheduler-cloudnative-coop-jarhigss 1/1 Running 2 (4m29s ago) 2m46s
though the floating IP still isn't being assigned by the cloud provider controller for Equinix Metal. I think a few other settings need to be tweaked.
Trying to use the Equinix Metal Load Balancer setting
Added loadBalancer and loadBalancerID fields to the config and now seeing logs like this
I0228 01:23:17.685596 1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
Error syncing load balancer: failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
>
I0228 01:23:59.315458 1 controlplane_load_balancer_manager.go:98] handling update, endpoints: default/kubernetes
I0228 01:23:59.315470 1 controlplane_load_balancer_manager.go:134] handling update, service: default/kubernetes
Trying to curl the endpoint manually, it seems clear that the token is not allowed to call this endpoint:
$ curl -X POST -H "Authorization: Bearer $METAL_AUTH_TOKEN" https://iam.metalctrl.io/api-keys/exchange
{"message":"Unauthorized"}
Should we open an issue with Equinix?
An issue has been opened with Equinix Metal.
Currently, to get around it, I've manually assigned the floating IP to one of the nodes and set the following fields in the Talos config so the nodes bind the address on the lo interface:
machine:
network:
hostname: ${each.value.hostname}
interfaces:
- interface: lo
addresses:
- ${equinix_metal_reserved_ip_block.cluster_virtual_ip.network}
Now I can talk to the control plane on the floating IP address. Ideally, though, the Equinix Metal Cloud Provider controller would take care of this and it would stay out of the cluster configuration, given it's implementation-specific and variable.
cert-manager is installed. rook-ceph is installing.
rook-ceph is installed
$ kubectl get cephclusters.ceph.rook.io -A
NAMESPACE NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
rook-ceph rook-ceph /var/lib/rook 3 8m43s Ready Cluster created successfully HEALTH_OK 36e48beb-8681-47fb-bf42-6fe8bf945e60
Istio is installed.
But seeing these events on the istio-gateway Service:
...
Normal EnsuringLoadBalancer 35s (x6 over 3m10s) service-controller Ensuring load balancer
Warning SyncLoadBalancerFailed 35s (x4 over 3m10s) service-controller Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]
and I've set metro and loadbalancer like this now https://github.com/ii/infra/commit/7d67e22f6b7e47226fcebcb7dcfb9f67097c6789
still seeing these logs from the Equinix Metal Cloud Provider controller
E0229 00:47:00.643524 1 controller.go:289] error processing service istio-system/istio-gateway (will retry): failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]
I0229 00:47:00.643564 1 event.go:294] "Event occurred" object="istio-system/istio-gateway" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0229 00:47:00.643588 1 event.go:294] "Event occurred" object="istio-system/istio-gateway" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]"
I0229 00:47:01.162158 1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0229 00:47:01.857872 1 controller.go:289] error processing service kube-system/cloud-provider-equinix-metal-kubernetes-external (will retry): failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
I0229 00:47:01.857936 1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
Error syncing load balancer: failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
It appears that the loadbalancer value needed three slashes https://github.com/ii/infra/commit/a3617e93ff1ea1ed5d094d9fee6b0a642d78640b
now seeing events like this in the Istio Service
...
Normal EnsuringLoadBalancer 46s (x7 over 6m3s) service-controller Ensuring load balancer
Warning SyncLoadBalancerFailed 45s (x7 over 6m2s) service-controller Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: token exchange request failed with status 401, body {"message":"Unauthorized"}
Now using the domain for the apiserver:
https://github.com/ii/infra/commit/89345d005dc78ee741efffc5da8d528d3270984f
Setting loadbalancer field in metal-cloud-config to empty:///
causes Equinix Metal Cloud Provider controller to allocate an IP address and assign it to the LoadBalancer type Service, however it doesn't appear to be routed.
When assigning the IP to a node manually, I get the following error when attempting to talk to Istio on 80 (http): Connection refused
Thanks for flagging this @BobyMCbobs - and very much appreciate the running commentary on what you're running into. We'll be reviewing the docs to make it easier for the next person through.
There was a bug in the readme.md for CPEM. The loadBalancer field should be emlb:///da (or whatever metro you're using, so emlb:///sv or wherever). The diff for the docs is here:
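For reference, a sketch of roughly where that value lands in the config this thread has been editing (the field names and casing follow what was used above; the real CPEM schema may differ, and var.equinix_metal_project_id is a hypothetical variable name):
locals {
  metal_cloud_config = jsonencode({
    apiKey       = var.equinix_metal_auth_token
    projectID    = var.equinix_metal_project_id # hypothetical variable name
    metro        = "da"
    loadbalancer = "emlb:///da" # note the three slashes before the metro
  })
}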
One of the things we need to do is to update cpem to enable DEBUG logs. It would help if you could provide such debug logs from around the 401 error at https://github.com/ii/infra/issues/1#issuecomment-1970194077
Hey can you set the following env variable on the cloud-provider-equinix-metal daemonset's container spec and do a rollout restart on it. This should give some debug logs for the token exchanger that I think would be helpful if you could sanitize (remove the tokens) and share.
PACKNGO_DEBUG = 1
@cprivitere, I wasn't able to see any additional logs with this set
Using a user token instead of a project token appears to resolve any Equinix Metal API errors in the controller logs. However, it would be much better to use a project token.
Got DNS for the cluster domain pointing to the floating IP through RFC 2136 support in the DNS provider.
Using the hacky workaround of attaching it through Terraform (https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L31-L35) instead of relying on the Equinix Metal Cloud Provider controller.
Currently using an eipTag which matches the floating IP tag https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L28 + https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L101 and seeing the following logs
...
I0301 01:23:26.762167 1 eip_controlplane_reconciliation.go:553] doHealthCheck(): checking control plane health through ip https://139.178.88.139:6443/healthz
I0301 01:23:26.763511 1 eip_controlplane_reconciliation.go:572] doHealthCheck(): health check through elastic ip failed, trying to reassign to an available controlplane node
I0301 01:23:26.767525 1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.89.13:6443/healthz
I0301 01:23:26.769085 1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-rtulfevt: returned http code 401
I0301 01:23:26.769098 1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.90.247:6443/healthz
I0301 01:23:26.770239 1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-xcpvkwuc: returned http code 401
I0301 01:23:26.770249 1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.90.201:6443/healthz
I0301 01:23:26.771406 1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-ifivestw: returned http code 401
E0301 01:23:26.771444 1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: ccm didn't find a good candidate for IP allocation. Cluster is unhealthy
I0301 01:23:26.771466 1 eip_controlplane_reconciliation.go:125] handling update, node: sharing-io-rtulfevt
...
which results in a 401, because it is an anonymous request (https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/3fb9b358ecaa037ed271b29246ef8f9b29bab77a/metal/eip_controlplane_reconciliation.go#L250) instead of an authenticated request. The controller Pods have a projected token which they may be able to use for authenticated requests to healthz/readyz/livez etc.
Trying to use Cilium with Talos; I set up KubePrism and pointed Cilium at it:
$ talosctl --talosconfig ~/.talos/config-cloudnative-coop -n x.x.x.x get machineconfig -o yaml 2>/dev/null | yq e .spec.machine.features.kubePrism
enabled: true
port: 7445
However, I'm seeing these kinds of logs:
$ kubectl -n kube-system logs cilium-bx9fx -c config
...
level=error msg="Unable to contact k8s api-server" error="Get \"https://localhost:7443/api/v1/namespaces/kube-system\": dial tcp [::1]:7443: connect: connection refused" ipAddr="https://localhost:7443" subsys=k8s-client
level=error msg="Start hook failed" error="Get \"https://localhost:7443/api/v1/namespaces/kube-system\": dial tcp [::1]:7443: connect: connection refused" function="client.(*compositeClientset).onStart" subsys=hive
Error: failed to start: Get "https://localhost:7443/api/v1/namespaces/kube-system": dial tcp [::1]:7443: connect: connection refused
...
and it appears that it's not able to access the KubePrism endpoint for some reason (note that these logs show port 7443, while KubePrism is configured on 7445, which may point to a mismatched k8sServicePort value).
It doesn't appear that resources in inline manifests are ever updated after bootstrap https://www.talos.dev/v1.6/reference/configuration/v1alpha1/config/#Config.cluster.inlineManifests.
edit:
If you need to update a manifest, make sure to first edit all control plane machine configurations and then run talosctl upgrade-k8s, as it will take care of updating inline manifests.
Yeah, I'd expect that to only happen at the initial bootstrap. You'd have to deploy the Talos machines again? (Or manually apply.)
On Sun, 3 Mar 2024 at 15:14, Caleb Woodbine @.***> wrote:
It doesn't appear that resources in inline manifests are ever updated after bootstrap
https://www.talos.dev/v1.6/reference/configuration/v1alpha1/config/#Config.cluster.inlineManifests .
Cilium is now working on newly created clusters.
I'm trying to manually set the Tofu-provisioned floating IP address as externalIPs and loadBalancerIP in the istio-gateway Service, but am still getting a connection refused message. This is the same IP as the control plane (which is listening on :6443).
Attempting with MetalLB + CPEM and currently seeing no IPs being allocated
$ kubectl get svc -A
...
istio-system istio-gateway LoadBalancer 10.101.197.94 <pending> 80:30102/TCP,443:32694/TCP 3h55m
istio-system istiod ClusterIP 10.107.149.180 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP 3h56m
kube-system cloud-provider-equinix-metal-kubernetes-external LoadBalancer 10.100.120.223 <pending> 443:32454/TCP 110m
...
https://github.com/ii/infra/commit/23f1ca950a77c61213f5c63986ded515dca16484
although the ConfigMap metallb-system/config is being populated by CPEM.
Istio Gateway is not responding on 80 or 443 currently, and I'm trying to configure a Gateway to make it resolve, but no major progress yet.
Very specific argument ordering is needed for tofu import:
tofu import -var-file=./values.tfvars -var equinix_metal_auth_token=$METAL_AUTH_TOKEN -var github_token="$(gh auth token)" module.sharing-io-manifests.kubernetes_namespace.kube-system kube-system
Created the Talos Factory schematic with ID 0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6. It contains the iscsi-tools, mdadm, zfs, and gvisor system extensions (see the extensions listing further below).
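A hypothetical reconstruction of the schematic that would produce an ID like that, with the extension names taken from the get extensions output further below:
locals {
  talos_factory_schematic = <<-EOT
    customization:
      systemExtensions:
        officialExtensions:
          - siderolabs/iscsi-tools
          - siderolabs/mdadm
          - siderolabs/zfs
          - siderolabs/gvisor
  EOT
}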
I'm running the following command on each node to manually perform an upgrade to it:
talosctl --talosconfig ~/.talos/config-sharing-io -n NODE_IP upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6
For some reason the servers were provisioned with v1.6.0 and no extensions... not sure why.
Get all Talos resources
talosctl --talosconfig ~/.talos/config-sharing-io -n A_NODE_IP get rd
Another issue is that, on the first run, the equinix_metal_device_bgp_neighbors data we need to configure the Talos routes bombs out the first time through:
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 95, in resource "talos_machine_configuration_apply" "cp":
│   95:           - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[0]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "0"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
(the same error repeats for each.key "1" and "2", and again for line 97, which indexes peer_ips[1] instead of peer_ips[0])
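One possible guard (a sketch, not what the repo does): tolerate an empty bgp_neighbors list on the first pass instead of failing the whole apply, e.g. with try(); the real fix might instead be a targeted two-phase apply or an explicit depends_on on the BGP session.
locals {
  bgp_peer_ips = {
    # empty until the device's BGP session exists, instead of an Invalid index error
    for key, neighbor in data.equinix_metal_device_bgp_neighbors.bgp_neighbor :
    key => try(neighbor.bgp_neighbors[0].peer_ips, [])
  }
}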
Would be good to set kernel args for Talos so we can see boot messages on the Equinix serial recovery console (SOS). This is what an AlmaLinux box does:
[root@c3-medium-x86-01 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz root=UUID=96816452-f524-4d80-a493-aeebe04e2cf5 ro console=tty0 console=ttyS1,115200n8 rd.auto rd.auto=1
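A hypothetical Talos config patch for this; machine.install.extraKernelArgs should only take effect on the next install/upgrade:
locals {
  serial_console_patch = <<-EOT
    machine:
      install:
        extraKernelArgs:
          - console=tty0
          - console=ttyS1,115200n8
  EOT
}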
Would be good to configure Talos networking to bond enp65s0f0 and enp65s0f1:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp65s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff
altname ens3f0
3: enp65s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff permaddr 40:a6:b7:6f:78:a1
altname ens3f1
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff
inet 139.178.70.133/31 scope global noprefixroute bond0
valid_lft forever preferred_lft forever
inet 10.67.90.135/31 scope global noprefixroute bond0:0
valid_lft forever preferred_lft forever
inet6 2604:1380:45e3:4300::7/127 scope global noprefixroute
valid_lft forever preferred_lft forever
inet6 fe80::42a6:b7ff:fe6f:78a0/64 scope link noprefixroute
valid_lft forever preferred_lft forever
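A hypothetical Talos config patch mirroring the bond0 layout above; the interface names come from the ip a output, while the bond options are assumptions:
locals {
  bond_patch = <<-EOT
    machine:
      network:
        interfaces:
          - interface: bond0
            dhcp: true
            bond:
              mode: 802.3ad
              xmitHashPolicy: layer3+4
              lacpRate: fast
              interfaces:
                - enp65s0f0
                - enp65s0f1
  EOT
}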
Upon first convergence, it would be nice for Talos to assign the apiserver EIP to one of the nodes before Kubernetes (and thus CCM/CPEM) is fully up.
Does this have something to do with that?
vip:
ip: ${equinix_metal_reserved_ip_block.cluster_ingress_ip.address}
equinixMetal:
apiToken: ${var.equinix_metal_auth_token}
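For context, a sketch of where that vip block would sit, assuming a bond0 interface; Talos's VIP support is intended to move the EIP between control plane nodes itself, before CCM/CPEM is running:
locals {
  vip_patch = <<-EOT
    machine:
      network:
        interfaces:
          - interface: bond0
            dhcp: true
            vip:
              ip: ${equinix_metal_reserved_ip_block.cluster_ingress_ip.address}
              equinixMetal:
                apiToken: ${var.equinix_metal_auth_token}
  EOT
}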
We now have one of our talos nodes upgraded to 1.6.6:
hh@m1 infra % talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.39 upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6
WARNING: 147.75.71.39: server version 1.6.0 is older than client version 1.6.6
watching nodes: [147.75.71.39]
* 147.75.71.39: stage: BOOTING ready: true unmetCond: []
Signal received, aborting, press Ctrl+C once again to abort immediately...
hh@m1 infra % kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
sharing-io-icvoiyes Ready control-plane 21m v1.29.2 147.75.71.39 <none> Talos (v1.6.6) 6.1.80-talos containerd://1.7.13
sharing-io-owyapupv Ready control-plane 21m v1.29.2 147.75.71.237 <none> Talos (v1.6.0) 6.1.67-talos containerd://1.7.10
sharing-io-wagfsaif Ready control-plane 21m v1.29.2 147.75.49.11 <none> Talos (v1.6.0) 6.1.67-talos containerd://1.7.10
The upgrade of the next node wouldn't start due to some etcd members not being healthy after the upgrade of the first node.
hh@m1 infra % talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.237 upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6
WARNING: 147.75.71.237: server version 1.6.0 is older than client version 1.6.6
watching nodes: [147.75.71.237]
* 147.75.71.237: rpc error: code = DeadlineExceeded desc = error validating etcd for upgrade: etcd member 5ecbd721c01aa1eb is not healthy; all members must be healthy to perform an upgrade: context deadline
exceeded
I wonder how we can help that member be healthy and have the other two connect!
talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.39,147.75.71.237,147.75.49.11 etcd status
WARNING: 2 errors occurred:
* 147.75.49.11: rpc error: code = DeadlineExceeded desc = failed to create etcd client: error building etcd client: context deadline exceeded
* 147.75.71.237: rpc error: code = DeadlineExceeded desc = failed to create etcd client: error building etcd client: context deadline exceeded
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
147.75.71.39 5ecbd721c01aa1eb 3.6 MB 1.4 MB (39.86%) 5ecbd721c01aa1eb 4612 3 4612 false
Might be good to separate these into modules:
Figured it was worth ensuring Talos was widely aware of this major change: https://github.com/siderolabs/talos/issues/8411
I was able to explicitly specify the installer version here: https://github.com/ii/infra/commit/44bdd4f66a2fafcbfb07f16f3cc6b36c8e576898
Which seems to have enabled the installer and extensions:
kubectl --server https://147.75.71.255:6443 get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
sharing-io-liouumoj Ready control-plane 29m v1.29.2 147.75.49.25 <none> Talos (v1.6.6) 6.1.80-talos containerd://1.7.13
sharing-io-ruewrpcf Ready control-plane 29m v1.29.2 147.75.71.255 <none> Talos (v1.6.6) 6.1.80-talos containerd://1.7.13
sharing-io-vfwhaqop Ready control-plane 29m v1.29.2 147.75.71.237 <none> Talos (v1.6.6) 6.1.80-talos containerd://1.7.13
talosctl -n 147.75.71.255,147.75.71.237,147.75.49.25 get extensions
NODE NAMESPACE TYPE ID VERSION NAME VERSION
147.75.71.255 runtime ExtensionStatus 0 1 iscsi-tools v0.1.4
147.75.71.255 runtime ExtensionStatus 1 1 mdadm v4.2-v1.6.6
147.75.71.255 runtime ExtensionStatus 2 1 zfs 2.1.14-v1.6.6
147.75.71.255 runtime ExtensionStatus 3 1 gvisor 20231214.0-v1.6.6
147.75.71.255 runtime ExtensionStatus 4 1 schematic 0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.71.255 runtime ExtensionStatus modules.dep 1 modules.dep 6.1.80-talos
147.75.71.237 runtime ExtensionStatus 0 1 iscsi-tools v0.1.4
147.75.71.237 runtime ExtensionStatus 1 1 mdadm v4.2-v1.6.6
147.75.71.237 runtime ExtensionStatus 2 1 zfs 2.1.14-v1.6.6
147.75.71.237 runtime ExtensionStatus 3 1 gvisor 20231214.0-v1.6.6
147.75.71.237 runtime ExtensionStatus 4 1 schematic 0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.71.237 runtime ExtensionStatus modules.dep 1 modules.dep 6.1.80-talos
147.75.49.25 runtime ExtensionStatus 0 1 iscsi-tools v0.1.4
147.75.49.25 runtime ExtensionStatus 1 1 mdadm v4.2-v1.6.6
147.75.49.25 runtime ExtensionStatus 2 1 zfs 2.1.14-v1.6.6
147.75.49.25 runtime ExtensionStatus 3 1 gvisor 20231214.0-v1.6.6
147.75.49.25 runtime ExtensionStatus 4 1 schematic 0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.49.25 runtime ExtensionStatus modules.dep 1 modules.dep 6.1.80-talos
Talos on Equinix
Cluster Configuration
Similar to https://github.com/cloudnative-coop/infra/tree/canon/clusters/sv1.cloudnative.coop