ii / infra


Multi-node sharing.io cluster on Equinix Hong Kong #1

Closed. hh closed this issue 7 months ago.

hh commented 9 months ago

Talos on Equinix

Cluster Configuration

Similar to https://github.com/cloudnative-coop/infra/tree/canon/clusters/sv1.cloudnative.coop

BobyMCbobs commented 9 months ago

Managed to get three nodes provisioning and they seem to have the Talos config applied (according to OpenTofu).

[Screenshot: 2024-02-27 at 15:03:03]

Managed to use Stephen's work for Talos with Terraform and brought in Equinix Metal machines.

Managed to get the Kubeconfig with tofu output -raw kubeconfig but can't access the APIServer. Unsure why the Talosconfig isn't being passed as an output even though it's explicitly declared as one.
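For reference, the output wiring I'd expect is roughly the below; the data source and resource names are assumptions about the module rather than confirmed against it.

data "talos_client_configuration" "this" {
  # Assumed names: the module may call its secrets resource and device list differently.
  cluster_name         = var.cluster_name
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = [for d in equinix_metal_device.cp : d.access_public_ipv4]
}

output "talosconfig" {
  value     = data.talos_client_configuration.this.talos_config
  sensitive = true
}

With something like that in place, tofu output -raw talosconfig should behave the same way the kubeconfig output does.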

BobyMCbobs commented 9 months ago

Going to need to rejig it, since the Talos provider is on 0.1.4 and latest is 0.4.0
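The bump itself should just be the pin in required_providers, roughly:

terraform {
  required_providers {
    talos = {
      source  = "siderolabs/talos"
      version = "0.4.0"
    }
  }
}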

BobyMCbobs commented 9 months ago

Managed to rejig it to use Talos provider version 0.4.0.

But now getting the error

Error: cluster_endpoint is invalid

although it's set to https://145.40.90.158:6443

update:

BobyMCbobs commented 9 months ago

Would like to use an external state backend to keep the deployment state for the Terraform, so it's not only stored locally.

https://opentofu.org/docs/language/state/remote-state-data/ https://opentofu.org/docs/language/settings/backends/configuration/
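A minimal sketch of what that could look like with an S3-compatible backend; the bucket, key, and region below are placeholders rather than anything we've settled on.

terraform {
  backend "s3" {
    bucket = "ii-infra-tfstate"             # placeholder bucket
    key    = "sharing-io/terraform.tfstate" # placeholder key
    region = "us-east-1"                    # placeholder region
  }
}

Moving the existing local state across should then just be a tofu init -migrate-state.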

BobyMCbobs commented 9 months ago

For some reason Cilium isn't coming up. I'm seeing some logs in talosctl ... dmesg regarding PodSecurity standards

user: warning: [2024-02-27T22:55:33.160923637Z]: [talos] Warning: would violate PodSecurity "restricted:latest": forbidden AppArmor profiles (container.apparmor.security.beta.kubernetes.io/cilium-agent="unconfined", container.apparmor.security.beta.kubernetes.io/clean-cilium-state="unconfined"), host namespaces (hostNetwork=true), privileged (container "mount-bpf-fs" must not set securityContext.privileged=true), seLinuxOptions (containers "clean-cilium-state", "install-cni-binaries", "cilium-agent" set forbidden securityContext.seLinuxOptions: type "spc_t"), allowPrivilegeEscalation != false (containers "config", "mount-bpf-fs", "clean-cilium-state", "install-cni-binaries", "cilium-agent" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "config", "mount-bpf-fs" must set securityContext.capabilities.drop=["ALL"]; containers "clean-cilium-state", "cilium-agent" must not include "CHOWN", "DAC_OVERRIDE", "FOWNER", "IPC_LOCK", "KILL", "NET_ADMIN", "NET_RAW", "SETG...
...
user: warning: [2024-02-27T22:55:34.193350637Z]: [talos] created apps/v1/DaemonSet/cilium {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
user: warning: [2024-02-27T22:55:34.342779637Z]: [talos] Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true), allowPrivilegeEscalation != false (container "cilium-operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "cilium-operator" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "cilium-operator" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "cilium-operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
...
user: warning: [2024-02-27T22:55:35.029651637Z]: [talos] created apps/v1/Deployment/cilium-operator {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
...

I know that the label pod-security.kubernetes.io/enforce: privileged can be set on namespaces to resolve this, but I'm unsure why it's occurring here, since I can't find any other docs regarding Cilium and Kubernetes Pod Security Standards.
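If the namespace-label route turns out to be needed, a sketch of doing it from the Tofu side with the kubernetes provider's kubernetes_labels resource (the namespace name is a placeholder for wherever the Cilium manifests land):

resource "kubernetes_labels" "cilium_pod_security" {
  api_version = "v1"
  kind        = "Namespace"
  metadata {
    name = "cilium" # placeholder namespace
  }
  labels = {
    "pod-security.kubernetes.io/enforce" = "privileged"
  }
}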

update:

BobyMCbobs commented 9 months ago

I've also moved the floating IP to be created in the Terraform.

BobyMCbobs commented 9 months ago

Managed to get Cilium up. Needed to set the namespace to kube-system in the helm_template resource. Also needed to enable KubePrism, according to the Talos Cilium docs: https://www.talos.dev/v1.6/kubernetes-guides/network/deploying-cilium/#method-4-helm-manifests-inline-install
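Roughly what the helm_template data source looks like now, following the linked guide (treat this as a sketch rather than the exact repo contents; the chart version is a placeholder). My guess is that kube-system works because Talos' default PodSecurity admission config exempts that namespace.

data "helm_template" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.15.1" # placeholder chart version

  set {
    name  = "ipam.mode"
    value = "kubernetes"
  }
  set {
    name  = "kubeProxyReplacement"
    value = "true"
  }
  # KubePrism: the local apiserver load balancer Talos runs on every node.
  set {
    name  = "k8sServiceHost"
    value = "localhost"
  }
  set {
    name  = "k8sServicePort"
    value = "7445"
  }
}

The rendered output (data.helm_template.cilium.manifest) then goes into the machine config as an inline manifest, per method 4 in the guide.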

BobyMCbobs commented 9 months ago

Seeing the following errors in the cloud provider controller for Equinix Metal

2024-02-28T00:36:46.642052047Z stderr F E0228 00:36:46.641967       1 leaderelection.go:330] error retrieving resource lock kube-system/cloud-controller-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

it can't access the expected internal IP for the kube-apiserver

BobyMCbobs commented 9 months ago

Using kubectl --server https://A_NODE_IP:6443 get pods -A I was able to see the Pods.

NAMESPACE     NAME                                                READY   STATUS                  RESTARTS        AGE
kube-system   cilium-c6hjl                                        0/1     Init:CrashLoopBackOff   4 (19s ago)     7m24s
kube-system   cilium-operator-8d757bc86-9bfk8                     0/1     CrashLoopBackOff        4 (16s ago)     7m27s
kube-system   cilium-operator-8d757bc86-ggxzb                     1/1     Running                 5 (25s ago)     7m27s
kube-system   cilium-rkqg2                                        0/1     Init:CrashLoopBackOff   4 (30s ago)     7m26s
kube-system   cilium-wpsx7                                        0/1     Init:CrashLoopBackOff   3 (15s ago)     5m44s
kube-system   cloud-provider-equinix-metal-fkz7v                  1/1     Running                 0               5m38s
kube-system   cloud-provider-equinix-metal-pl8zl                  1/1     Running                 0               7m16s
kube-system   cloud-provider-equinix-metal-vxn5n                  1/1     Running                 0               7m13s
kube-system   coredns-85b955d87b-6cs8t                            0/1     Pending                 0               7m27s
kube-system   coredns-85b955d87b-zrftw                            0/1     Pending                 0               7m27s
kube-system   kube-apiserver-cloudnative-coop-ubozyhge            1/1     Running                 0               5m21s
kube-system   kube-apiserver-cloudnative-coop-wgzmaxpf            1/1     Running                 0               6m2s
kube-system   kube-apiserver-cloudnative-coop-xgepbcdy            1/1     Running                 0               4m15s
kube-system   kube-controller-manager-cloudnative-coop-ubozyhge   1/1     Running                 2 (7m58s ago)   6m20s
kube-system   kube-controller-manager-cloudnative-coop-wgzmaxpf   1/1     Running                 2 (7m58s ago)   6m35s
kube-system   kube-controller-manager-cloudnative-coop-xgepbcdy   1/1     Running                 0               4m25s
kube-system   kube-scheduler-cloudnative-coop-ubozyhge            1/1     Running                 2 (7m58s ago)   6m30s
kube-system   kube-scheduler-cloudnative-coop-wgzmaxpf            1/1     Running                 2 (7m57s ago)   6m12s
kube-system   kube-scheduler-cloudnative-coop-xgepbcdy            1/1     Running                 0               4m27s

And getting the logs for a Cilium DaemonSet Pod

$ kubectl --server https://A_NODE_IP:6443 -n kube-system logs cilium-wpsx7 -p -c config
level=error msg="Unable to contact k8s api-server" ... dial tcp ... i/o timeout
...
Usage:
  cilium build-config --node-name $K8S_NODE_NAME [flags]

Flags:
      --allow-config-keys strings        List of configuration keys that are allowed to be overridden (e.g. set from not the first source. Takes precedence over deny-config-keys
      --deny-config-keys strings         List of configuration keys that are not allowed to be overridden (e.g. set from not the first source. If allow-config-keys is set, this field is ignored
      --dest string                      Destination directory to write the fully-resolved configuration. (default "/tmp/cilium/config-map")
      --enable-k8s-api-discovery         Enable discovery of Kubernetes API groups and resources with the discovery API
  -h, --help                             help for build-config
      --k8s-api-server string            Kubernetes API server URL
      --k8s-client-burst int             Burst value allowed for the K8s client
      --k8s-client-qps float32           Queries per second limit for the K8s client
      --k8s-heartbeat-timeout duration   Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string       Absolute path of the kubernetes kubeconfig file
      --node-name string                 The name of the node on which we are running. Also set via K8S_NODE_NAME environment. (default "cloudnative-coop-xgepbcdy")
      --source strings                   Ordered list of configuration sources. Supported values: config-map:<namespace>/name - a ConfigMap with <name>, optionally in namespace <namespace>. cilium-node-config:<NAMESPACE> - any CiliumNodeConfigs in namespace <NAMESPACE>.  node:<NODENAME> - Annotations on the node. Namespace and nodename are optional (default [config-map:cilium-config,cilium-node-config:kube-system])

Global Flags:
      --config string   Config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
...
BobyMCbobs commented 9 months ago

Trying without Cilium now

BobyMCbobs commented 9 months ago

Things looking good just using Flannel

NAMESPACE     NAME                                                READY   STATUS    RESTARTS        AGE
kube-system   cloud-provider-equinix-metal-4r8cr                  1/1     Running   0               3m27s
kube-system   cloud-provider-equinix-metal-cf6rv                  1/1     Running   0               3m48s
kube-system   cloud-provider-equinix-metal-nn7qq                  1/1     Running   0               3m35s
kube-system   coredns-85b955d87b-jc2dg                            1/1     Running   0               3m55s
kube-system   coredns-85b955d87b-znl2n                            1/1     Running   0               3m55s
kube-system   kube-apiserver-cloudnative-coop-afewijmk            1/1     Running   0               3m19s
kube-system   kube-apiserver-cloudnative-coop-aubvymdx            1/1     Running   0               2m25s
kube-system   kube-apiserver-cloudnative-coop-jarhigss            1/1     Running   0               3m20s
kube-system   kube-controller-manager-cloudnative-coop-afewijmk   1/1     Running   2 (4m12s ago)   2m46s
kube-system   kube-controller-manager-cloudnative-coop-aubvymdx   1/1     Running   2 (4m27s ago)   2m38s
kube-system   kube-controller-manager-cloudnative-coop-jarhigss   1/1     Running   2 (4m28s ago)   2m49s
kube-system   kube-flannel-l6dlm                                  1/1     Running   0               3m33s
kube-system   kube-flannel-qk5sw                                  1/1     Running   0               3m50s
kube-system   kube-flannel-sgzht                                  1/1     Running   0               3m54s
kube-system   kube-proxy-847hg                                    1/1     Running   0               3m50s
kube-system   kube-proxy-brtsg                                    1/1     Running   0               3m33s
kube-system   kube-proxy-rlms7                                    1/1     Running   0               3m54s
kube-system   kube-scheduler-cloudnative-coop-afewijmk            1/1     Running   2 (4m12s ago)   2m26s
kube-system   kube-scheduler-cloudnative-coop-aubvymdx            1/1     Running   2 (4m27s ago)   2m38s
kube-system   kube-scheduler-cloudnative-coop-jarhigss            1/1     Running   2 (4m29s ago)   2m46s

though the floating IP still isn't being assigned by the cloud provider controller for Equinix Metal. Think a few other settings need to be tweaked

BobyMCbobs commented 9 months ago

Trying to use the Equinix Metal Load Balancer setting

https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/036c57b7ad4585b33c5cb8dbf490d6f17cc3b05e/README.md?plain=1#L602

BobyMCbobs commented 9 months ago

Added loadBalancer and loadBalancerID fields to the config and now seeing logs like this

I0228 01:23:17.685596       1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
        Error syncing load balancer: failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
 >
I0228 01:23:59.315458       1 controlplane_load_balancer_manager.go:98] handling update, endpoints: default/kubernetes
I0228 01:23:59.315470       1 controlplane_load_balancer_manager.go:134] handling update, service: default/kubernetes
BobyMCbobs commented 9 months ago

It appears to be failing here https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/036c57b7ad4585b33c5cb8dbf490d6f17cc3b05e/metal/loadbalancers/emlb/infrastructure/token_exchanger.go#L38

BobyMCbobs commented 9 months ago

Trying to curl the endpoint manually, it appears that the token is not allowed to call this endpoint:

🐚(2) curl -X POST -H "Authorization: Bearer $METAL_AUTH_TOKEN" https://iam.metalctrl.io/api-keys/exchange
{"message":"Unauthorized"}
hh commented 9 months ago

Should we open an issue with Equinix?

BobyMCbobs commented 9 months ago

An issue has been opened with Equinix Metal.

To get around it for now, I've manually assigned the floating IP to one of the nodes and set the following fields in the Talos config so the nodes bind the address on the lo interface

machine:
  network:
    hostname: ${each.value.hostname}
    interfaces:
      - interface: lo
        addresses:
          - ${equinix_metal_reserved_ip_block.cluster_virtual_ip.network}

Now I can talk to the control plane on the floating IP address. Though, ideally the Equinix Metal Cloud Provider controller would take care of it and it would stay out of the cluster configuration, given it's implementation-specific and variable.

BobyMCbobs commented 9 months ago

cert-manager is installed. rook-ceph is installing.

BobyMCbobs commented 9 months ago

rook-ceph is installed

🐚(2) kubectl get cephclusters.ceph.rook.io -A
NAMESPACE   NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL   FSID
rook-ceph   rook-ceph   /var/lib/rook     3          8m43s   Ready   Cluster created successfully   HEALTH_OK              36e48beb-8681-47fb-bf42-6fe8bf945e60
BobyMCbobs commented 9 months ago

istio installed.

But seeing these events in the istio-gateway service

...
  Normal   EnsuringLoadBalancer    35s (x6 over 3m10s)   service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  35s (x4 over 3m10s)   service-controller  Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]

and I've set metro and loadbalancer like this now https://github.com/ii/infra/commit/7d67e22f6b7e47226fcebcb7dcfb9f67097c6789

BobyMCbobs commented 9 months ago

still seeing these logs from the Equinix Metal Cloud Provider controller

 E0229 00:47:00.643524       1 controller.go:289] error processing service istio-system/istio-gateway (will retry): failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]
I0229 00:47:00.643564       1 event.go:294] "Event occurred" object="istio-system/istio-gateway" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0229 00:47:00.643588       1 event.go:294] "Event occurred" object="istio-system/istio-gateway" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: could not determine load balancer location for metro ; valid values are [da ny sv]"
I0229 00:47:01.162158       1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0229 00:47:01.857872       1 controller.go:289] error processing service kube-system/cloud-provider-equinix-metal-kubernetes-external (will retry): failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
I0229 00:47:01.857936       1 event.go:294] "Event occurred" object="kube-system/cloud-provider-equinix-metal-kubernetes-external" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
        Error syncing load balancer: failed to ensure load balancer: failed to add service cloud-provider-equinix-metal-kubernetes-external: token exchange request failed with status 401, body {"message":"Unauthorized"}
BobyMCbobs commented 9 months ago

It appears that the loadbalancer value needed three slashes https://github.com/ii/infra/commit/a3617e93ff1ea1ed5d094d9fee6b0a642d78640b

now seeing events like this in the Istio Service

...
  Normal   EnsuringLoadBalancer    46s (x7 over 6m3s)   service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  45s (x7 over 6m2s)   service-controller  Error syncing load balancer: failed to ensure load balancer: failed to add service istio-gateway: token exchange request failed with status 401, body {"message":"Unauthorized"}
BobyMCbobs commented 9 months ago

now using the domain for apiserver

https://github.com/ii/infra/commit/89345d005dc78ee741efffc5da8d528d3270984f

BobyMCbobs commented 9 months ago

Setting the loadbalancer field in metal-cloud-config to empty:/// causes the Equinix Metal Cloud Provider controller to allocate an IP address and assign it to the LoadBalancer-type Service; however, the IP doesn't appear to be routed.

When assigning the IP to a node manually, I get the following error when attempting to talk to Istio on port 80 (http): Connection refused

vielmetti commented 9 months ago

Thanks for flagging this @BobyMCbobs - and very much appreciate the running commentary on what you're running into. We'll be reviewing the docs to make it easier for the next person through.

There was a bug in the readme.md for CPEM. The loadBalancer field should be emlb:///da (or whatever metro you're using, so emlb:///sv or wherever). The diff for the docs is here:

https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/commit/3fb9b358ecaa037ed271b29246ef8f9b29bab77a

vielmetti commented 9 months ago

One of the things we need to do is to update cpem to enable DEBUG logs. It would help if you could provide such debug logs from around the 401 error at https://github.com/ii/infra/issues/1#issuecomment-1970194077

cprivitere commented 9 months ago

Hey, can you set the following env variable on the cloud-provider-equinix-metal daemonset's container spec and do a rollout restart on it? This should give some debug logs for the token exchanger that I think would be helpful, if you could sanitize them (remove the tokens) and share.

PACKNGO_DEBUG = 1
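If it's easier to do from your Tofu side, something like the kubernetes provider's kubernetes_env resource should also work; the kind and container name below are guesses about how CPEM is deployed in your cluster.

resource "kubernetes_env" "cpem_packngo_debug" {
  api_version = "apps/v1"
  kind        = "DaemonSet"
  metadata {
    name      = "cloud-provider-equinix-metal"
    namespace = "kube-system"
  }
  container = "cloud-provider-equinix-metal" # assumed container name

  env {
    name  = "PACKNGO_DEBUG"
    value = "1"
  }
}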
BobyMCbobs commented 9 months ago

@cprivitere, I wasn't able to see any additional logs with this set

BobyMCbobs commented 9 months ago

Using a user token instead of a project token appears to resolve any Equinix Metal API errors in the controller logs. However, it would be much better to use a project token.

BobyMCbobs commented 9 months ago

Got DNS for the cluster domain pointing to the floating IP through RFC 2136 support in the DNS provider.
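Roughly along these lines with the hashicorp/dns provider; the server, key, and zone values here are placeholders rather than our actual DNS setup.

variable "dns_tsig_key_secret" {
  type      = string
  sensitive = true
}

provider "dns" {
  update {
    server        = "ns1.example.org" # placeholder: the RFC 2136-capable server
    key_name      = "tsig-key."       # placeholder TSIG key name
    key_algorithm = "hmac-sha256"
    key_secret    = var.dns_tsig_key_secret
  }
}

resource "dns_a_record_set" "cluster_api" {
  zone      = "example.org." # placeholder zone
  name      = "sharing-io"   # placeholder record name
  addresses = [equinix_metal_reserved_ip_block.cluster_virtual_ip.network]
  ttl       = 300
}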

BobyMCbobs commented 9 months ago

Using the hacky workaround of attaching the floating IP through the Terraform (https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L31-L35) instead of relying on the Equinix Metal Cloud Provider controller.

Currently using an eipTag which matches the floating IP tag https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L28 + https://github.com/ii/infra/blob/0d4e519/terraform/equinix-metal-talos-cluster/main.tf#L101 and seeing the following logs

...
I0301 01:23:26.762167       1 eip_controlplane_reconciliation.go:553] doHealthCheck(): checking control plane health through ip https://139.178.88.139:6443/healthz
I0301 01:23:26.763511       1 eip_controlplane_reconciliation.go:572] doHealthCheck(): health check through elastic ip failed, trying to reassign to an available controlplane node
I0301 01:23:26.767525       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.89.13:6443/healthz
I0301 01:23:26.769085       1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-rtulfevt: returned http code 401
I0301 01:23:26.769098       1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.90.247:6443/healthz
I0301 01:23:26.770239       1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-xcpvkwuc: returned http code 401
I0301 01:23:26.770249       1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.90.201:6443/healthz
I0301 01:23:26.771406       1 eip_controlplane_reconciliation.go:286] will not assign control plane endpoint to new device sharing-io-ifivestw: returned http code 401
E0301 01:23:26.771444       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: ccm didn't find a good candidate for IP allocation. Cluster is unhealthy
I0301 01:23:26.771466       1 eip_controlplane_reconciliation.go:125] handling update, node: sharing-io-rtulfevt
...

which results in a 401, because it is an anonymous request (https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/3fb9b358ecaa037ed271b29246ef8f9b29bab77a/metal/eip_controlplane_reconciliation.go#L250) instead of an authenticated request. The controller Pods have a projected token which they may be able to use for authenticated requests to healthz/readyz/livez etc...

BobyMCbobs commented 9 months ago

Trying to use Cilium with Talos; I set up KubePrism and pointed Cilium to it

🐚 talosctl --talosconfig ~/.talos/config-cloudnative-coop -n x.x.x.x get machineconfig -o yaml 2>/dev/null | yq e .spec.machine.features.kubePrism
enabled: true
port: 7445

However, I’m seeing these kinda logs

🐚 kubectl -n kube-system logs cilium-bx9fx -c config
...
level=error msg="Unable to contact k8s api-server" error="Get \"https://localhost:7443/api/v1/namespaces/kube-system\": dial tcp [::1]:7443: connect: connection refused" ipAddr="https://localhost:7443" subsys=k8s-client
level=error msg="Start hook failed" error="Get \"https://localhost:7443/api/v1/namespaces/kube-system\": dial tcp [::1]:7443: connect: connection refused" function="client.(*compositeClientset).onStart" subsys=hive
Error: failed to start: Get "https://localhost:7443/api/v1/namespaces/kube-system": dial tcp [::1]:7443: connect: connection refused
...

and it appears that it's not able to access the KubePrism endpoint for some reason (note that Cilium is dialing localhost:7443 in the logs, while KubePrism is configured on port 7445)

BobyMCbobs commented 9 months ago

It doesn't appear that resources in inline manifests are ever updated after bootstrap https://www.talos.dev/v1.6/reference/configuration/v1alpha1/config/#Config.cluster.inlineManifests.

edit:

If you need to update a manifest make sure to first edit all control plane machine configurations and then run talosctl upgrade-k8s as it will take care of updating inline manifests

hh commented 9 months ago

Yeah I’d expect that to only happen at the initial bootstrap. You’d have to deploy the talos machines again? (Or manually apply)

BobyMCbobs commented 9 months ago

Cilium is now working on newly created clusters.

BobyMCbobs commented 9 months ago

I'm trying to manually set the Tofu-provisioned floating IP address as externalIPs and loadBalancerIP in the istio-gateway Service, but am still getting a connection refused message. This is the same IP as the control plane (which is listening on :6443).

BobyMCbobs commented 9 months ago

Attempting with MetalLB + CPEM and currently seeing no IPs being allocated

🐚 kubectl get svc -A
...
istio-system     istio-gateway                                      LoadBalancer   10.101.197.94    <pending>     80:30102/TCP,443:32694/TCP              3h55m
istio-system     istiod                                             ClusterIP      10.107.149.180   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP   3h56m
kube-system      cloud-provider-equinix-metal-kubernetes-external   LoadBalancer   10.100.120.223   <pending>     443:32454/TCP                           110m
...

https://github.com/ii/infra/commit/23f1ca950a77c61213f5c63986ded515dca16484

although the ConfigMap in metallb-system/config is being populated by CPEM.

BobyMCbobs commented 9 months ago

Istio Gateway is not responding on 80 or 443 currently; I'm trying to configure a gateway to make it resolve, but no major progress yet.

BobyMCbobs commented 8 months ago

Very specific ordering for tofu import

tofu import -var-file=./values.tfvars -var equinix_metal_auth_token=$METAL_AUTH_TOKEN -var github_token="$(gh auth token)" module.sharing-io-manifests.kubernetes_namespace.kube-system kube-system
BobyMCbobs commented 8 months ago

Created the Talos Factory schematic with ID 0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6. It contains the iscsi-tools, mdadm, zfs, and gvisor system extensions.

Running the following command on each node to manually perform an upgrade to it:

talosctl --talosconfig ~/.talos/config-sharing-io -n NODE_IP upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6

For some reason the servers were provisioned with v1.6.0 and no extensions... not sure why.

BobyMCbobs commented 8 months ago

Get all Talos resources

talosctl --talosconfig ~/.talos/config-sharing-io -n A_NODE_IP get rd
hh commented 8 months ago

Another issue is that the equinix_metal_device_bgp_neighbors data that we need to configure the Talos routes bombs out on the first run.

╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 95, in resource "talos_machine_configuration_apply" "cp":
│   95:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[0]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "0"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 95, in resource "talos_machine_configuration_apply" "cp":
│   95:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[0]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "1"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 95, in resource "talos_machine_configuration_apply" "cp":
│   95:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[0]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "2"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 97, in resource "talos_machine_configuration_apply" "cp":
│   97:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[1]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "0"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 97, in resource "talos_machine_configuration_apply" "cp":
│   97:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[1]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "1"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
╷
│ Error: Invalid index
│
│   on terraform/equinix-metal-talos-cluster-with-ccm/main.tf line 97, in resource "talos_machine_configuration_apply" "cp":
│   97:                - network: ${data.equinix_metal_device_bgp_neighbors.bgp_neighbor[each.key].bgp_neighbors[0].peer_ips[1]}/32
│     ├────────────────
│     │ data.equinix_metal_device_bgp_neighbors.bgp_neighbor is object with 3 attributes
│     │ each.key is "2"
│
│ The given key does not identify an element in this collection value: the collection has no elements.
╵
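One option might be to guard the interpolation so that an empty bgp_neighbors list on the first pass doesn't hard-fail, something like the sketch below (using try(); untested against the module):

locals {
  # Empty list when the BGP neighbors haven't been established yet on the first run.
  bgp_peer_ips = {
    for k, v in data.equinix_metal_device_bgp_neighbors.bgp_neighbor :
    k => try(v.bgp_neighbors[0].peer_ips, [])
  }
}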
hh commented 8 months ago

Would be good to set kernel args for Talos so we can see boot messages on the Equinix serial recovery / console. This is what an AlmaLinux box does:

[root@c3-medium-x86-01 ~]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz root=UUID=96816452-f524-4d80-a493-aeebe04e2cf5 ro console=tty0 console=ttyS1,115200n8 rd.auto rd.auto=1
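A sketch of how that could be patched in on our side; the console arguments are lifted from the AlmaLinux cmdline above, and whether ttyS1,115200n8 is right for every Equinix plan is an assumption. The patch would be appended to config_patches on the existing talos_machine_configuration_apply.cp resources.

locals {
  # Serial console kernel args for the Talos installer; machine.install is applied
  # at install time, so this only takes effect on the next install/upgrade.
  serial_console_patch = yamlencode({
    machine = {
      install = {
        extraKernelArgs = [
          "console=tty0",
          "console=ttyS1,115200n8",
        ]
      }
    }
  })
}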
hh commented 8 months ago

Would be good to configure Talos networking to bond enp65s0f0 and enp65s0f1:

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp65s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff
    altname ens3f0
3: enp65s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff permaddr 40:a6:b7:6f:78:a1
    altname ens3f1
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 40:a6:b7:6f:78:a0 brd ff:ff:ff:ff:ff:ff
    inet 139.178.70.133/31 scope global noprefixroute bond0
       valid_lft forever preferred_lft forever
    inet 10.67.90.135/31 scope global noprefixroute bond0:0
       valid_lft forever preferred_lft forever
    inet6 2604:1380:45e3:4300::7/127 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::42a6:b7ff:fe6f:78a0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
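Sketch of the equivalent Talos config patch; 802.3ad/LACP and DHCP are assumptions to verify against how Equinix provisions these c3.medium ports.

locals {
  bond_patch = yamlencode({
    machine = {
      network = {
        interfaces = [
          {
            interface = "bond0"
            dhcp      = true # assumption; static addressing from the Equinix metadata may be needed instead
            bond = {
              mode       = "802.3ad" # assumption: LACP, as Equinix typically configures bonded ports
              lacpRate   = "fast"
              miimon     = 100
              interfaces = ["enp65s0f0", "enp65s0f1"]
            }
          }
        ]
      }
    }
  })
}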
hh commented 8 months ago

Upon first convergence, it would be nice for Talos to assign the apiserver EIP to one of the nodes before Kubernetes (CCM / CPEM) is fully up.

Does this have something to do with that?

             vip:
               ip: ${equinix_metal_reserved_ip_block.cluster_ingress_ip.address}
               equinixMetal:
                 apiToken: ${var.equinix_metal_auth_token}
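For context, that vip block sits under machine.network.interfaces in the machine config; a sketch of how it might be patched in, reusing the names from the fragment above and assuming the VIP rides on bond0:

locals {
  vip_patch = yamlencode({
    machine = {
      network = {
        interfaces = [
          {
            interface = "bond0" # assumption: the VIP floats on the bonded uplink
            dhcp      = true
            vip = {
              ip = equinix_metal_reserved_ip_block.cluster_ingress_ip.address
              equinixMetal = {
                apiToken = var.equinix_metal_auth_token
              }
            }
          }
        ]
      }
    }
  })
}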
hh commented 8 months ago

We now have one of our talos nodes upgraded to 1.6.6:

hh@m1 infra % talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.39 upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6
WARNING: 147.75.71.39: server version 1.6.0 is older than client version 1.6.6
◰ ◳ watching nodes: [147.75.71.39]
    * 147.75.71.39: stage: BOOTING ready: true unmetCond: []
Signal received, aborting, press Ctrl+C once again to abort immediately...
hh@m1 infra % kubectl get nodes -o wide                                                                                                                                                                       
NAME                  STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
sharing-io-icvoiyes   Ready    control-plane   21m   v1.29.2   147.75.71.39    <none>        Talos (v1.6.6)   6.1.80-talos     containerd://1.7.13
sharing-io-owyapupv   Ready    control-plane   21m   v1.29.2   147.75.71.237   <none>        Talos (v1.6.0)   6.1.67-talos     containerd://1.7.10
sharing-io-wagfsaif   Ready    control-plane   21m   v1.29.2   147.75.49.11    <none>        Talos (v1.6.0)   6.1.67-talos     containerd://1.7.10
hh commented 8 months ago

The upgrade wouldn't start due to some etcd members not being healthy after the upgrade of the first node.

h@m1 infra % talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.237  upgrade --preserve --image factory.talos.dev/installer/0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6:v1.6.6  
WARNING: 147.75.71.237: server version 1.6.0 is older than client version 1.6.6                                                                                                                                   
◰ watching nodes: [147.75.71.237]
    * 147.75.71.237: rpc error: code = DeadlineExceeded desc = error validating etcd for upgrade: etcd member 5ecbd721c01aa1eb is not healthy; all members must be healthy to perform an upgrade: context deadline
 exceeded   

I wonder how we can help that member be healthy and have the other two connect!

talosctl --talosconfig ~/.talos/config-sharing-io -n 147.75.71.39,147.75.71.237,147.75.49.11   etcd status
WARNING: 2 errors occurred:
        * 147.75.49.11: rpc error: code = DeadlineExceeded desc = failed to create etcd client: error building etcd client: context deadline exceeded
        * 147.75.71.237: rpc error: code = DeadlineExceeded desc = failed to create etcd client: error building etcd client: context deadline exceeded

NODE           MEMBER             DB SIZE   IN USE            LEADER             RAFT INDEX   RAFT TERM   RAFT APPLIED INDEX   LEARNER   ERRORS
147.75.71.39   5ecbd721c01aa1eb   3.6 MB    1.4 MB (39.86%)   5ecbd721c01aa1eb   4612         3           4612                 false     
hh commented 8 months ago

Might be good to separate these into modules:

hh commented 8 months ago

Figured it was worth ensuring Talos was widely aware of this major change: https://github.com/siderolabs/talos/issues/8411

hh commented 8 months ago

I was able to explicitly specify the installer version here: https://github.com/ii/infra/commit/44bdd4f66a2fafcbfb07f16f3cc6b36c8e576898

Which seems to have enabled the installer and extensions:

kubectl --server https://147.75.71.255:6443 get nodes -o wide
NAME                  STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
sharing-io-liouumoj   Ready    control-plane   29m   v1.29.2   147.75.49.25    <none>        Talos (v1.6.6)   6.1.80-talos     containerd://1.7.13
sharing-io-ruewrpcf   Ready    control-plane   29m   v1.29.2   147.75.71.255   <none>        Talos (v1.6.6)   6.1.80-talos     containerd://1.7.13
sharing-io-vfwhaqop   Ready    control-plane   29m   v1.29.2   147.75.71.237   <none>        Talos (v1.6.6)   6.1.80-talos     containerd://1.7.13
talosctl -n 147.75.71.255,147.75.71.237,147.75.49.25 get extensions
NODE            NAMESPACE   TYPE              ID            VERSION   NAME          VERSION
147.75.71.255   runtime     ExtensionStatus   0             1         iscsi-tools   v0.1.4
147.75.71.255   runtime     ExtensionStatus   1             1         mdadm         v4.2-v1.6.6
147.75.71.255   runtime     ExtensionStatus   2             1         zfs           2.1.14-v1.6.6
147.75.71.255   runtime     ExtensionStatus   3             1         gvisor        20231214.0-v1.6.6
147.75.71.255   runtime     ExtensionStatus   4             1         schematic     0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.71.255   runtime     ExtensionStatus   modules.dep   1         modules.dep   6.1.80-talos
147.75.71.237   runtime     ExtensionStatus   0             1         iscsi-tools   v0.1.4
147.75.71.237   runtime     ExtensionStatus   1             1         mdadm         v4.2-v1.6.6
147.75.71.237   runtime     ExtensionStatus   2             1         zfs           2.1.14-v1.6.6
147.75.71.237   runtime     ExtensionStatus   3             1         gvisor        20231214.0-v1.6.6
147.75.71.237   runtime     ExtensionStatus   4             1         schematic     0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.71.237   runtime     ExtensionStatus   modules.dep   1         modules.dep   6.1.80-talos
147.75.49.25    runtime     ExtensionStatus   0             1         iscsi-tools   v0.1.4
147.75.49.25    runtime     ExtensionStatus   1             1         mdadm         v4.2-v1.6.6
147.75.49.25    runtime     ExtensionStatus   2             1         zfs           2.1.14-v1.6.6
147.75.49.25    runtime     ExtensionStatus   3             1         gvisor        20231214.0-v1.6.6
147.75.49.25    runtime     ExtensionStatus   4             1         schematic     0c2f6ca92c4bb5f7b79de5849bd2e96e026df55e4c18939df217e4f7d092a7c6
147.75.49.25    runtime     ExtensionStatus   modules.dep   1         modules.dep   6.1.80-talos