giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Recreate Glippy CAPZ MC #2021

Closed primeroz closed 1 year ago

primeroz commented 1 year ago

Motivation

Related to https://github.com/giantswarm/roadmap/issues/2011

Recreate Glippy MC so we can proceed with the private WC implementation

TODO

For Glippy

primeroz commented 1 year ago

config repo changes to push

The following need to be added to mc-bootstrap. They cannot be moved to the defaults because that would affect vintage clusters.

nprokopic commented 1 year ago

Coming from this comment https://github.com/giantswarm/roadmap/issues/2011#issuecomment-1430034809

I would like us to "reserve" 10.223.0.0/16 in our VPN for CAPZ MCs (+WCs). It is the /16 range that comes right before CAPA goat. That should give us 16 MCs (+WCs), each with a 10.223.x.0/20 shared range. I hope we won't have to use this and that we will move sooner to a more scalable solution with private endpoints (or similar).

That being said, for glippy and its WCs, we would use:

Then the next test CAPZ MC (if/when we need one) would take 10.223.16.0/20 as its VPN shared range, broken down in a similar way to glippy above.
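The "16 MCs" figure can be checked with a short Python sketch using the standard `ipaddress` module. It only enumerates the /20 blocks inside the reserved /16. The variable names are illustrative, and which block goes to which MC beyond what the comment states is not decided here.

```python
import ipaddress

# Range proposed in the comment above for CAPZ MCs (+WCs) on the VPN.
reserved = ipaddress.ip_network("10.223.0.0/16")

# Splitting a /16 into /20 blocks yields 2 ** (20 - 16) = 16 shared ranges.
shared_ranges = list(reserved.subnets(new_prefix=20))

for index, block in enumerate(shared_ranges):
    print(index, block)
```

The second block in the list is 10.223.16.0/20, which matches the range mentioned for the next test CAPZ MC.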

primeroz commented 1 year ago

Ok, once the PR is merged we will create a new release for cluster-azure and, on glippy recreation, set

We currently don't support creating extra subnets in the network spec https://github.com/giantswarm/cluster-azure/blob/main/helm/cluster-azure/templates/_azure_cluster.tpl#L16

so for now we'll stick to this and get the bastion on the control plane subnet

nprokopic commented 1 year ago

Slightly updated the comment above and added the missing calculation (I hit ctrl-enter accidentally :grimacing:)

primeroz commented 1 year ago

Follow Up Issues

➜ kubectl --kubeconfig=./kubeconfigs/kind-mc-initial-glippy.kubeconfig get -n org-giantswarm machine.cluster.x-k8s.io/glippy-bastion-86cb948f85-h9lgk -o yaml | yq .status.phase
Provisioned

While for MD machines it is indeed Running

* Similar to ^ the `machineDeployment` for bastion reports `ScalingUp`

➜ kubectl --kubeconfig=./kubeconfigs/kind-mc-initial-glippy.kubeconfig get -n org-giantswarm machinedeployment.cluster.x-k8s.io/glippy-bastion -o yaml | yq .status.phase
ScalingUp

* this ^ also blocks the Pivot because the bastion is `Provisioned` and not `Running` :( 
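A rough Python sketch of the kind of check that would be blocked here, assuming the pivot step waits until every Machine reports `Running` (an inference from this thread, not confirmed against the mc-bootstrap code). The bastion name and phase come from the output above. The worker machine name and the function name are hypothetical.

```python
# Phases as reported above; "glippy-md00-example" is a hypothetical worker name.
machine_phases = {
    "glippy-bastion-86cb948f85-h9lgk": "Provisioned",
    "glippy-md00-example": "Running",
}


def machines_blocking_pivot(phases):
    """Return the names of machines whose phase is not Running, sorted."""
    return sorted(name for name, phase in phases.items() if phase != "Running")


blocking = machines_blocking_pivot(machine_phases)
print(blocking)
```

With the phases above, only the bastion machine is returned, which is consistent with the wait message that follows.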

"Machine" CRs are not ready waiting for 20s

➜ k get cm -n giantswarm glippy-chart-values -o yaml | egrep -i "baseDomain|managementCluster|provider"
baseDomain: test.gigantic.io
managementCluster: glippy
provider: capz
providerSpecific:


* trivy is pending `giantswarm          trivy-0                                                          0/1     Pending            0               57m`
* prometheus for glippy is crashlooping `glippy-prometheus   prometheus-glippy-0                                              1/2     CrashLoopBackOff   6 (4m6s ago)    16m` - `OOMing`
* PMO not creating heartbeat
* PMO is failing to create the VPA (for prometheus?), causing prometheus to OOM
* vpa patch pod failing
primeroz commented 1 year ago

test ssh bastion

$ gopass env giantswarm/opsctl/ opsctl ssh glippy glippy-md00-0f94be16-cl6bs -b glippy_auto_branch --level=debug 
$ ssh -J giantswarm@$(kubectl get machine glippy-bastion-694f789f4d-vpslr -o json | jq -r '.status.addresses[] | select( .type == "ExternalIP").address') giantswarm@10.233.0.7 hostname
glippy-md00-0f94be16-cl6bs
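For readers less familiar with the `jq` filter in the `ssh -J` command, here is a minimal Python equivalent of the address selection. The Machine object is a trimmed, hypothetical example and both addresses are made up (203.0.113.10 is from a documentation range). Only the `.status.addresses[]` shape is taken from the command above.

```python
import json

# Hypothetical, trimmed Machine object; the addresses are placeholders.
machine_json = """
{
  "status": {
    "addresses": [
      {"type": "InternalIP", "address": "10.0.0.4"},
      {"type": "ExternalIP", "address": "203.0.113.10"}
    ]
  }
}
"""


def external_ips(machine):
    """Mirror of: .status.addresses[] | select(.type == "ExternalIP").address"""
    addresses = machine.get("status", {}).get("addresses", [])
    return [a["address"] for a in addresses if a.get("type") == "ExternalIP"]


bastion_ips = external_ips(json.loads(machine_json))
print(bastion_ips)
```

The resulting address is what the command passes to `ssh -J` as the jump host.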

@nprokopic When we create the first Private cluster i am really curious to

primeroz commented 1 year ago

Test Create WC Cluster

➜ clusterctl describe cluster fctest1 --echo --grouping=false      
NAME                                                                             READY  SEVERITY  REASON  SINCE  MESSAGE 
Cluster/fctest1                                                                  True                     6m26s           
├─ClusterInfrastructure - AzureCluster/fctest1                                   True                     8m4s            
├─ControlPlane - KubeadmControlPlane/fctest1                                     True                     6m26s           
│ └─Machine/fctest1-c8x5h                                                        True                     6m26s           
│   ├─BootstrapConfig - KubeadmConfig/fctest1-nvtmf                              True                     8m3s            
│   └─MachineInfrastructure - AzureMachine/fctest1-control-plane-c17c01d5-strjq  True                     6m27s           
└─Workers                                                                                                                 
  ├─MachineDeployment/fctest1-bastion                                            True                     9m40s           
  │ └─Machine/fctest1-bastion-64f988dff6-hwq7l                                   True                     14s             
  │   ├─BootstrapConfig - KubeadmConfig/fctest1-bastion-973fd873-d7vvm           True                     6m34s           
  │   └─MachineInfrastructure - AzureMachine/fctest1-bastion-836b66f0-nzp7r      True                     14s             
  └─MachineDeployment/fctest1-md00                                               True                     5m5s            
    ├─Machine/fctest1-md00-b886b4db4-4lf4z                                       True                     5m11s           
    │ ├─BootstrapConfig - KubeadmConfig/fctest1-md00-5f92899e-dp9qj              True                     6m34s           
    │ └─MachineInfrastructure - AzureMachine/fctest1-md00-22b81e6f-5xfmr         True                     5m11s           
    ├─Machine/fctest1-md00-b886b4db4-77cx6                                       True                     5m10s           
    │ ├─BootstrapConfig - KubeadmConfig/fctest1-md00-5f92899e-nlwnz              True                     6m34s           
    │ └─MachineInfrastructure - AzureMachine/fctest1-md00-22b81e6f-bzr8z         True                     5m10s           
    └─Machine/fctest1-md00-b886b4db4-9rrf7                                       True                     5m11s           
      ├─BootstrapConfig - KubeadmConfig/fctest1-md00-5f92899e-7hr78              True                     6m34s           
      └─MachineInfrastructure - AzureMachine/fctest1-md00-22b81e6f-5bjt9         True                     5m11s       
NAME                                     INSTALLED VERSION   CREATED AT   LAST DEPLOYED   STATUS
fctest1                                  0.0.11              7m11s        7m11s           deployed
fctest1-app-operator                     6.4.2               7m7s         6m48s           deployed
fctest1-azure-cloud-controller-manager   1.24.6-gs1          3m31s        2m42s           deployed
fctest1-azure-cloud-node-manager         1.24.6-gs1          3m31s        3m22s           deployed
fctest1-azuredisk-csi-driver             1.26.2-gs1          3m31s        3m24s           deployed
fctest1-cert-exporter                    2.3.1               3m31s        2m44s           deployed
fctest1-chart-operator                   2.33.0              7m6s         3m29s           deployed
fctest1-cilium                           0.6.1               3m31s        3m20s           deployed
fctest1-coredns                          1.13.0              3m31s        2m43s           deployed
fctest1-default-apps                     0.0.11              7m11s        3m31s           deployed
fctest1-kube-state-metrics               1.14.2              3m31s        43s             deployed
fctest1-metrics-server                   2.0.0               3m31s        3m21s           deployed
fctest1-net-exporter                     1.13.0              3m31s        2m42s           deployed
fctest1-node-exporter                    1.15.0              3m31s        43s             deployed
fctest1-observability-bundle             0.1.9               3m31s        3m29s           deployed
fctest1-prometheus-agent                                     3m29s                        secret-merge-failed
fctest1-prometheus-operator-app          3.0.0               3m29s        40s             deployed
fctest1-prometheus-operator-crd          3.0.0               3m29s        42s             deployed
fctest1-vertical-pod-autoscaler          2.5.3               3m31s        2m39s           deployed
fctest1-vertical-pod-autoscaler-crd      1.0.1               3m31s        2m43s           deployed
➜ kubectl --kubeconfig /dev/shm/fctest1 get node    
NAME                                   STATUS   ROLES           AGE   VERSION
fctest1-control-plane-c17c01d5-strjq   Ready    control-plane   2m    v1.24.10
fctest1-md00-22b81e6f-5bjt9            Ready    <none>          54s   v1.24.10
fctest1-md00-22b81e6f-5xfmr            Ready    <none>          55s   v1.24.10
fctest1-md00-22b81e6f-bzr8z            Ready    <none>          58s   v1.24.10

➜ kubectl --kubeconfig /dev/shm/fctest1 get pod -A | grep -v Running | grep -v Completed
NAMESPACE     NAME                                                            READY   STATUS    RESTARTS   AGE
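The app listing above has one entry that is not `deployed`. A small Python sketch for spotting such rows in a pasted listing is shown below. It assumes the status is always the last whitespace-separated field, which holds for the rows in this thread but is not guaranteed in general. The rows are copied from the listing with shortened column spacing, and the function name is illustrative.

```python
# A few rows copied from the listing above (column spacing shortened).
app_listing = """\
NAME                              INSTALLED VERSION   CREATED AT   LAST DEPLOYED   STATUS
fctest1-observability-bundle      0.1.9               3m31s        3m29s           deployed
fctest1-prometheus-agent                              3m29s                        secret-merge-failed
fctest1-prometheus-operator-app   3.0.0               3m29s        40s             deployed
"""


def apps_not_deployed(listing):
    """Return (name, status) for every row whose STATUS is not 'deployed'."""
    problems = []
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        # Rows can have empty columns, so rely only on the first and last fields.
        name, status = fields[0], fields[-1]
        if status != "deployed":
            problems.append((name, status))
    return problems


print(apps_not_deployed(app_listing))
```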
primeroz commented 1 year ago

https://gigantic.slack.com/archives/C02GDJJ68Q1/p1676555870369069

primeroz commented 1 year ago

Had to fix the VPA CRD App by hand

We need to change the way we install vpa-crd in mc-bootstrap to do a helm install rather than the App install to avoid these problems

primeroz commented 1 year ago

All Issues from glippy recreate are now addressed

primeroz commented 1 year ago

I will close this one once the PMO is released and I have updated CAPZ-APP-COLLECTION

primeroz commented 1 year ago

Re-recreated glippy with the right 10.223 network CIDRs

primeroz commented 1 year ago

This is all done and PMO was released