giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

BigMac testing of AWS v20 together with Vintage-CAPA migration #3211

Closed T-Kukawka closed 7 months ago

T-Kukawka commented 9 months ago

The time has come to start testing the final releases as well as the migration from Vintage v20 to CAPA. We have created a dedicated Vintage MC, garfish, to perform any Vintage or migration testing in a stable environment. The dedicated CAPA cluster for migration will be the CAPA stable-testing MC grizzly.

We kindly ask all teams to perform comprehensive tests for 3 use cases, listed in priority order in case they cannot all be performed at once.

1. Vintage AWS v20

Cluster creation on garfish - giantswarm Organization

This is the last Vintage release, and it ships Kubernetes 1.25. Kubernetes 1.25 introduces a breaking change by removing PodSecurityPolicies (PSPs) from its API, meaning that all workloads will have to comply with the global toggle disabling PSPs, as in the 19.3.x release. Before making the v20 release available to customers, we need to validate that all applications are running smoothly. The Vintage tests are standard as always: you just create a cluster with the v20 release and validate your applications. A separate stable MC guarantees stability and that no manual changes are made to the release.
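As a rough illustration of the PSP removal, the sketch below flags manifests that still define a PodSecurityPolicy object, which the Kubernetes 1.25 API no longer serves. The helper name `find_psp_resources` and the sample manifests are hypothetical; it assumes the manifests have already been parsed into dictionaries (for example from rendered chart YAML) and is not part of any Giant Swarm tooling.

```python
from typing import Iterable, List


def find_psp_resources(manifests: Iterable[dict]) -> List[str]:
    """Return the names of manifests whose kind is PodSecurityPolicy."""
    flagged = []
    for manifest in manifests:
        if manifest.get("kind") == "PodSecurityPolicy":
            # Fall back to a placeholder if the manifest carries no name.
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            flagged.append(name)
    return flagged


# Hypothetical sample input: one PSP and one unrelated Deployment.
sample_manifests = [
    {
        "apiVersion": "policy/v1beta1",
        "kind": "PodSecurityPolicy",
        "metadata": {"name": "restricted"},
    },
    {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "my-app"},
    },
]

print(find_psp_resources(sample_manifests))  # ['restricted']
```

Any name reported here points at a resource that would need to be removed or gated before the application runs on a 1.25 cluster.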

2. CAPA 0.60.0

Cluster creation on grizzly - giantswarm Organization - be aware that this is a production MC, so it will page everyone. In practice, any CAPA MC should work for this test.

Starting with cluster-aws-v0.60.0 and default-apps-aws-v0.45.1, CAPA supports Kubernetes 1.25 with all the features needed to run our workloads in the same manner as on Vintage clusters. For testing, please always use the latest cluster-aws and default-apps-aws releases.

3. Vintage to CAPA migration

Cluster creation for migration on garfish - capa-migration-testing Organization. Clusters will be migrated to grizzly - capa-migration-testing Organization.

Phoenix and Honeybadger worked extensively on making the migration as smooth as possible. The migration-cli has been introduced to orchestrate the migration of apps as well as infrastructure. The main point here is to discover whether your application, and any custom configuration that customers could apply, is migrated properly.

The migration-cli has been extended to facilitate easy testing for all teams at Giant Swarm. Please follow the requirements as well as the procedure described in the tests section of the tool. In case of any issue with infrastructure, ping Phoenix; if the app/configmap migration runs into any issues or inconsistencies, ping Honeybadger.

ssyno commented 9 months ago

Migration completed successfully. BigMac applications are deployed on the target CAPI MC golem and function as expected.

...
Deleted vintage au6g2 node pool ASG.

Executing the following command to apply non-default apps to CAPI MC via external tool:
app-migration-cli apply -s garfish -d golem -n spyros02 -o org-capa-migration-testing
Connected to gs-garfish, k8s server version v1.24.17
Connected to gs-golem, k8s server version v1.24.16

All prerequistes are found on the new MC for app migration
Applying all non-default APP CRs to MC
All non-default apps applied successfully.

Apps (0) applied successfully to golem-spyros02
Finalizer removed on NS: garfish/spyros02
Finished migrating cluster spyros02 to CAPI infrastructure

On the new cluster we can see that everything migrated smoothly:

k get apps -norg-capa-migration-testing|grep spyros02
spyros02                               0.60.0              18m          13m             deployed
spyros02-app-operator                  6.10.0              18m          18m             deployed
spyros02-athena                        1.12.1              63s          59s             deployed
spyros02-aws-pod-identity-webhook      1.14.1              18m          14m             deployed
spyros02-capi-node-labeler             0.5.0               18m          11m             deployed
spyros02-cert-exporter                 2.8.5               18m          15m             deployed
spyros02-cert-manager                  3.7.0               18m          11m             deployed
spyros02-chart-operator                3.1.0               18m          11m             deployed
spyros02-chart-operator-extensions                         18m                          already-exists
spyros02-cluster-autoscaler            1.27.3-gs3          18m          11m             deployed
spyros02-default-apps                  0.45.1              18m          18m             deployed
spyros02-default-ingress-nginx         3.5.1               63s          58s             deployed
spyros02-default-rbac-bootstrap        0.2.1               63s          43s             deployed
spyros02-dex-app                       1.42.8              62s          1s              deployed
spyros02-etcd-k8s-res-count-exporter   1.8.0               18m          14m             deployed
spyros02-external-dns                  2.42.0              18m          11m             deployed
spyros02-grafana-agent                 0.3.2               18m          11m             deployed
spyros02-kube-prometheus-stack         8.1.1               18m          13m             deployed
spyros02-kyverno                       0.16.4              18m          15m             deployed
spyros02-kyverno-policies              0.20.2              18m          11m             deployed
spyros02-kyverno-policy-operator       0.0.6               18m          11m             deployed
spyros02-metrics-server                2.4.2               18m          13m             deployed
spyros02-net-exporter                  1.18.2              18m          15m             deployed
spyros02-node-exporter                 1.18.2              18m          11m             deployed
spyros02-observability-bundle          1.0.0               18m          18m             deployed
spyros02-prometheus-agent              0.6.6               18m          11m             deployed
spyros02-prometheus-operator-crd       8.0.0               18m          11m             deployed
spyros02-promtail                      1.4.1               18m          11m             deployed
spyros02-security-bundle               1.5.0               18m          18m             deployed
spyros02-teleport-kube-agent           0.7.0               18m          15m             deployed
spyros02-vertical-pod-autoscaler       4.6.0               18m          11m             deployed
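
A listing like the one above can be checked for apps that are not in the `deployed` state with a few lines of Python. This is only a sketch: the helper name `apps_not_deployed` is hypothetical, and it assumes the status is the last whitespace-separated column and that the header row has already been filtered out (as the `grep` above does).

```python
def apps_not_deployed(listing: str) -> dict:
    """Map app name to status for every row whose status is not 'deployed'."""
    result = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        # The version column may be empty, so only rely on first and last field.
        name, status = fields[0], fields[-1]
        if status != "deployed":
            result[name] = status
    return result


# Three rows copied from the listing above.
sample = """\
spyros02-chart-operator                3.1.0               18m          11m             deployed
spyros02-chart-operator-extensions                         18m                          already-exists
spyros02-dex-app                       1.42.8              62s          1s              deployed
"""

print(apps_not_deployed(sample))
# {'spyros02-chart-operator-extensions': 'already-exists'}
```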

The only issue is with certificate generation by cert-manager-app. It is related to external-dns, which, according to the following logs, cannot retrieve credentials.

time="2024-02-06T21:28:42Z" level=info msg="Instantiating new Kubernetes client"
time="2024-02-06T21:28:42Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2024-02-06T21:28:42Z" level=info msg="Created Kubernetes client https://172.31.0.1:443"
time="2024-02-06T21:29:19Z" level=error msg="records retrieval failed: failed to list hosted zones: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: Couldn't retrieve verification key from your identity provider,  please reference AssumeRoleWithWebIdentity documentation for requirements\n\tstatus code: 400, request id: b615ff15-46d4-469d-960a-a028d7410be7"

T-Kukawka commented 8 months ago

I have made adjustments to the tracking ticket as well as the teams' tickets regarding the CAPA and migration testing instructions.

TL;DR: Testing of CAPA/Migration is moved from gazelle to grizzly

Initially gazelle was chosen for testing the CAPA migration because it is a production MC, meaning the most stable one. However, this has resulted in unforeseen pages to the kaas-cloud on-call, which we would like to limit.

We do recognise these pages and are actively working on testing, so they are just a distraction from the operations clusters to which most teams have migrated the GS production workloads.

Taking all the facts into consideration, we have decided that it would be best to move the testing to grizzly, which is a stable-testing installation. This installation primarily runs e2e tests and is treated as stable (no changes on the MCs). Thanks for understanding, and let us know if something is not working.

T-Kukawka commented 8 months ago

@gawertm can we close this?

gawertm commented 8 months ago

There was an issue with external-dns which most likely was not related to our apps. We wanted to monitor whether the external-dns issue got fixed and then close this. If that's the case, we can close, yes.

T-Kukawka commented 8 months ago

@gawertm I believe this was the issue? https://github.com/giantswarm/giantswarm/issues/29985

gawertm commented 8 months ago

Yes, I think so. Maybe @ssyno can confirm, he worked on that.