First things first: I connected with the jenkins-infra-team account (and its shared TOTP for 2FA) and was able to confirm we have the $60,000 credits:
Update: proposal to bootstrap the AWS account. To be discussed and validated during the next weekly team meeting.
Root account:
Each Jenkins Infra team member ("OPS") will have a nominative AWS account with mandatory password and MFA, no API access (only Web Console) and only the permission to assume a role based on their "trust" level.
The following roles are proposed:
- `infra-admin`: allows management of usual resources (EC2, EKS, S3, etc.) but also read-only access to billing
- `infra-user`: allows management of usual resources (EC2, EKS, S3, etc.)
- `infra-read`: allows read-only access to usual resources (EC2, EKS, S3, etc.)

The infrastructure as code (jenkins-infra/aws, Terraform project) will have 2 IAM users, and each one will only be able to assume a role.
The "Assume Role" approach means AWS STS will be used to generate tokens valid for 1 hour (i.e. whether the Web Console or the API is used, the credential is only valid for 1 hour). It will require additional commands for end users or Terraform, but it avoids keeping API keys unchanged for months (years?).
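For illustration only (account IDs, user and role names are placeholders, not the real jenkins-infra setup), the pattern could look like this in Terraform, with the role capped at 1-hour sessions and the provider assuming it:

```hcl
# Hypothetical "infra-admin" role: only nominative OPS users with MFA can assume it,
# and STS sessions are capped at 1 hour.
resource "aws_iam_role" "infra_admin" {
  name                 = "infra-admin"
  max_session_duration = 3600

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = ["arn:aws:iam::123456789012:user/ops-jdoe"] } # placeholder user
      Condition = { Bool = { "aws:MultiFactorAuthPresent" = "true" } }
    }]
  })
}

# Terraform itself would assume a role through its AWS provider configuration:
provider "aws" {
  region = "us-east-2" # placeholder region
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/infra-admin" # placeholder ARN
    session_name = "terraform-jenkins-infra"
  }
}
```

End users would do the equivalent with `aws sts assume-role` (plus their MFA code) before calling the API.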
We won't use the AWS IAM Identity Center as it is overkill (we only have one AWS account with just a few resources).
We won't deploy anything outside of a base region (possibly 2), in a single AZ per region (no HA: if it fails, then it fails).
The scope of resources must only be ephemeral workloads, ideally for ci.jenkins.io: it is a public service, so its workloads are considered unsafe and untrusted by default (no mix-up with other controllers such as infra.ci.jenkins.io).
Update:
Update: proposal for the new AKS cluster to be created soon: `cijenkinsio-agents-1`. This name is valid as per https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/aks-common-issues-faq#what-naming-restrictions-are-enforced-for-aks-resources-and-parameters-
- Prefixed with the controller name (`cijenkinsio`) to make identification easier
- `agents` wording to make explicit this is the only acceptable usage for this cluster
- `-1` as we'll most probably need to create more clusters in the future (AWS and possibly DOKS): migration will be easier if we can increment while keeping the same naming convention

The target virtual network is a `/14` and already has 2 x `/24` subnets. Need to carefully plan the sizing of the new subnet with the AKS network rules + sizing of nodes and pods.

Node pool naming convention:
- OS on 1 character (`l` for Azure Linux, `w` for Windows, `u` for Ubuntu Linux)
- CPU architecture (`x86` for Intel/AMD x86_64 or `a64` for arm64), kept short (instead of e.g. `lin`)
- Max number of pod agents per node, prefixed by `n` (`n3`, `n24`, etc.) on 3 chars max

Examples:
- `lx86n3` => Azure Linux x86_64 nodes which can run 3 "normal" pod agents at the same time
- `lx86n4bom` => Azure Linux x86_64 nodes which can run 4 ("bom" only) pod agents at the same time
- `ua64n24bom` => Ubuntu Linux arm64 nodes which can run 24 ("bom" only) pod agents at the same time
- `la64n2side` => Linux arm64 nodes used to run 2 "side" pods (e.g. custom applications such as ACP)
- Windows nodes would use `w` (of course) plus the version (`2019`, `2022`, etc.)

Mapping from the current node pools (`cik8s` EKS cluster):
- `tiny_ondemand_linux` => will be a syspool following AKS good practices (HA, etc.). Should only host the Azure or AKS technical side-services, not ours (CSI, CNI, etc.)
- `default_linux_az1` => `la64n2app` (2 "app" pods per node: ACP and the Datadog cluster agent); could become `la64n3app` if we add falco or any other tool
- `spot_linux_4xlarge` => `lx86n3agt1` (Azure Linux node pool for agents number "1" supporting 3x pod agents), with a minimum of `0` and a maximum of `50` nodes (same as EKS)
- `spot_linux_4xlarge_bom` => `lx86n3bom1` (Azure Linux node pool number "1" for BOM only supporting 3x pod agents); same as `spot_linux_4xlarge` except taints to be added to ensure only `bom` builds are using this node pool
- `spot_linux_24xlarge_bom` => not retained
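To make the node pool mapping concrete, here is a minimal Terraform sketch (azurerm v3 attribute names) of the two agent pools. The VM size, taint key and cluster/subnet references are illustrative assumptions, not the actual jenkins-infra code; the cluster resource itself is sketched further below in the network section.

```hcl
# Illustrative only: x86_64 spot pool for "normal" agent pods, autoscaling 0..50 nodes.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3agt1" {
  name                  = "lx86n3agt1"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D16ads_v5" # placeholder x86_64 size
  os_sku                = "AzureLinux"
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 50 # same bounds as the EKS spot_linux_4xlarge pool
}

# Same shape, but tainted so that only BOM builds (which tolerate the taint) land here.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3bom1" {
  name                  = "lx86n3bom1"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D16ads_v5"
  os_sku                = "AzureLinux"
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 50
  node_taints           = ["jenkins/bom=true:NoSchedule"] # hypothetical taint key
}
```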
Network considerations:
We'll create a private cluster: https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=azure-portal to ensure no external access to the API server is possible
The selected network mode will be "Azure CNI Overlay" as per https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#choosing-a-network-model-to-use
No inbound method is expected (we won't use an inbound LB)
The outbound method should be a "user-assigned NAT gateway", which will be the NAT gateway associated with the "public-sponsorship" network (same as the ci.jenkins.io VM and ACI agents)
IP addresses planning (ref. https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#ip-address-planning)
- `cik8s` was set up to handle a maximum of 117 nodes (102 without the experimental 24x node pool we won't add in AKS) with 30 pods per node max
- `eks-public` was set up to handle a maximum of 4 nodes with 15 pods max per node
- A `/24` subnet for nodes, allowing ~250 nodes max, is enough => if we hit a limit we can add more pods per node!
- The `/24` internal CIDR per node is good enough
- We'll use `10.50.0.0/24` to ensure no overlap with ANY of the peered networks. Note that a `/24` is mandatory
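A minimal Terraform sketch of the network choices above (private cluster, Azure CNI Overlay, outbound through the user-assigned NAT gateway). Resource names, the DNS prefix, the subnet reference and the pod CIDR are placeholders, not the real jenkins-infra values:

```hcl
# Illustrative only (azurerm v3 attribute names).
resource "azurerm_kubernetes_cluster" "cijenkinsio_agents_1" {
  name                    = "cijenkinsio-agents-1"
  location                = azurerm_resource_group.ci_agents.location
  resource_group_name     = azurerm_resource_group.ci_agents.name
  dns_prefix              = "cijenkinsioagents1"
  private_cluster_enabled = true       # no public API server endpoint
  sku_tier                = "Standard" # better control plane QoS than "Free"

  default_node_pool {
    name           = "syspool"
    vm_size        = "Standard_D4pds_v5"
    node_count     = 1
    vnet_subnet_id = azurerm_subnet.ci_agents.id # the new /24 subnet for nodes
  }

  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay"                # Azure CNI Overlay
    outbound_type       = "userAssignedNATGateway" # NAT gateway of the "public-sponsorship" network
    pod_cidr            = "100.64.0.0/16"          # placeholder: must not overlap any peered network
  }

  identity {
    type = "SystemAssigned"
  }
}
```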
Node sizing considerations:
- We need `x86_64` CPUs for the build agents; let's use ARM64 for the others
- `Standard_D4pds_v5` for the system pool: "System node pools require a VM SKU of at least 4 vCPUs and 4GB memory."
- `Standard_D4pds_v5` with ephemeral storage
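Following the same assumptions, an ARM64 "applications" pool using `Standard_D4pds_v5` with an ephemeral OS disk could look like this (counts and references are illustrative, not the real configuration):

```hcl
# Illustrative only: ARM64 pool for the "side"/"app" workloads, not for build agents.
resource "azurerm_kubernetes_cluster_node_pool" "la64n2app" {
  name                  = "la64n2app"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D4pds_v5" # ARM64, 4 vCPUs / 16 GB, local temp disk
  os_sku                = "AzureLinux"
  os_disk_type          = "Ephemeral"         # ephemeral OS disk on the local SSD
  node_count            = 1
  vnet_subnet_id        = azurerm_subnet.ci_agents.id
}
```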
Update: first wave of PRs on the network part:
Update: network rules for the cluster (port `443`) => incoming PR
Update: the cluster is created after many retries:
A possible improvement (for `privatek8s` in the future for instance) would be to use an Azure Firewall instead of NSGs to control inbound/outbound requests, as described in https://learn.microsoft.com/en-us/azure/firewall/protect-azure-kubernetes-service

=> cluster is now created, with its node pools, and the Terraform project works as expected. Access works from ci.jenkins.io AND through VPN.
Next steps:
Update:
Update: added the service account tokens `ci.jenkins.io-agents-1-jenkins-agent-sa-token` and `ci.jenkins.io-agents-1-jenkins-agent-bom-sa-token`.
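For reference, a long-lived service account token like these can be declared with the Terraform kubernetes provider; this is only a sketch, and the namespace, account and secret names are assumptions rather than the actual jenkins-infra resources:

```hcl
# Hypothetical sketch: a service account plus an explicitly requested token secret
# (Kubernetes >= 1.24 no longer auto-creates token secrets for service accounts).
resource "kubernetes_service_account_v1" "jenkins_agent" {
  metadata {
    name      = "jenkins-agent"
    namespace = "jenkins-agents"
  }
}

resource "kubernetes_secret_v1" "jenkins_agent_sa_token" {
  metadata {
    name      = "jenkins-agent-sa-token"
    namespace = "jenkins-agents"
    annotations = {
      "kubernetes.io/service-account.name" = kubernetes_service_account_v1.jenkins_agent.metadata[0].name
    }
  }
  type = "kubernetes.io/service-account-token"
}
```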
Next step: puppet template for ci.jenkins.io to draft a basic new Kubernetes template (goal: validate we can spin up agents with the initial configuration)
Update: initial verification of the new AKS cluster worked successfully:
```groovy
node('maven-17-helpdesk-3954') {
    sh 'mvn -v'
}
```
Next step:
Update:
WiP: running a `master` build manually on the agents. Two issues found:
- `WARNING: invalid or unavailable artifact caching proxy provider 'azure-aks-internal' requested by the agent, will use repo.jenkins-ci.org` => pipeline-library code to update!
- `io.fabric8.kubernetes.client.KubernetesClientException: Received 403 on websocket. Failure executing: GET at: https://cijenkinsioagents1-<redacted>.azmk8s.io:443/api/v1/namespaces/jenkins-agents/pods?allowWatchBookmarks=true&watch=true. Message: Forbidden.` error message => gotta check kubeconfig AND the svc account token

Update: we are ready to roll! The pipeline library now handles the `*ks-internal` ACP.

Update: let's go live in production! cc @MarkEWaite @smerle33 for info
Update:
Beware that @Vlatombe found some issues affecting the `kubernetes` plugin on AKS relating (as I recall) to scalability issues with the API server.
Thanks for the reminder! I remember it to be when using a non-default jnlp container in pod agents. If that's the case then no problem, as we only use an all-in-one image.
Additionally, we have set up the cluster to use a better QoS on the control plane ("Standard" tier instead of "Free" tier), which was introduced recently (it was recommended by the Azure "clippy").
We will watch the behavior carefully given your warning!
> I remember it to be when using a non-default jnlp container in pod agents.

My recollection is that it affected all agents, not just those using `container`, and the problem was that this failed.
Good to know: it might have impacts on the BOM or in the long term. Since we're using a distinct cluster only for the agents, any issue would be confined to this scope. Hopefully we won't run into it 🤞 Worst case, we'll have to wait until August before it's gone (after that we won't have any more Azure credits to run containers in Azure: we'll switch back to a new AWS account and/or DigitalOcean).
After 3 days (and a BOM release proving that ci.jenkins.io works well with the new Kubernetes Linux agents), we can start decommissioning the former clusters `cik8s`, `eks-public`, `doks` and `doks-public` with the following steps:
Update:
- Stop managing these clusters: https://github.com/jenkins-infra/kubernetes-management/pull/5243
- Remove ci.jenkins.io configurations for these clusters: https://github.com/jenkins-infra/jenkins-infra/pull/3442
- Delete these clusters from clouds (jenkins-infra/aws and jenkins-infra/digitalocean)

=> also, forgot to disable monitors; reminded by @smerle33 and done in https://github.com/jenkins-infra/datadog/pull/250
Update: this issue is closable:
- Cleanup is finished, including the `doks-public` cluster (for ACP and PoC of the new update center)
- Confirmation that the AWS incurred costs are decreasing since 17 May 2024:
Service(s)
AWS, Azure, ci.jenkins.io, sponsors
Summary
Today, ci.jenkins.io utilizes 2 EKS clusters to spin up ephemeral agents (for plugin and BOM builds). These clusters are hosted in a CloudBees-sponsored account (historically used to host a lot of Jenkins services).
We want to move these clusters out of the CloudBees AWS account to ensure non-CloudBees Jenkins contributors can manage them, and to use credits from other sponsors, as AWS, DigitalOcean and Azure gave us credits to be used.
Initial working path (destination: AWS sponsored account)
Updated working path
As discussed during the 2 previous infra SIG meetings, we have around $28k of credits on the Azure sponsored account which expire at the end of August 2024 (it was May 2024 but @MarkEWaite asked for an extension of this deadline ❤️ ), while both the DigitalOcean and AWS (non-CloudBees) accounts have credits until January 2025.
=> As such, let's start by using a Kubernetes cluster in Azure (sponsored) AKS to use these credits until end of summer before moving to the new AWS account
Notes 📖
A few elements for planning these migrations:
This is a good opportunity to re-assess the naming convention we used for the jenkins-infra/aws project: `cik8s` and `eks-public` for instance...
The Terraform module for EKS has a major version upgrade (20.x) currently waiting: https://github.com/jenkins-infra/aws/pull/517 . It features breaking changes around the management of the EKS configmap. Using the new module version on a fresh new cluster would avoid a tedious migration of existing ones...
We have an upcoming Kubernetes 1.27 upgrade: it will most probably be applied to the AWS clusters beforehand, but we have to keep it in mind.
We'll have to define at least 2 different AWS providers in the Terraform project to allow management of both accounts at the same time: https://build5nines.com/terraform-deploy-to-multiple-aws-accounts-in-single-project/ (we already have this kind of pattern with Azure)
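A minimal sketch of that pattern with aliased providers (account IDs, role ARN and module paths are placeholders):

```hcl
# Historical CloudBees-sponsored account.
provider "aws" {
  alias  = "cloudbees"
  region = "us-east-2" # placeholder region
}

# New sponsored account, accessed through an assumed role.
provider "aws" {
  alias  = "sponsored"
  region = "us-east-2"
  assume_role {
    role_arn = "arn:aws:iam::210987654321:role/infra-admin" # placeholder ARN
  }
}

# Each module then selects its account explicitly.
module "cik8s" {
  source    = "./modules/eks-cluster" # hypothetical module path
  providers = { aws = aws.cloudbees }
}

module "new_eks_cluster" {
  source    = "./modules/eks-cluster"
  providers = { aws = aws.sponsored }
}
```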
Reproduction steps
No response