jenkins-infra / helpdesk

Open your infrastructure-related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

[ci.jenkins.io] Migrate ci.jenkins.io EKS clusters out from CloudBees AWS account #3954

Closed: dduportal closed this issue 5 months ago

dduportal commented 8 months ago

Service(s)

AWS, Azure, ci.jenkins.io, sponsors

Summary

Today, ci.jenkins.io uses 2 EKS clusters to spin up ephemeral agents (for plugin and BOM builds). These clusters are hosted in a CloudBees-sponsored AWS account (historically used to host many Jenkins services).

We want to move these clusters out of the CloudBees AWS account, both so that non-CloudBees Jenkins contributors can manage them and to use credits from other sponsors, as AWS, DigitalOcean and Azure all gave us credits.

Initial working path (destination: AWS sponsored account)

AWS is sponsoring the Jenkins project with $60,000 for 2024, applied to a brand-new AWS account.

We want to migrate the 2 clusters used by ci.jenkins.io into this new AWS account:

  • Moving out of the CloudBees-owned AWS account allows non-CloudBees employees to help manage these resources
  • Consuming these credits is key to ensuring the sponsorship continues over the long term

Updated working path

As discussed during the 2 previous infra SIG meetings, we have around $28k of credits on the Azure-sponsored account, which expire at the end of August 2024 (originally May 2024, but @MarkEWaite asked for an extension of this deadline ❤️), while both the DigitalOcean and AWS (non-CloudBees) accounts have credits until January 2025.

=> As such, let's start by using a Kubernetes cluster on the sponsored Azure account (AKS) to consume these credits until the end of summer, before moving to the new AWS account


Notes 📖

A few elements for planning these migrations:

Reproduction steps

No response

dduportal commented 7 months ago

First things first: I connected with the jenkins-infra-team account (and its shared TOTP for 2FA) and was able to confirm we have the $60,000 credits:

(screenshot: AWS console confirming the credits, 2024-04-03)

dduportal commented 7 months ago

Update: proposal to bootstrap the AWS account. To be discussed and validated during the next weekly team meeting.

dduportal commented 6 months ago

Update:

dduportal commented 6 months ago

Update: proposal for the new AKS cluster to be soon created:

dduportal commented 6 months ago

Network considerations:

dduportal commented 6 months ago

Nodes sizing considerations:

dduportal commented 5 months ago

Update: first wave of PRs on the network part:

dduportal commented 5 months ago

Update:

dduportal commented 5 months ago

Update: the cluster was created after many retries:

=> The cluster is now created with its node pools, and the Terraform project works as expected. Access works from ci.jenkins.io AND through the VPN.

Next steps:

dduportal commented 5 months ago

Update:

dduportal commented 5 months ago

Update:

Next step: a Puppet template for ci.jenkins.io to draft a basic new Kubernetes agent template (goal: validate we can spin up agents with the initial configuration). See the sketch below.
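
While the actual ci.jenkins.io configuration goes through a Puppet template, what that template ultimately declares is a Kubernetes cloud with a pod template. Purely as an illustration of what such a definition amounts to, here is a minimal script-console sketch (the cloud name, server URL, namespace and pod-template name below are all hypothetical):

import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
import org.csanchez.jenkins.plugins.kubernetes.PodTemplate

// All values are hypothetical, for illustration only.
def cloud = new KubernetesCloud('aks-ci-agents')
cloud.setServerUrl('https://example-aks.hcp.eastus2.azmk8s.io')  // AKS API server (hypothetical)
cloud.setNamespace('jenkins-agents')                             // agent namespace (hypothetical)
cloud.setJenkinsUrl('https://ci.jenkins.io/')

def pod = new PodTemplate()
pod.setName('maven-17')
pod.setLabel('maven-17-helpdesk-3954')  // the label builds use to request an agent
cloud.addTemplate(pod)

Jenkins.get().clouds.add(cloud)
Jenkins.get().save()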

dduportal commented 5 months ago

Update: initial verification of the new AKS cluster worked successfully:

// Request an ephemeral agent from the new AKS cluster via its pod-template label,
// then check the Maven installation inside it.
node('maven-17-helpdesk-3954') {
    sh 'mvn -v'
}
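
As a hypothetical extension of the same check (not what was actually run), the job can also print which node the build landed on, to confirm the agent really comes from the new cluster:

node('maven-17-helpdesk-3954') {
    echo "Running on ${env.NODE_NAME}"  // node name reported by the Kubernetes agent
    sh 'mvn -v'
}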

Next step:

dduportal commented 5 months ago

Update:

WIP:

dduportal commented 5 months ago

Update: we are ready to roll!

dduportal commented 5 months ago

Update: let's go live in production! cc @MarkEWaite @smerle33 for info

dduportal commented 5 months ago

Update:

jglick commented 5 months ago

Beware that @Vlatombe found some issues affecting the kubernetes plugin on AKS, related (as I recall) to scalability issues with the API server.

dduportal commented 5 months ago

> Beware that @Vlatombe found some issues affecting the kubernetes plugin on AKS, related (as I recall) to scalability issues with the API server.

Thanks for the reminder! I remember it happening when using a non-default jnlp container in pod agents. If that's the case, then no problem, as we only use a single all-in-one image.

Additionally, we have set up the cluster to use a better QoS on the control plane (the "Standard" tier instead of the "Free" tier), which was introduced recently (it was recommended by the Azure "clippy").

We will watch the behavior carefully given your warning!
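
For context, a "non-default jnlp container" means a pod template that overrides the inbound agent container, along these lines (a minimal sketch; the image name is hypothetical). Our pod templates instead use a single all-in-one image:

// Sketch of a pod template overriding the default "jnlp" (inbound agent)
// container; the image name is hypothetical.
podTemplate(containers: [
    containerTemplate(name: 'jnlp', image: 'example.org/custom-inbound-agent:latest')
]) {
    node(POD_LABEL) {
        sh 'mvn -v'
    }
}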

jglick commented 5 months ago

> I remember it happening when using a non-default jnlp container in pod agents.

My recollection is that it affected all agents, not just those using the container step, and the problem was that this failed.

dduportal commented 5 months ago

> > I remember it happening when using a non-default jnlp container in pod agents.
>
> My recollection is that it affected all agents, not just those using the container step, and the problem was that this failed.

Good to know: it might have impacts on the BOM or over the long term. Since we're using a distinct cluster only for the agents, any issue would be cordoned off to this scope. Hopefully we won't run into it 🤞 Worst case, we'll have to wait until August before it's gone (after that, we won't have any more Azure credits to run containers in Azure: we'll switch to the new AWS account and/or DigitalOcean)

dduportal commented 5 months ago

After 3 days (and a BOM release proving that ci.jenkins.io works well with the new Kubernetes Linux agents), we can start decommissioning the former clusters cik8s, eks-public, doks and doks-public with the following steps:

dduportal commented 5 months ago

Update:

  • Stop managing these clusters: https://github.com/jenkins-infra/kubernetes-management/pull/5243
  • Remove ci.jenkins.io configurations for these clusters: https://github.com/jenkins-infra/jenkins-infra/pull/3442
  • Delete these clusters from the cloud providers (jenkins-infra/aws and jenkins-infra/digitalocean)

=> Also, I forgot to disable the monitors; @smerle33 reminded me, and it was done in https://github.com/jenkins-infra/datadog/pull/250

dduportal commented 5 months ago

Update: this issue is closable:

(screenshot, 2024-05-21)