First things first: I connected with the jenkins-infra-team account (and its shared TOTP for 2FA) and was able to confirm we have the $60,000 credits:
Update: proposal to bootstrap the AWS account. To be discussed and validated during the next weekly team meeting.
Root account:
Each Jenkins Infra team member ("OPS") will have a nominative AWS account with mandatory password and MFA, no API access (only Web Console) and only the permission to assume a role based on their "trust" level.
The following roles are proposed:
- `infra-admin`: allows management of usual resources (EC2, EKS, S3, etc.) but also read-only access to billing
- `infra-user`: allows management of usual resources (EC2, EKS, S3, etc.)
- `infra-read`: allows read-only access to usual resources (EC2, EKS, S3, etc.)

The infrastructure as code (jenkins-infra/aws, Terraform project) will have 2 IAM users, and each one will only be able to assume a role.
The "Assume Role" approach means AWS STS will be used to generate tokens valid for 1 hour (i.e. whether the Web Console or the API is used, the credential is only valid for 1 hour). It will require additional commands for end users or Terraform, but it avoids keeping API keys unchanged for months (years?).
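For illustration only (account IDs, user and role names are placeholders, not the real jenkins-infra setup), the pattern could look like this in Terraform, with the role capped at 1-hour sessions and the provider assuming it:

```hcl
# Hypothetical "infra-admin" role: only nominative OPS users with MFA can assume it,
# and STS sessions are capped at 1 hour.
resource "aws_iam_role" "infra_admin" {
  name                 = "infra-admin"
  max_session_duration = 3600

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = ["arn:aws:iam::123456789012:user/ops-jdoe"] } # placeholder user
      Condition = { Bool = { "aws:MultiFactorAuthPresent" = "true" } }
    }]
  })
}

# Terraform itself would assume a role through its AWS provider configuration:
provider "aws" {
  region = "us-east-2" # placeholder region
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/infra-admin" # placeholder ARN
    session_name = "terraform-jenkins-infra"
  }
}
```

End users would do the equivalent with `aws sts assume-role` (plus their MFA code) before calling the API.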
We won't use the AWS IAM Identity Center as it is overkill (we only have one AWS account with just a few resources).
We won't deploy anything outside of a base region (possibly 2), in a single AZ per region (no HA: if it fails, then it fails).
The scope of resources must only be ephemeral workloads, ideally for ci.jenkins.io: it is a public service, so its workloads are considered unsafe and untrusted by default (no mix-up with other controllers such as infra.ci.jenkins.io).
Update:
Update: proposal for the new AKS cluster to be created soon: `cijenkinsio-agents-1`. This name is valid as per https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/aks-common-issues-faq#what-naming-restrictions-are-enforced-for-aks-resources-and-parameters-
- Prefixed with the controller name (`cijenkinsio`) to make identification easier
- `agents` wording to make explicit this is the only acceptable usage for this cluster
- `-1` as we'll most probably need to create more clusters in the future (AWS and possibly DOKS): migration will be easier if we can increment while keeping the same naming convention

The target virtual network is a `/14` and already has 2 x `/24` subnets. Need to carefully plan the sizing of the new subnet with the AKS network rules + sizing of nodes and pods.

Node pool naming convention:
- OS on 1 character (`l` for Azure Linux, `w` for Windows, `u` for Ubuntu Linux)
- CPU architecture (`x86` for Intel/AMD x86_64 or `a64` for arm64), kept short (instead of e.g. `lin`)
- Max number of pod agents per node, prefixed by `n` (`n3`, `n24`, etc.) on 3 chars max

Examples:
- `lx86n3` => Azure Linux x86_64 nodes which can run 3 "normal" pod agents at the same time
- `lx86n4bom` => Azure Linux x86_64 nodes which can run 4 ("bom" only) pod agents at the same time
- `ua64n24bom` => Ubuntu Linux arm64 nodes which can run 24 ("bom" only) pod agents at the same time
- `la64n2side` => Linux arm64 nodes used to run 2 "side" pods (e.g. custom applications such as ACP)
- Windows nodes would use `w` (of course) plus the version (`2019`, `2022`, etc.)

Mapping from the current node pools (`cik8s` EKS cluster):
- `tiny_ondemand_linux` => will be a syspool following AKS good practices (HA, etc.). Should only host the Azure or AKS technical side-services, not ours (CSI, CNI, etc.)
- `default_linux_az1` => `la64n2app` (2 "app" pods per node: ACP and the Datadog cluster agent); could become `la64n3app` if we add falco or any other tool
- `spot_linux_4xlarge` => `lx86n3agt1` (Azure Linux node pool for agents number "1" supporting 3x pod agents), with a minimum of `0` and a maximum of `50` nodes (same as EKS)
- `spot_linux_4xlarge_bom` => `lx86n3bom1` (Azure Linux node pool number "1" for BOM only supporting 3x pod agents); same as `spot_linux_4xlarge` except taints to be added to ensure only `bom` builds are using this node pool
- `spot_linux_24xlarge_bom` => not retained
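To make the node pool mapping concrete, here is a minimal Terraform sketch (azurerm v3 attribute names) of the two agent pools. The VM size, taint key and cluster/subnet references are illustrative assumptions, not the actual jenkins-infra code; the cluster resource itself is sketched further below in the network section.

```hcl
# Illustrative only: x86_64 spot pool for "normal" agent pods, autoscaling 0..50 nodes.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3agt1" {
  name                  = "lx86n3agt1"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D16ads_v5" # placeholder x86_64 size
  os_sku                = "AzureLinux"
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 50 # same bounds as the EKS spot_linux_4xlarge pool
}

# Same shape, but tainted so that only BOM builds (which tolerate the taint) land here.
resource "azurerm_kubernetes_cluster_node_pool" "lx86n3bom1" {
  name                  = "lx86n3bom1"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D16ads_v5"
  os_sku                = "AzureLinux"
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 50
  node_taints           = ["jenkins/bom=true:NoSchedule"] # hypothetical taint key
}
```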
Network considerations:
We'll create a private cluster: https://learn.microsoft.com/en-us/azure/aks/private-clusters?tabs=azure-portal to ensure no external access to the API server is possible
The selected network mode will be "Azure CNI Overlay" as per https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#choosing-a-network-model-to-use
No inbound method is expected (we won't use an inbound LB)
The outbound method should be a "user-assigned NAT gateway", which will be the NAT gateway associated with the "public-sponsorship" network (same as the ci.jenkins.io VM and ACI agents)
IP addresses planning (ref. https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#ip-address-planning)
- `cik8s` was set up to handle a maximum of 117 nodes (102 without the experimental 24x node pool we won't add in AKS) with 30 pods per node max
- `eks-public` was set up to handle a maximum of 4 nodes with 15 pods max per node
- A `/24` subnet for nodes, allowing ~250 nodes max, is enough => if we hit a limit we can add more pods per node!
- The `/24` internal CIDR per node is good enough
- We'll use `10.50.0.0/24` to ensure no overlap with ANY of the peered networks. Note that a `/24` is mandatory
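A minimal Terraform sketch of the network choices above (private cluster, Azure CNI Overlay, outbound through the user-assigned NAT gateway). Resource names, the DNS prefix, the subnet reference and the pod CIDR are placeholders, not the real jenkins-infra values:

```hcl
# Illustrative only (azurerm v3 attribute names).
resource "azurerm_kubernetes_cluster" "cijenkinsio_agents_1" {
  name                    = "cijenkinsio-agents-1"
  location                = azurerm_resource_group.ci_agents.location
  resource_group_name     = azurerm_resource_group.ci_agents.name
  dns_prefix              = "cijenkinsioagents1"
  private_cluster_enabled = true       # no public API server endpoint
  sku_tier                = "Standard" # better control plane QoS than "Free"

  default_node_pool {
    name           = "syspool"
    vm_size        = "Standard_D4pds_v5"
    node_count     = 1
    vnet_subnet_id = azurerm_subnet.ci_agents.id # the new /24 subnet for nodes
  }

  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay"                # Azure CNI Overlay
    outbound_type       = "userAssignedNATGateway" # NAT gateway of the "public-sponsorship" network
    pod_cidr            = "100.64.0.0/16"          # placeholder: must not overlap any peered network
  }

  identity {
    type = "SystemAssigned"
  }
}
```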
Node sizing considerations:
- We need `x86_64` CPUs for the build agents; let's use ARM64 for the others
- `Standard_D4pds_v5` for the system pool: "System node pools require a VM SKU of at least 4 vCPUs and 4GB memory."
- `Standard_D4pds_v5` with ephemeral storage
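Following the same assumptions, an ARM64 "applications" pool using `Standard_D4pds_v5` with an ephemeral OS disk could look like this (counts and references are illustrative, not the real configuration):

```hcl
# Illustrative only: ARM64 pool for the "side"/"app" workloads, not for build agents.
resource "azurerm_kubernetes_cluster_node_pool" "la64n2app" {
  name                  = "la64n2app"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cijenkinsio_agents_1.id
  vm_size               = "Standard_D4pds_v5" # ARM64, 4 vCPUs / 16 GB, local temp disk
  os_sku                = "AzureLinux"
  os_disk_type          = "Ephemeral"         # ephemeral OS disk on the local SSD
  node_count            = 1
  vnet_subnet_id        = azurerm_subnet.ci_agents.id
}
```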
Update: first wave of PRs on the network part:
Update: network rules for the cluster (port `443`) => incoming PR
Update: the cluster is created after many retries:
A possible improvement (for `privatek8s` in the future for instance) would be to use an Azure Firewall instead of NSGs to control inbound/outbound requests, as described in https://learn.microsoft.com/en-us/azure/firewall/protect-azure-kubernetes-service

=> cluster is now created, with its node pools, and the Terraform project works as expected. Access works from ci.jenkins.io AND through VPN.
Next steps:
Update:
Update: added the service account tokens `ci.jenkins.io-agents-1-jenkins-agent-sa-token` and `ci.jenkins.io-agents-1-jenkins-agent-bom-sa-token`.
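For reference, a long-lived service account token like these can be declared with the Terraform kubernetes provider; this is only a sketch, and the namespace, account and secret names are assumptions rather than the actual jenkins-infra resources:

```hcl
# Hypothetical sketch: a service account plus an explicitly requested token secret
# (Kubernetes >= 1.24 no longer auto-creates token secrets for service accounts).
resource "kubernetes_service_account_v1" "jenkins_agent" {
  metadata {
    name      = "jenkins-agent"
    namespace = "jenkins-agents"
  }
}

resource "kubernetes_secret_v1" "jenkins_agent_sa_token" {
  metadata {
    name      = "jenkins-agent-sa-token"
    namespace = "jenkins-agents"
    annotations = {
      "kubernetes.io/service-account.name" = kubernetes_service_account_v1.jenkins_agent.metadata[0].name
    }
  }
  type = "kubernetes.io/service-account-token"
}
```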
Next step: puppet template for ci.jenkins.io to draft a basic new Kubernetes template (goal: validate we can spin up agents with the initial configuration)
Update: initial verification of the new AKS cluster worked successfully:
```groovy
node('maven-17-helpdesk-3954') {
    sh 'mvn -v'
}
```
Next step:
Update:
WiP: running a `master` build manually on the agents. Two issues found:
- `WARNING: invalid or unavailable artifact caching proxy provider 'azure-aks-internal' requested by the agent, will use repo.jenkins-ci.org` => pipeline-library code to update!
- `io.fabric8.kubernetes.client.KubernetesClientException: Received 403 on websocket. Failure executing: GET at: https://cijenkinsioagents1-<redacted>.azmk8s.io:443/api/v1/namespaces/jenkins-agents/pods?allowWatchBookmarks=true&watch=true. Message: Forbidden.` error message => gotta check kubeconfig AND the svc account token

Update: we are ready to roll! The pipeline library now handles the `*ks-internal` ACP.

Update: let's go live in production! cc @MarkEWaite @smerle33 for info
Update:
Beware that @Vlatombe found some issues affecting the `kubernetes` plugin on AKS relating (as I recall) to scalability issues with the API server.
Thanks for the reminder! I remember it to be when using a non-default jnlp container in pod agents. If that's the case then no problem, as we only use an all-in-one image.
Additionally, we have set up the cluster to use a better QoS on the control plane ("Standard" tier instead of "Free" tier), which was introduced recently (it was recommended by the Azure "clippy").
We will watch the behavior carefully given your warning!
> I remember it to be when using a non-default jnlp container in pod agents.

My recollection is that it affected all agents, not just those using `container`, and the problem was that this failed.
Good to know: it might have impacts on the BOM or in the long term. Since we're using a distinct cluster only for the agents, any issue would be confined to this scope. Hopefully we won't run into it 🤞 Worst case, we'll have to wait until August before it's gone (after that we won't have any more Azure credits to run containers in Azure: we'll switch back to a new AWS account and/or DigitalOcean).
After 3 days (and a BOM release proving that ci.jenkins.io works well with the new Kubernetes Linux agents), we can start decommissioning the former clusters `cik8s`, `eks-public`, `doks` and `doks-public` with the following steps:
Update:
- Stop managing these clusters: https://github.com/jenkins-infra/kubernetes-management/pull/5243
- Remove ci.jenkins.io configurations for these clusters: https://github.com/jenkins-infra/jenkins-infra/pull/3442
- Delete these clusters from clouds (jenkins-infra/aws and jenkins-infra/digitalocean)

=> also, forgot to disable monitors; reminded by @smerle33 and done in https://github.com/jenkins-infra/datadog/pull/250
Update: this issue is closable:
- Cleanup is finished, including the `doks-public` cluster (for ACP and PoC of the new update center)
- Confirmation that the AWS incurred costs are decreasing since 17 May 2024:
Service(s)
AWS, Azure, ci.jenkins.io, sponsors
Summary
Today, ci.jenkins.io utilizes 2 EKS clusters to spin up ephemeral agents (for plugin and BOM builds). These clusters are hosted in a CloudBees-sponsored account (historically used to host a lot of Jenkins services).
We want to move these clusters out of the CloudBees AWS account to ensure non-CloudBees Jenkins contributors can manage them, and to use credits from other sponsors, as AWS, DigitalOcean and Azure gave us credits to be used.
Initial working path (destination: AWS sponsored account)
Updated working path
As discussed during the 2 previous infra SIG meetings, we have around $28k of credits on the Azure sponsored account which expire at the end of August 2024 (it was May 2024 but @MarkEWaite asked for an extension of this deadline ❤️ ), while both the DigitalOcean and AWS (non-CloudBees) accounts have credits until January 2025.
=> As such, let's start by using a Kubernetes cluster in Azure (sponsored) AKS to use these credits until end of summer before moving to the new AWS account
Notes 📖
A few elements for planning these migrations:
This is a good opportunity to re-assess the naming convention we used for the jenkins-infra/aws project: `cik8s` and `eks-public` for instance...
The Terraform module for EKS has a major version upgrade (20.x) currently waiting: https://github.com/jenkins-infra/aws/pull/517 . It features breaking changes around the management of the EKS configmap. Using the new module version on a fresh new cluster would avoid a tedious migration of existing ones...
We have an upcoming Kubernetes 1.27 upgrade: it will most probably be applied to the AWS clusters beforehand, but we have to keep it in mind.
We'll have to define at least 2 different AWS providers in the Terraform project to allow management of both accounts at the same time: https://build5nines.com/terraform-deploy-to-multiple-aws-accounts-in-single-project/ (we already have this kind of pattern with Azure)
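A minimal sketch of that pattern with aliased providers (account IDs, role ARN and module paths are placeholders):

```hcl
# Historical CloudBees-sponsored account.
provider "aws" {
  alias  = "cloudbees"
  region = "us-east-2" # placeholder region
}

# New sponsored account, accessed through an assumed role.
provider "aws" {
  alias  = "sponsored"
  region = "us-east-2"
  assume_role {
    role_arn = "arn:aws:iam::210987654321:role/infra-admin" # placeholder ARN
  }
}

# Each module then selects its account explicitly.
module "cik8s" {
  source    = "./modules/eks-cluster" # hypothetical module path
  providers = { aws = aws.cloudbees }
}

module "new_eks_cluster" {
  source    = "./modules/eks-cluster"
  providers = { aws = aws.sponsored }
}
```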
Reproduction steps
No response