Closed dduportal closed 1 year ago
Updating kubectl:
Since we disabled doks because of the DigitalOcean outage on 21 June 2023, we are taking the opportunity to upgrade both DigitalOcean clusters to 1.25 before putting them back into use.
Task list for both DigitalOcean clusters
[x] No need to announce publicly as we already have an open incident
[x] Changelog DO-specific: https://docs.digitalocean.com/products/kubernetes/details/changelog/. Other than the (expected) component upgrades, 2 major points:
[x] Preparation based on the changelog:
Deprecated APIs (reported by pluto) are only on DO-internal components (cilium-operator and kube-dns).
Recycling the worker nodes upgrades to new API: no action needed as the cluster upgrade will take care of it
[x] Upgrade both clusters to use HA control planes - https://github.com/jenkins-infra/digitalocean/pull/117
[x] Upgrade doks using Terraform - https://github.com/jenkins-infra/digitalocean/pull/121
[x] Upgrade doks-public using Terraform - https://github.com/jenkins-infra/digitalocean/pull/119
[x] Enable kubernetes-management and check it's working - https://github.com/jenkins-infra/kubernetes-management/pull/4099
[x] Add doks back to ci.jenkins.io and test a pod with ACP - https://github.com/jenkins-infra/jenkins-infra/pull/2918
[x] Close status and report here - https://github.com/jenkins-infra/status/pull/330
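For reference, the kind of check done with pluto in the preparation step can be approximated with a few lines of Python. This is only a sketch and not how pluto is implemented: it assumes the manifests are already loaded as dictionaries, the `find_removed_apis` helper and the sample manifests are hypothetical, and the table lists the API versions that the upstream Kubernetes deprecation guide marks as removed in 1.25.

```python
# API versions removed in Kubernetes 1.25 (per the upstream deprecated API
# migration guide), mapped to their replacement when one exists.
REMOVED_IN_1_25 = {
    ("batch/v1beta1", "CronJob"): "batch/v1",
    ("discovery.k8s.io/v1beta1", "EndpointSlice"): "discovery.k8s.io/v1",
    ("events.k8s.io/v1beta1", "Event"): "events.k8s.io/v1",
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): "autoscaling/v2",
    ("policy/v1beta1", "PodDisruptionBudget"): "policy/v1",
    # PodSecurityPolicy has no replacement API (Pod Security Admission instead).
    ("policy/v1beta1", "PodSecurityPolicy"): None,
    ("node.k8s.io/v1beta1", "RuntimeClass"): "node.k8s.io/v1",
}


def find_removed_apis(manifests):
    """Return (name, apiVersion, kind, replacement) for each manifest
    that uses an API version removed in Kubernetes 1.25."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REMOVED_IN_1_25:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], key[1], REMOVED_IN_1_25[key]))
    return findings


# Hypothetical manifests, for illustration only.
SAMPLE_MANIFESTS = [
    {"apiVersion": "policy/v1beta1", "kind": "PodSecurityPolicy",
     "metadata": {"name": "restricted"}},
    {"apiVersion": "apps/v1", "kind": "Deployment",
     "metadata": {"name": "web"}},
    {"apiVersion": "discovery.k8s.io/v1beta1", "kind": "EndpointSlice",
     "metadata": {"name": "svc-abc"}},
]

if __name__ == "__main__":
    for name, api_version, kind, replacement in find_removed_apis(SAMPLE_MANIFESTS):
        target = replacement or "no replacement API"
        print(f"{kind}/{name}: {api_version} is removed in 1.25 ({target})")
```

Anything such a check flags outside provider-managed components would need a chart or manifest update before the control plane upgrade.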
Next step: upgrade of the AWS EKS clusters (including upgrade of components)
EndpointSlice API changes require https://github.com/jenkins-infra/kubernetes-management/pull/3838 to be done
cik8s in ci.jenkins.io - https://github.com/jenkins-infra/jenkins-infra/pull/2949
cik8s and eks-public from kubernetes-management - https://github.com/jenkins-infra/kubernetes-management/pull/4135
cik8s to 1.25.x - https://github.com/jenkins-infra/aws/pull/417
eks-public to 1.25.x - https://github.com/jenkins-infra/aws/pull/417
cik8s and eks-public in kubernetes-management - https://github.com/jenkins-infra/kubernetes-management/pull/4136
PSP while not existing in the API anymore 🤦 ). Solved by removing the eks namespace in both clusters and letting the jenkins-infra/kubernetes-management job recreate the releases.
cik8s
eks-public
eks-public
cik8s in ci.jenkins.io - https://github.com/jenkins-infra/jenkins-infra/pull/2951
AKS upgrade:
[x] Check AKS changelog https://github.com/Azure/AKS/blob/master/CHANGELOG.md (searched for the 1.25 string), the following notable changes:
AKS begins pod security policy deprecation on 2022-11-01 API. The pod security policy will be removed completely on 2023-06-01 API with AKS 1.25 version or higher. You can migrate pod security policy to pod security admission controller before the deprecation deadline.
Starting with Kubernetes 1.25, the host VM operating system will be Ubuntu 22.04 for Intel and ARM64 architectures
Windows Server 2022 will be the default Windows host. Important, old windows 2019 containers will not work on windows server 2022 hosts.
✅ https://github.com/jenkins-infra/azure/blob/7be8bfbbb101c5bb591b258eb0418fb478d56767/privatek8s.tf#L132 fixed to Windows2019
LGTM
Updated Calico to v3.23.3 when Kubernetes version is greater than or equal to v1.25.0.
=> ✅ https://registry.terraform.io/providers/hashicorp/azurerm/3.64.0/docs/resources/kubernetes_cluster#network_policy we do not specify any network policy, so no Calico (we should...)
Starting Kubernetes v1.25 two in-tree driver persistent volumes won't be supported in AKS : kubernetes.io/azure-disk, kubernetes.io/azure-file.
Java/JDK support for cgroups v2 is available in JDK 11 (patch 11.0.16 and later) or JDK 15 and above. AKS Kubernetes 1.25+ uses cgroups v2. Please migrate your workloads to the new JDK.
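To check workloads against the JDK note above, a small helper can apply the rule exactly as the changelog states it (JDK 11 from 11.0.16, or JDK 15 and above). This is a sketch: the `supports_cgroups_v2` function name is hypothetical, the version parsing is deliberately simple, and it knows nothing about vendor-specific backports.

```python
import re


def supports_cgroups_v2(jdk_version: str) -> bool:
    """Apply the AKS changelog rule: cgroups v2 support is available in
    JDK 11 (11.0.16 and later) or in JDK 15 and above."""
    numbers = [int(n) for n in re.findall(r"\d+", jdk_version)]
    if not numbers:
        raise ValueError(f"no version number found in {jdk_version!r}")
    # Legacy scheme such as "1.8.0_372": the feature release is the second number.
    if numbers[0] == 1 and len(numbers) > 1:
        numbers = numbers[1:]
    feature = numbers[0]
    if feature >= 15:
        return True
    if feature == 11:
        # Pad to (feature, interim, update) so that "11" compares as 11.0.0.
        padded = (numbers + [0, 0])[:3]
        return tuple(padded) >= (11, 0, 16)
    return False


if __name__ == "__main__":
    for version in ("1.8.0_372", "11.0.15", "11.0.16", "11.0.19+7", "14", "17.0.7"):
        status = "ok" if supports_cgroups_v2(version) else "needs a newer JDK"
        print(f"{version}: {status}")
```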
publick8s and privatek8s from kubernetes-management - https://github.com/jenkins-infra/kubernetes-management/pull/4149
privatek8s to 1.25.x - https://github.com/jenkins-infra/azure/pull/427
publick8s to 1.25.x - https://github.com/jenkins-infra/azure/pull/428
kubenet and the real subnet (ref. https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#overview-of-kubenet-networking-with-your-own-subnet).
publick8s and eks-privatek8s in kubernetes-management - https://github.com/jenkins-infra/kubernetes-management/pull/4152
There is an outage: http://get.jenkins.io/ at 13:29:49 UTC, Friday 7 July 2023. Okay, I see you are aware of the outage. Good luck, I hope you can fix and recover it without extreme stress. Thank you!
All the public services should be back. We are working on finishing the 1.25 post upgrade steps and we'll publish a post-mortem next week.
Sub-tasks left before closing this issue:
arm64 nodepool - https://github.com/jenkins-infra/kubernetes-management/pull/4160
infraciadmin service account used to administrate clusters (instead of a hidden script on my laptop...) - Tracked in https://github.com/jenkins-infra/helpdesk/issues/3679
which transitively removes the "automatic" resource group where the public_ip must be
not true, set this label:
service.beta.kubernetes.io/azure-load-balancer-resource-group: myNetworkResourceGroup
https://learn.microsoft.com/en-us/azure/aks/static-ip#create-a-service-using-the-static-ip-address
Thanks! We'll create a test IP in a resource group to check if we can safely move IPs to another resource group without recreating them, then we'll move prod IPs in a dedicated resource group (instead of the cluster node resource group) and add the label to the concerned services.
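As a rough sketch of what the suggestion above could look like on a Service: the key goes under `metadata.annotations` (the linked AKS documentation sets it as an annotation rather than a label). The service name, selector, port and the `203.0.113.10` address are placeholders, `static_ip_service` is a hypothetical helper, and `spec.loadBalancerIP` is assumed here as the way to pin the address (it is deprecated in recent Kubernetes releases, so the linked page should be checked for the current recommendation). `myNetworkResourceGroup` is the example value from the comment.

```python
import json


def static_ip_service(name, public_ip, ip_resource_group, port=80):
    """Build a LoadBalancer Service manifest whose public IP lives in a
    resource group other than the cluster node resource group."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Tells the Azure cloud provider where to look up the public IP.
                "service.beta.kubernetes.io/azure-load-balancer-resource-group": ip_resource_group,
            },
        },
        "spec": {
            "type": "LoadBalancer",
            # Assumption: the static address is pinned with loadBalancerIP.
            "loadBalancerIP": public_ip,
            "ports": [{"port": port}],
            "selector": {"app": name},
        },
    }


if __name__ == "__main__":
    manifest = static_ip_service("example-svc", "203.0.113.10", "myNetworkResourceGroup")
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```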
😱 Forgot the 1.25 logo:
Ref. https://kubernetes.io/blog/2022/08/23/kubernetes-v1-25-release/
Previous upgrade: https://github.com/jenkins-infra/helpdesk/issues/3387