Closed dduportal closed 1 year ago
With this migration we'll be able to close https://github.com/jenkins-infra/helpdesk/issues/3209
Putting on hold: #2844 tracks the migration of release.ci from `prodpublick8s` to `privatek8s`, which must happen before proceeding forward here.
weekly.jenkins.io migrated.
We noticed that while LDAP wasn't accessible for Jenkins, it didn't render the HTML set as the welcome message.
Next steps: migrating the services relying on a PostgreSQL database.
To avoid any propagation of the network overlap, we need a flexible Postgres instance that the `public-vnet` network can access.
A new instance has to be created:
Alas, Azure Flexible servers do not support IPv6 virtual nets, so we'll have to find another solution. Current scenario is to create a dedicated virtual net, IPv4 only, and study the methods to access it privately.
As per https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-networking:
=> as we want a managed database with no public endpoint, it means we need a dedicated vnet + (delegated) subnet for the new Postgres Flexible instance. The only unknown is the peering behavior between IPv6 and IPv4 vnets.
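The dedicated vnet + delegated subnet described above could be sketched in Terraform roughly as follows; this is a hypothetical illustration (names, location, and CIDRs are assumptions, the real definitions live in jenkins-infra/azure-net):

```hcl
# Hypothetical sketch: dedicated IPv4-only vnet with a subnet delegated to
# the PostgreSQL Flexible Server service. Names and address ranges are
# assumptions, not the actual jenkins-infra values.
resource "azurerm_virtual_network" "public_db" {
  name                = "public-db-vnet"
  location            = "East US 2"
  resource_group_name = "public-db"
  address_space       = ["10.10.0.0/16"] # IPv4 only: Flexible Server does not support IPv6 vnets
}

resource "azurerm_subnet" "public_db" {
  name                 = "public-db-subnet"
  resource_group_name  = "public-db"
  virtual_network_name = azurerm_virtual_network.public_db.name
  address_prefixes     = ["10.10.1.0/24"]

  # Delegation required so the Flexible Server can inject itself into the subnet
  delegation {
    name = "flexible-server"
    service_delegation {
      name    = "Microsoft.DBforPostgreSQL/flexibleServers"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}
```

The subnet can then be peered with the existing `public` vnet to provide private access without a public endpoint.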
`public-db` instance is now working:
- reachable from both `private` and `public` vnets
- `public-db` managed by Terraform (and the postgres provider)
- reachable from the `public` network through the private VPN

Keycloak successfully migrated to the new cluster:
Removed the former ingress to confirm, and it looks good for both @smerle33 and me with the private VPN connected.
Next step: remove the migrated resources from `prodpublick8s` to be really sure that nothing runs on the former cluster (starting with Keycloak only, as it would only impact the infra team; we have to wait for an announcement before touching public services).
Update:
Keycloak cleanup:
- Removed Keycloak from `prodpublick8s` with success
- Removed its database from the `public` PgSQL instance with success
- Removed the `jenkins-weekly`, `javadoc`, `jenkinsisthewayio-redirector` and `wiki` namespaces from `prodpublick8s` (already migrated)
- Removed the (empty) namespace `archives` from `prodpublick8s`

Next candidates: Plugin Health Score and rating
Migration of plugin-health score done, in mob programming with the help of @smerle33 and @alecharp:
- On `prodpublick8s`: delete ingress, and check: no impact on plugins.jenkins.io (only on the static generation)
- Removed from `prodpublick8s`
- Database moved to `public-db`
- Application running on the `public` instance ✅ 🚀
Update:
- Cleaned up `plugin-health-score` resources (we have to do this for each migration): https://github.com/jenkins-infra/kubernetes-management/pull/4003
- `javadoc` and `wiki`: https://github.com/jenkins-infra/azure-net/pull/86
- `grafana` DNS records on both `jenkins.io` and `jenkins-ci.org` domains

Migration of the Incremental Publisher service (https://github.com/jenkins-infra/iep/tree/master/iep-009):
Migration of the Rating service:
- On `prodpublick8s`: removed the ingress manually to fail incoming requests
Migration of the Uplink service:
The main challenge is the database size: 88% of the available 85 GB storage is used on the current PostgreSQL Single Server:
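To make the storage pressure concrete, a bit of arithmetic on the figures quoted above (85 GB capacity, 88% used):

```python
# Storage headroom on the current PostgreSQL Single Server,
# using the capacity and usage figures from the note above.
capacity_gb = 85        # total available storage on the Single Server
used_fraction = 0.88    # 88% reported as used

used_gb = capacity_gb * used_fraction
free_gb = capacity_gb - used_gb
print(f"used: {used_gb:.1f} GB, free: {free_gb:.1f} GB")
```

Roughly 10 GB of headroom remains, which explains why a `VACUUM` and a move to a larger instance are on the table.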
Update:
- Prepared the `public-db` instance to sustain the `uplink` database

Update "Uplink", on the (new.)ci.jenkins.io:
- `/var/lib/jenkins/`: 478 GB available
- Connected to the old database with `psql --host=produplink.postgres.database.azure.com --username='uplinkadmin@produplink' --password uplink` and ran `VACUUM VERBOSE` in a screen named `uplink_vacuum`
- Checked access to the new database: `psql --host=public-db.postgres.database.azure.com --username=uplink --password uplink`
- Running `time pg_dump --host=produplink.postgres.database.azure.com --username='uplinkadmin@produplink' --password --dbname=uplink --compress=0 --jobs=8 --format=d --file=/var/lib/jenkins/uplink_$(date +%Y%m%d_%H%M%S)` in a screen named `uplink_dump`
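For reference, a directory-format dump like the one above can be loaded into the new `public-db` instance in parallel with `pg_restore`. A hedged sketch, assuming the target `uplink` database already exists (the `--no-owner` flag and the dump path placeholder are assumptions, not commands that were actually run):

```shell
# Restore the directory-format dump produced by the pg_dump above into
# the new public-db instance; --jobs mirrors the --jobs=8 used for the dump.
pg_restore \
  --host=public-db.postgres.database.azure.com \
  --username=uplink \
  --dbname=uplink \
  --format=d \
  --jobs=8 \
  --no-owner \
  /var/lib/jenkins/uplink_<timestamp>  # replace with the actual dump directory
```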
Update on `uplink`: as the "old" PostgreSQL is reachable from the new cluster and has too much data, I've opened https://github.com/jenkins-infra/helpdesk/issues/3609 for the database migration.
Also, it seems that the "old" database is a Postgres 10 instance while our `public-db` is version 13.
Gotta migrate uplink today, keeping the same database as before to avoid too many changes.
(updated) Plan for migrating Uplink:
- Use the `x86medium` `publick8s` node pool - https://github.com/jenkins-infra/kubernetes-management/pull/4025

Migration of uplink.jenkins.io completed, no service interruption.
Migration of the Reports service:
- [x] Deploy on `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4028
- [x] Update DNS (with `moved` block) - https://github.com/jenkins-infra/azure-net/pull/93
- [x] Delete the `reports` namespace in the `prodpublick8s` cluster

Migration of https://reports.jenkins.io completed, no service interruption.
Migration of accountapp service:
- [x] Database on `publick8s` already taken care of while migrating the Keycloak service
- [x] Deploy on `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4030
- [x] Update DNS (with `moved` block)
- [x] Imported the `accounts.jenkins-ci.org` CNAME record in Terraform: `terraform import 'azurerm_dns_cname_record.jenkinsciorg_target_public_publick8s["accounts"]' /subscriptions/<subscription-id>/resourceGroups/proddns_jenkinsci/providers/Microsoft.Network/dnsZones/jenkins-ci.org/CNAME/accounts`
- [x] Delete the `accountapp` namespace in the `prodpublick8s` cluster

Migration of https://accounts.jenkins.io completed, no service interruption.
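For a `terraform import` like the one above to adopt the record, a matching resource keyed on the record name must already exist in the configuration. A hypothetical sketch of what that resource could look like (the `for_each` set, TTL, and CNAME target are assumptions, not the real jenkins-infra code):

```hcl
# Hypothetical sketch of the resource targeted by the terraform import above.
# The for_each key "accounts" matches the ["accounts"] index in the import
# address; attribute values here are illustrative assumptions.
resource "azurerm_dns_cname_record" "jenkinsciorg_target_public_publick8s" {
  for_each = toset(["accounts"])

  name                = each.key
  zone_name           = "jenkins-ci.org"
  resource_group_name = "proddns_jenkinsci"
  ttl                 = 300
  record              = "public.publick8s.jenkins.io" # hypothetical target
}
```

Importing rather than recreating the record avoids any DNS downtime during the migration.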
Migration of the LDAP service:
- [x] Added the `publick8s-tier` subnet - https://github.com/jenkins-infra/azure-net/pull/96
- [x] Deploy on `publick8s` (with `nodeSelector`) and the new storage account credentials, but keeping the release on `prodpublick8s` until the end of the migration - https://github.com/jenkins-infra/kubernetes-management/pull/4042
- [x] Restored the data in the `slapd` container on `publick8s`; took a bit more than 5 minutes to complete: `service slapd stop && ./entrypoint/restore`
- [x] Managed the `ldap` A record as code (and manually reduced its TTL to 60s) - https://github.com/jenkins-infra/azure-net/pull/97
- [x] Redirected (with `moved` block) from publick8s to status.jenkins.io, as we need to stop writes on LDAP while performing the LDIF dump - https://github.com/jenkins-infra/azure-net/pull/98
- [x] Switched the `ldap` A record (no `moved` block as it's only an A record modification, not a transfer of block like for the CNAME ones) from `prodpublick8s` to `publick8s` - https://github.com/jenkins-infra/azure-net/pull/99
- [x] Scaled down the `ldap` release from `prodpublick8s` - https://github.com/jenkins-infra/kubernetes-management/pull/4053
- [x] Deleted the `ldap` service in the `prodpublick8s` cluster (scaled to 0)
- [x] Redirected back (with `moved` block) from status.jenkins.io to publick8s - https://github.com/jenkins-infra/azure-net/pull/102
- [x] Delete the `ldap` namespace in the `prodpublick8s` cluster

Update on the LDAP: https://github.com/jenkins-infra/azure/pull/385#issuecomment-1577251779
=> there are missing elements to allow managing storage accounts in some of the publick8s networks
All preliminary steps for the LDAP migration are completed, I'll proceed to the switch tomorrow.
Although the intended redirections from accounts.jenkins.io & accounts.jenkins-ci.org to status.jenkins.io didn’t work as expected (SAN cert issue?), the LDAP migration has been successfully completed, no service interruption.
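One way to investigate the suspected SAN certificate issue is to inspect the certificate actually served for the redirected hostnames. A sketch of such a check (hostnames taken from above; this is an illustrative diagnostic, not a command from the migration log):

```shell
# Show the Subject Alternative Names presented for accounts.jenkins.io.
# -servername matters: SNI selects which certificate the server returns.
echo | openssl s_client -connect accounts.jenkins.io:443 \
    -servername accounts.jenkins.io 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
```

If the returned SAN list doesn't include accounts.jenkins.io or accounts.jenkins-ci.org, browsers will reject the redirect target with a certificate error.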
Migration of mirrorbits (https://get.jenkins.io, https://mirrors.jenkins.io, https://mirrors.jenkins-ci.org, https://fallback.get.jenkins.io) service:
- [x] Deploy on the new Kubernetes cluster `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4062
- [x] Verify the new application is up and running (testing its ingress and checking logs)
- [x] Update DNS (with `moved` block) - https://github.com/jenkins-infra/azure-net/pull/104
- [x] Stop the application in the `prodpublick8s` cluster
- [ ] Delete the `mirrorbits` namespace in the `prodpublick8s` cluster

Migration of plugin-site-issues service:
- [x] Deploy on `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4064
- [x] Update DNS (with `moved` block) - https://github.com/jenkins-infra/azure-net/pull/105
- [x] Stop the application in the `prodpublick8s` cluster
- [ ] ~Delete the `plugin-site-issues` namespace in the `prodpublick8s` cluster~ plugin-site-issues is deployed in the `plugin-site` namespace; it will be deleted during the plugin-site migration later.

Migration of mirrorbits and plugin-site-issues completed, no service interruption.
As a precaution, we'll delete the `mirrorbits` namespace from `prodpublick8s` tomorrow or Wednesday.
Migration of plugin-site service:
- [x] Deploy on `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4067
- [x] Update `plugins.origin.jenkins.io` DNS (with `moved` block) - https://github.com/jenkins-infra/azure-net/pull/107
- [x] Stop the application in the `prodpublick8s` cluster
- [x] Delete the `plugin-site` namespace in the `prodpublick8s` cluster

Migration of jenkins.io service:
- [x] Deploy on `publick8s` (and add `nodeSelector`) - https://github.com/jenkins-infra/kubernetes-management/pull/4068
- [x] Update `www.origin.jenkins.io` DNS (with `moved` block) - https://github.com/jenkins-infra/azure-net/pull/108
- [x] Stop the application in the `prodpublick8s` cluster
- [x] Delete the namespace in the `prodpublick8s` cluster

plugin-site migration completed, no service interruption.
jenkins.io migration completed, no service interruption.
DNS cleanup:
- `publick.aks.jenkins.io` (i.e. `prodpublick8s` cluster): `52.167.253.43` (`prodpublick8s` public IP), to be replaced by `20.119.232.75` (`publick8s` public IP)
- `private.aks.jenkins.io` (i.e. `prodpublick8s` cluster): `10.0.2.5` (`prodpublick8s` private IP): need additional cleanup in:
As we've noticed quite a lot of remaining requests still being sent to `mirrorbits` on `prodpublick8s`, we'll postpone the cluster deletion to next week, and @dduportal will look into publishing a blog post announcing the migration of this service to the new cluster.
Namespaces removal: @lemeurherve and I paired and removed the following namespaces from `prodpublick8s`:
- `datadog` (causing https://github.com/jenkins-infra/datadog/pull/193 😅)
- `cert-manager`
- `private-nginx-ingress`

Remaining namespaces are required until https://github.com/jenkins-infra/helpdesk/issues/3351#issuecomment-1591672053 is fixed.
As discussed during the last infrastructure meeting:
- We'll delete the `prodpublick8s` cluster on Tuesday 27 June 2023, including the former mirror service and the former IPv4

Update:
- `prodpublick8s` destroyed with its resource group
- Additional monitors added: https://github.com/jenkins-infra/datadog/pull/195
Potential improvements for later:
After 3 years and 27 days of good and faithful service, `prodpublick8s` is no more; closing this issue 🤗
This issue tracks the work for spawning a new "public" AKS cluster for production to replace the former `prodpublick8s`.

Goals:
- Replace `prodpublick8s`

This issue is the "twin" of https://github.com/jenkins-infra/helpdesk/issues/2844 but for the public network.
Some notes:
- Same kind of setup as for `privatek8s`
- Only reference to the `highmem` string in the context of Kubernetes in https://github.com/search?q=org%3Ajenkins-infra+highmem&type=code is https://github.com/jenkins-infra/release/blob/902df23d5657c074f184d8e08d1d00d7e3e67c95/PodTemplates.d/release-linux.yaml#L44
- Each `agentpool` for `prodpublick8s` should have an equivalent (in terms of VM size, disk size and autoscaling limits) in the new cluster
- A dedicated system node pool like `privatek8s`, as a good AKS practice (ref. https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli and https://learn.microsoft.com/en-us/azure/aks/use-multiple-node-pools)
- Egress (`outboundType`) type to use: https://learn.microsoft.com/en-us/azure/aks/egress-outboundtype
- Services to migrate from `prodpublick8s` to this cluster: