Closed smerle33 closed 10 months ago
as per https://github.com/jenkins-infra/kubernetes-management/pull/4310 wiki is now on the ARM64 nodepool
Definitive candidate list (WIP, need to merge with pods list):
working on https://github.com/jenkins-infra/helpdesk/issues/3719 before going further to avoid service interruption during migration
PluginHealthScoring is now on arm64
rating has been migrated to ARM64 successfully
private nginx ingress has been migrated to ARM64 successfully
Update:
Migration was successful. (responsible for a majority of our public facing services, the stakes were high)
The migration was planned, announced and tried, but was rolled back because the mirrorbits arm64 image contained incompatible binaries:
exec /bin/tini: exec format error
tini has since been fixed with https://github.com/jenkins-infra/docker-mirrorbits/pull/17, but the mirrorbits binary is still incompatible. I've opened a PoC to build it for the arm64 architecture in https://github.com/jenkins-infra/docker-mirrorbits/pull/18, but this remains a PoC: ultimately we'll need it to come from the original repo, or a fork.
For now we're postponing the mirrorbits migration to arm64.
Notes:
We've also encountered an architecture issue with httpd in the mirrorbits-file pod, which is more surprising as it should be, and is, compatible with arm64:
exec /usr/local/bin/httpd-foreground: exec format error
Let's wait for the mirrorbits arm64 image to look at it again.
The migration release got stuck with its pods in pending (confirmed with helm -n mirrorbits ls); we (Damien) had to roll it back manually (helm -n mirrorbits rollback mirrorbits).
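An exec format error like the ones reported above usually means the binary was built for a different CPU architecture than the node it runs on. As a minimal sketch (the helper names are illustrative, not part of any tool used in this thread), the target architecture of an ELF binary can be read from its header, for example after copying the file out of the image:

```python
import struct

# Subset of e_machine values from the ELF specification.
ELF_MACHINES = {0x03: "x86", 0x28: "arm", 0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_ident and e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown(0x{machine:x})")


def binary_arch(path: str) -> str:
    """Read the ELF header of the file at `path` and name its architecture."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

Calling binary_arch on a copy of /bin/tini or the mirrorbits binary would show whether it is an x86_64 build shipped inside an arm64 image.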
The project hasn't been built since 2019 and isn't configured on ci.jenkins.io anymore; I'm restoring it by adding it to the "infra" folder.
The build on the primary branch then fails with the same error as the one I got when building it locally: https://ci.jenkins.io/job/Infra/job/uplink/job/master/1/console
The NODE_ENV variable is not set. Defaulting to a blank string.
The DEBUG variable is not set. Defaulting to a blank string.
The SENTRY_DSN variable is not set. Defaulting to a blank string.
Creating network "infrauplinkmaster_default" with the default driver
Pulling db (postgres:alpine)...
alpine: Pulling from library/postgres
Digest: sha256:acf5271bbecd4b8733f4e93959a8d2b536a57aeee6cc4b6a71890aaf646425b8
Status: Downloaded newer image for postgres:alpine
Creating infrauplinkmaster_db_1 ... done
>> waiting a moment to make sure the database comes online..
./tools/docker-compose run --rm node \
/usr/local/bin/node /home/jenkins/agent/workspace/Infra_uplink_master/node_modules/sequelize-cli/lib/sequelize db:migrate && \
./tools/docker-compose run --rm node \
/usr/local/bin/node /home/jenkins/agent/workspace/Infra_uplink_master/node_modules/sequelize-cli/lib/sequelize db:seed:all
The NODE_ENV variable is not set. Defaulting to a blank string.
The DEBUG variable is not set. Defaulting to a blank string.
The SENTRY_DSN variable is not set. Defaulting to a blank string.
Starting infrauplinkmaster_db_1 ... done
Sequelize CLI [Node: 10.24.1, CLI: 4.1.1, ORM: 4.38.0]
Loaded configuration file "config/database.js".
Using environment "development".
Fri, 27 Oct 2023 09:54:41 GMT sequelize deprecated String based operators are now deprecated. Please use Symbol based operators for better security, read more at http://docs.sequelizejs.com/manual/tutorial/querying.html#operators at node_modules/sequelize/lib/sequelize.js:242:13
/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:441
throw new Error('Unknown authenticationOk message type' + util.inspect(msg))
^
Error: Unknown authenticationOk message typeMessage { name: 'authenticationOk', length: 23 }
at Connection.parseR (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:441:9)
at Connection.parseMessage (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:357:19)
at Socket.<anonymous> (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:119:22)
at Socket.emit (events.js:198:13)
at addChunk (_stream_readable.js:288:12)
at readableAddChunk (_stream_readable.js:269:11)
at Socket.Readable.push (_stream_readable.js:224:10)
at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
make: *** [Makefile:62: migrate] Error 1
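One plausible reading of this stack trace, not confirmed in this thread, is that the freshly pulled postgres:alpine server asks for SCRAM-SHA-256 authentication, which older releases of the pg driver do not understand. The reported length of 23 is consistent with that: in the PostgreSQL wire protocol, an AuthenticationSASL message advertising only SCRAM-SHA-256 is exactly 23 bytes long. A minimal sketch of that arithmetic (the function name is illustrative):

```python
import struct

AUTH_SASL = 10  # AuthenticationSASL sub-type in the PostgreSQL protocol


def authentication_sasl_message(mechanisms):
    """Build an AuthenticationSASL message, without the leading 'R' type byte."""
    payload = struct.pack("!I", AUTH_SASL)
    for name in mechanisms:
        payload += name.encode("ascii") + b"\x00"  # NUL-terminated name
    payload += b"\x00"  # an empty string terminates the mechanism list
    # The length field counts itself (4 bytes) plus the payload.
    return struct.pack("!I", 4 + len(payload)) + payload


msg = authentication_sasl_message(["SCRAM-SHA-256"])
length, auth_type = struct.unpack_from("!II", msg)
print(length, auth_type)  # 23 10
```

If that reading is right, upgrading the pg dependency or pinning an older postgres image would be the two obvious options to try.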
At least it's running with make run (tested locally):
Update:
The project is built again, on ci.jenkins.io and infra.ci.jenkins.io, with deterministic builds (npm ci), and deployed with success:
The arm64 image has been published:
A status incident has been opened to announce its migration today: https://status.jenkins.io/issues/2023-10-31-uplink-arm64-migration/
Note: the same procedure as in https://github.com/jenkins-infra/kubernetes-management/pull/4607#issuecomment-1786905267 is needed to verify the deployment: trigger a new event, check its presence in the uplink UI, and download an export.
Update:
After fixing the chart to use named tags instead of a sha256 reference for the Docker image (the sha256 reference prevented the deployment of the arm64 architecture variant), uplink has been successfully migrated to arm64:
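To illustrate why a sha256 reference can block the arm64 variant: a multi-arch tag points to an image index from which the runtime picks the manifest matching the node architecture, whereas a digest of a single platform manifest is used as-is. The sketch below assumes the chart pinned the digest of the amd64 manifest; the index content, digests and function names are hypothetical placeholders.

```python
# Hypothetical image index behind a multi-arch tag; the digests are placeholders.
IMAGE_INDEX = {
    "manifests": [
        {"digest": "sha256:amd64-placeholder",
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:arm64-placeholder",
         "platform": {"os": "linux", "architecture": "arm64"}},
    ]
}


def resolve_tag(index, node_arch):
    """Pick the manifest matching the node architecture, as done for a tag."""
    for manifest in index["manifests"]:
        if manifest["platform"]["architecture"] == node_arch:
            return manifest["digest"]
    raise LookupError(f"no image variant for {node_arch}")


def resolve_pinned(pinned_digest, node_arch):
    """A pinned platform digest is used unchanged; node_arch is ignored."""
    return pinned_digest


print(resolve_tag(IMAGE_INDEX, "arm64"))
print(resolve_pinned("sha256:amd64-placeholder", "arm64"))
```

On an arm64 node, the tag resolves to the arm64 manifest while the pinned digest still yields the amd64 one, which would then fail at startup.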
TODO:
Update:
PR opened to adapt the updatecli manifest:
I opened a PoC for running mirrorbits on arm64 before:
Now that there is progress in etix/mirrorbits repository, I've opened a feature request to get arm64 binaries from the official repo:
Update: datadog-cluster-agent & cert-manager migrated to arm64.
Update
Migrate the container publication from trusted.ci.jenkins.io to infra.ci.jenkins.io:
buildDockerAndPublishImage in Jenkinsfile (https://github.com/jenkins-infra/plugin-site-api/pull/142)
Update
Arm64 image published from infra.ci.jenkins.io, chart updated, ready to migrate.
Arm64 image published, ready to migrate.
We can proceed to plugin-site components migration to arm64.
Then we'll migrate weekly.ci.jenkins.io to arm64, the corresponding arm64 image is already published.
The latest plugin-site helm chart version, which includes the tag with an arm64 variant, is deployed on the publick8s cluster. Only the helmfile release changes remain to migrate its components to arm64.
plugin-site and plugin-site-issues migration to arm64 done.
Post mortem of the plugin-site and plugin-site-issues migration:
I forgot to update the chart versions in the PR: https://github.com/jenkins-infra/kubernetes-management/pull/4683
I created the PR https://github.com/jenkins-infra/kubernetes-management/pull/4684 to fix it, but helm locked the update with: Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
The solution was to roll back:
first we list the releases: helm ls --namespace plugin-site
then we roll back: helm rollback --namespace plugin-site plugin-site
and helm rollback --namespace plugin-site plugin-site-issues
we can then launch a new kubernetes-management build on infra.ci.jenkins.io
To avoid this kind of problem, I think I need to first check the open PRs related to this release (we had https://github.com/jenkins-infra/kubernetes-management/pull/4671) and to plan the migration better, so that more than one person can focus on and check the PR.
This had no impact on production, as Kubernetes held back the upgrade because it was not successful.
Before closing this PR, the following services will have to be migrated to arm64:
httpd (in both mirrorbits and mirrorbits-parent releases)
rsyncd (in mirrorbits-parent release)
weekly.ci.jenkins.io
Next step after this issue (to be continued and detailed):
mirrorbits-parent : httpd and rsyncd are now on arm64:
The first attempt to move weekly.ci.jenkins.io to arm64 was a failure, but probably due to my impatience.
The next attempt will involve, before merging the PR:
* manually scaling the arm node pool up by one node
* manually changing the statefulset to downsize to 0 to help the volume migration
More testing and investigation led us to discover a zone incompatibility. ARM VMs for the node pool are only available in zone 1 (eastus2-1), while our other node pools are located in zone 3 (eastus2-3). The problem lies with the volumes, which cannot be mounted from one zone to another. We decided on the following plan:
We created a new storage class (https://github.com/jenkins-infra/azure/pull/526/) to use ZRS storage, and used a temporary PV/PVC on this volume as the source for the migration.
There appeared to be a bug with the CSI volume clone: it failed with the following error on the new PVC created by the cloning system:
Warning ProvisioningFailed 23s (x6 over 54s) disk.csi.azure.com_csi-azuredisk-controller-68cfbf9cc6-vknhs_e4bc448f-ce5d-4e2c-9af5-2f47223a8443 failed to provision volume with StorageClass "managed-csi-premium-zrs-retain": rpc error: code = Internal desc = sourceResourceID(/subscriptions/redacted/resourceGroups/mc_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks//subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap) is invalid, correct format: .*/subscriptions/(?:.*)/resourceGroups/(?:.*)/providers/Microsoft.Compute/disks/(.+)
We changed the volumeHandle of the source PV from /subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap to jenkins-weekly-snap (we exploited the CSI clone bug).
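The error text suggests that the clone source ID is built by prefixing the source PV's volumeHandle with the resource group's disk path, which doubles the prefix when the handle is already a full resource ID. The following is only a sketch of that reading: the prefixing behaviour and both function names are assumptions inferred from the error message, not taken from the driver's code.

```python
# Disk path as it appears in the error message (subscription redacted there too).
DISK_PREFIX = (
    "/subscriptions/redacted/resourceGroups/"
    "mc_publick8s_publick8s-endless-ghoul_eastus2"
    "/providers/Microsoft.Compute/disks/"
)


def clone_source_id(volume_handle: str) -> str:
    """Assumed behaviour: always prepend the disk path to the handle."""
    return DISK_PREFIX + volume_handle


def looks_like_single_disk_id(resource_id: str) -> bool:
    """True when the ID contains exactly one '/subscriptions/' segment."""
    return resource_id.lower().count("/subscriptions/") == 1


full_handle = (
    "/subscriptions/redacted/resourcegroups/"
    "MC_publick8s_publick8s-endless-ghoul_eastus2"
    "/providers/Microsoft.Compute/disks/jenkins-weekly-snap"
)
print(looks_like_single_disk_id(clone_source_id(full_handle)))            # False
print(looks_like_single_disk_id(clone_source_id("jenkins-weekly-snap")))  # True
```

Under that assumption, using the bare disk name as the volumeHandle yields a well-formed source ID, which matches what we observed.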
Everything went well: WEEKLY.CI.JENKINS.IO now runs on ARM64 🚀
Service(s)
Azure
Summary
Work in progress: determining candidates to migrate on the arm node pool
existing deployments on publick8s :
Progress
https://github.com/jenkins-infra/helpdesk/issues/3619#issuecomment-1713522131