jenkins-infra / helpdesk

Proposal for application in publick8s to migrate to arm64 #3619

Closed by smerle33 10 months ago

smerle33 commented 1 year ago

Service(s)

Azure

Summary

Work in progress: determining candidates to migrate to the arm64 node pool

Existing deployments on publick8s:

Progress

https://github.com/jenkins-infra/helpdesk/issues/3619#issuecomment-1713522131

smerle33 commented 1 year ago

as per https://github.com/jenkins-infra/kubernetes-management/pull/4310 wiki is now on the ARM64 nodepool

smerle33 commented 1 year ago

Definitive candidate list (WIP, need to merge with pods list):

smerle33 commented 1 year ago

working on https://github.com/jenkins-infra/helpdesk/issues/3719 before going further to avoid service interruption during migration

smerle33 commented 1 year ago

PluginHealthScoring is now on arm64

smerle33 commented 1 year ago

rating has been migrated to ARM64 successfully

smerle33 commented 1 year ago

private nginx ingress has been migrated to ARM64 successfully

lemeurherve commented 11 months ago

Update:

public-nginx-ingress

Migration was successful. The stakes were high, as this ingress is responsible for the majority of our public-facing services.

mirrorbits

The migration was planned, announced and attempted, but it was rolled back because the mirrorbits arm64 image contained incompatible binaries:

exec /bin/tini: exec format error

tini has since been fixed with https://github.com/jenkins-infra/docker-mirrorbits/pull/17, but the mirrorbits binary is still incompatible. I've opened a POC to build it for the arm64 architecture in https://github.com/jenkins-infra/docker-mirrorbits/pull/18, but this remains a POC: ultimately the binary will need to come from the original repo, or from a fork.

For now we're postponing the mirrorbits migration to arm64.
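
An "exec format error" usually means the kernel was asked to run a binary built for another CPU architecture. As a minimal sketch (not the check used in this thread), the target architecture can be read from the ELF header of a binary extracted from the image. The function names below are hypothetical; the e_machine values come from the ELF specification.

```python
import struct

# e_machine values defined by the ELF specification
ELF_MACHINES = {3: "x86", 40: "arm", 62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture recorded in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def binary_arch(path: str) -> str:
    """Read the ELF header of the file at `path` and return its architecture."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

Calling binary_arch on a copy of /bin/tini or the mirrorbits binary taken from the arm64 image would be expected to return "aarch64" for a correctly built image.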

Notes:

uplink:

The project hasn't been built since 2019 and is no longer configured on ci.jenkins.io, so I'm restoring it by adding it to the "infra" folder.

Screenshot: image

Then the build on the primary branch fails with the same error I got when building it locally: https://ci.jenkins.io/job/Infra/job/uplink/job/master/1/console

 The NODE_ENV variable is not set. Defaulting to a blank string.
 The DEBUG variable is not set. Defaulting to a blank string.
 The SENTRY_DSN variable is not set. Defaulting to a blank string.
 Creating network "infrauplinkmaster_default" with the default driver
 Pulling db (postgres:alpine)...
 alpine: Pulling from library/postgres
 Digest: sha256:acf5271bbecd4b8733f4e93959a8d2b536a57aeee6cc4b6a71890aaf646425b8
 Status: Downloaded newer image for postgres:alpine
 Creating infrauplinkmaster_db_1 ... 
 
Creating infrauplinkmaster_db_1 ... done
>> waiting a moment to make sure the database comes online..
 ./tools/docker-compose run --rm node \
    /usr/local/bin/node /home/jenkins/agent/workspace/Infra_uplink_master/node_modules/sequelize-cli/lib/sequelize db:migrate && \
 ./tools/docker-compose run --rm node \
    /usr/local/bin/node /home/jenkins/agent/workspace/Infra_uplink_master/node_modules/sequelize-cli/lib/sequelize db:seed:all
 The NODE_ENV variable is not set. Defaulting to a blank string.
 The DEBUG variable is not set. Defaulting to a blank string.
 The SENTRY_DSN variable is not set. Defaulting to a blank string.
 Starting infrauplinkmaster_db_1 ... 
 
Starting infrauplinkmaster_db_1 ... done

 Sequelize CLI [Node: 10.24.1, CLI: 4.1.1, ORM: 4.38.0]

 Loaded configuration file "config/database.js".
 Using environment "development".
 Fri, 27 Oct 2023 09:54:41 GMT sequelize deprecated String based operators are now deprecated. Please use Symbol based operators for better security, read more at http://docs.sequelizejs.com/manual/tutorial/querying.html#operators at node_modules/sequelize/lib/sequelize.js:242:13
 /home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:441
   throw new Error('Unknown authenticationOk message type' + util.inspect(msg))
   ^

 Error: Unknown authenticationOk message typeMessage { name: 'authenticationOk', length: 23 }
     at Connection.parseR (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:441:9)
     at Connection.parseMessage (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:357:19)
     at Socket.<anonymous> (/home/jenkins/agent/workspace/Infra_uplink_master/node_modules/pg/lib/connection.js:119:22)
     at Socket.emit (events.js:198:13)
     at addChunk (_stream_readable.js:288:12)
     at readableAddChunk (_stream_readable.js:269:11)
     at Socket.Readable.push (_stream_readable.js:224:10)
     at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
 make: *** [Makefile:62: migrate] Error 1

At least it's running with make run (tested locally):

Screenshot: image

lemeurherve commented 11 months ago

Update:

uplink

The project is built again, on ci.jenkins.io and infra.ci.jenkins.io, with deterministic builds (npm ci), and deployed with success:

The arm64 image has been published:

A status incident has been opened to announce its migration today: https://status.jenkins.io/issues/2023-10-31-uplink-arm64-migration/

Note: the same procedure as in https://github.com/jenkins-infra/kubernetes-management/pull/4607#issuecomment-1786905267 is needed to verify the deployment: trigger a new event, check its presence in the uplink UI, and download an export.

lemeurherve commented 11 months ago

Update:

uplink

After fixing the chart to use named tags instead of a sha256 reference for the Docker image (the sha256 reference prevented the arm64 architecture variant from being deployed), uplink has been successfully migrated to arm64.
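
A digest identifies one exact manifest, so a chart pinned to the digest of a single-architecture image cannot resolve to another architecture, whereas a tag can point at a multi-arch index. The thread does not say which kind of manifest the sha256 reference pointed at, so this is an assumption. A minimal sketch for spotting digest-pinned references in chart values follows; the helper names and the image names in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into repository, tag and digest."""
    name, _, digest = ref.partition("@")
    # A colon after the last "/" separates the tag;
    # a colon before it belongs to a registry port.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        repository, tag = name[:colon], name[colon + 1:]
    else:
        repository, tag = name, None
    return ImageRef(repository, tag, digest or None)


def is_digest_pinned(ref: str) -> bool:
    """True when the reference selects one exact manifest by digest."""
    return parse_image_ref(ref).digest is not None
```

For example, is_digest_pinned("example/uplink:1.2.3") is False, while is_digest_pinned("example/uplink@sha256:abc") is True.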

TODO:

lemeurherve commented 11 months ago

Update:

uplink

PR opened to adapt the updatecli manifest:

mirrorbits

I opened a PoC for running mirrorbits on arm64 before:

Now that there is progress in etix/mirrorbits repository, I've opened a feature request to get arm64 binaries from the official repo:

lemeurherve commented 11 months ago

Update: datadog-cluster-agent & cert-manager migrated to arm64.

lemeurherve commented 11 months ago

Update

plugin-site-api

Migrate the container publication from trusted.ci.jenkins.io to infra.ci.jenkins.io:

lemeurherve commented 11 months ago

Update

plugin-site-api

Arm64 image published from infra.ci.jenkins.io, chart updated, ready to migrate.

plugin-site-issues

Arm64 image published, ready to migrate.

Next steps

We can proceed with migrating the plugin-site components to arm64.

Then we'll migrate weekly.ci.jenkins.io to arm64; the corresponding arm64 image is already published.

lemeurherve commented 11 months ago

The latest plugin-site helm chart version, which includes the tag with an arm64 variant, is deployed on the publick8s cluster. Only the helmfile release changes remain to migrate its components to arm64.

smerle33 commented 11 months ago

plugin-site and plugin-site-issues migration to arm64 done.

Post mortem of the plugin-site and plugin-site-issues migration:

I forgot to update the chart versions in the PR https://github.com/jenkins-infra/kubernetes-management/pull/4683. I created the PR https://github.com/jenkins-infra/kubernetes-management/pull/4684 to fix it, but helm blocked the update with: Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

The solution was to roll back. First we list the releases with helm ls --namespace plugin-site, then we revert with helm rollback --namespace plugin-site plugin-site and helm rollback --namespace plugin-site plugin-site-issues. We can then launch a new kubernetes-management build on infra.ci.jenkins.io.
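
The rollback described above can be sketched as a small script that builds the same helm commands for every release in the namespace. This is only an illustration under the assumption that helm is on the PATH and already pointed at the right cluster; the function names are hypothetical.

```python
import subprocess
from typing import List


def rollback_commands(namespace: str, releases: List[str]) -> List[List[str]]:
    """Build the helm commands from the post mortem: list releases, then roll each one back."""
    commands = [["helm", "ls", "--namespace", namespace]]
    for release in releases:
        # Without an explicit revision, helm rollback returns to the previous revision
        commands.append(["helm", "rollback", "--namespace", namespace, release])
    return commands


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For this incident the call would be run_all(rollback_commands("plugin-site", ["plugin-site", "plugin-site-issues"])).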

To avoid this kind of problem, I need to first check the open PRs related to the release (we had https://github.com/jenkins-infra/kubernetes-management/pull/4671), and to plan the migration better so that more than one person can focus on and check the PR.

This had no impact on production: Kubernetes held the upgrade back because it had not succeeded.

dduportal commented 11 months ago

Before closing this issue, the following services will have to be migrated to arm64:

Next step after this issue (to be continued and detailed):

smerle33 commented 11 months ago

mirrorbits-parent: httpd and rsyncd are now on arm64:

Screenshot 2023-11-20 at 17 34 05

smerle33 commented 11 months ago

The first attempt to move WEEKLY.CI.JENKINS.IO to arm64 failed, probably because of my impatience. The next attempt will involve the following before merging the PR:

smerle33 commented 10 months ago

The first attempt to move WEEKLY.CI.JENKINS.IO to arm64 failed, probably because of my impatience. The next attempt will involve the following before merging the PR:

* manually scaling up by one arm node

* manually scaling the statefulset down to 0 to help the volume migration

Further testing and investigation revealed a zone incompatibility. ARM VMs for the node pool are only available in zone 1 (eastus2-1), while our other node pools are located in zone 3 (eastus2-3). The problem lies with the volumes, which cannot be mounted from one zone to another. We decided on the following plan:

Screenshot 2023-11-24 at 16 18 33, Screenshot 2023-11-24 at 16 18 43

smerle33 commented 10 months ago

We created a new storage class (https://github.com/jenkins-infra/azure/pull/526/) to use ZRS storage, and used a temporary PV/PVC on this volume as the source for the migration.
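
The zone constraint can be illustrated with a small model: a zonal disk can only be attached by a node in the same zone, while a zone-redundant (ZRS) disk can be attached from any zone. This is a simplified sketch and not the scheduler's actual logic; the class, the function and the disk names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Disk:
    name: str
    zone: Optional[str]  # None models a zone-redundant (ZRS) disk


def can_attach(disk: Disk, node_zones: Set[str]) -> bool:
    """True if a node pool spanning `node_zones` can mount the disk."""
    return disk.zone is None or disk.zone in node_zones


# Situation from the thread: data disk in zone 3, arm64 node pool in zone 1
zonal_disk = Disk("weekly-data", zone="3")
zrs_disk = Disk("weekly-data-zrs", zone=None)
arm_pool_zones = {"1"}
```

With these values, can_attach(zonal_disk, arm_pool_zones) is False and can_attach(zrs_disk, arm_pool_zones) is True, which is why the data was moved to a ZRS-backed volume first.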

It looks like there was a bug with the CSI volume clone: it failed with the following error on the new PVC created by the cloning system:

Warning  ProvisioningFailed    23s (x6 over 54s)  disk.csi.azure.com_csi-azuredisk-controller-68cfbf9cc6-vknhs_e4bc448f-ce5d-4e2c-9af5-2f47223a8443  failed to provision volume with StorageClass "managed-csi-premium-zrs-retain": rpc error: code = Internal desc = sourceResourceID(/subscriptions/redacted/resourceGroups/mc_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks//subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap) is invalid, correct format: .*/subscriptions/(?:.*)/resourceGroups/(?:.*)/providers/Microsoft.Compute/disks/(.+)   

We changed the volumeHandle of the source PV from /subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap to jenkins-weekly-snap (we exploited the CSI clone bug).
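
Judging only from the error message (not from the driver source), the clone path appears to prepend the disk path of the managed resource group to the source volumeHandle. The sketch below reproduces that assumed concatenation to show why a full resource ID ends up doubled while a bare disk name yields a single, well-formed ID. The function name is hypothetical.

```python
def source_resource_id(subscription: str, resource_group: str, volume_handle: str) -> str:
    """Assumed behaviour: prefix the volumeHandle with the resource group's disk path."""
    prefix = (
        f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
        "/providers/Microsoft.Compute/disks/"
    )
    return prefix + volume_handle


RESOURCE_GROUP = "mc_publick8s_publick8s-endless-ghoul_eastus2"
FULL_HANDLE = (
    "/subscriptions/redacted/resourcegroups/"
    "MC_publick8s_publick8s-endless-ghoul_eastus2"
    "/providers/Microsoft.Compute/disks/jenkins-weekly-snap"
)

# Full resource ID as volumeHandle: the prefix is applied on top of it
doubled = source_resource_id("redacted", RESOURCE_GROUP, FULL_HANDLE)
# Bare disk name as volumeHandle: a single, well-formed resource ID
fixed = source_resource_id("redacted", RESOURCE_GROUP, "jenkins-weekly-snap")
```

Here doubled contains "disks//subscriptions/", matching the sourceResourceID quoted in the ProvisioningFailed event, while fixed ends with "/disks/jenkins-weekly-snap".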

Everything went well: WEEKLY.CI.JENKINS.IO now runs on ARM64 🚀