jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

[ci.jenkins.io] Use a new VM instance type #3535

Closed dduportal closed 1 year ago

dduportal commented 1 year ago

What is the problem?

The current VM for ci.jenkins.io is starting to show issues:

Also, this VM was sized a few years ago in a slightly different context: JDK8 for running the controller (e.g. less CPU usage but more memory usage), no UEFI bootloader (generation 1 VM), Ubuntu 18.04.

Finally, the infrastructure layer of this VM is managed manually (it was initially created with Terraform, then switched to manual management).

What should we do

There are numerous tasks for this VM:

How could we do it

Proposal: to avoid any maintenance overhead and migration risk, the infra team thought of the following plan:

=> this would avoid disrupting the current ci.jenkins.io service until the effective migration

Validation steps would be:

dduportal commented 1 year ago

Should use a backup policy (merged #3527):

The goal is to ensure we have a daily backup of the JENKINS_HOME of ci.jenkins.io

Azure provides a backup system that can be used specifically for managed disks such as this one: https://learn.microsoft.com/en-us/azure/backup/backup-managed-disks.

We don't (and should not) need a VM-level backup as we use Puppet to manage the system: disaster recovery for ci.jenkins.io consists of installing a blank new VM and mounting the restored Jenkins data disk.

As per https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/data_protection_backup_instance_disk, we can define this using Terraform, which implies importing the ci.jenkins.io VM once and for all.
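As a small illustration of the "daily backup" goal: the azurerm disk backup policy takes its schedule as ISO 8601 repeating time intervals (the `backup_repeating_time_intervals` argument). The sketch below only builds such a string for a once-a-day schedule. The helper name and the start time are hypothetical and not taken from the actual jenkins-infra configuration.

```python
from datetime import datetime, timezone


def daily_backup_interval(first_run: datetime) -> str:
    """Build an ISO 8601 repeating interval that recurs every day from first_run."""
    if first_run.tzinfo is None:
        raise ValueError("first_run must be timezone-aware")
    # "R" without a count means "repeat indefinitely"; "P1D" is a one-day period.
    return f"R/{first_run.isoformat(timespec='seconds')}/P1D"


# Hypothetical start time (02:00 UTC), chosen only for illustration.
print(daily_backup_interval(datetime(2023, 7, 1, 2, 0, tzinfo=timezone.utc)))
```

The resulting string is what would go into the policy's schedule list. The retention duration and the vault itself are separate settings that this sketch does not cover.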

A word about encryption:

  • The backup vault is, like the VM disks, encrypted at rest with an Azure platform-managed key (PMK, hardware level).
  • We can keep this behavior (encryption at rest with PMK) for the backup, as ci.jenkins.io does not hold any sensitive data (possibly credentials for the GH org, but that is all).
  • Note: a custom private key could be provided for this encryption for sensitive backups such as those of trusted.ci.

dduportal commented 1 year ago

Delaying as it's blocked by the network peering accesses part of #3351 and by the work on the new trusted machines in #3486

dduportal commented 1 year ago

Current status: we have to switch the Azure VM agents to inbound mode (thanks to https://github.com/jenkinsci/azure-vm-agents-plugin/pull/406) to ensure that migrating either the controller first or the ephemeral agents first won't break the connections (otherwise we would need to temporarily use public IPs for the SSH agents).

dduportal commented 1 year ago

Update: ci.jenkins.io is now using inbound agents running from the new virtual network.

Watching the builds (ping @lemeurherve: not urgent, but I'll try to check the CI integration in Datadog to see if any pattern arises here - #3573)

dduportal commented 1 year ago

Next step: bootstrapping a fully operational VM for the new ci.jenkins.io

⚠️ The old JENKINS_HOME seems to have inode issues (whether on the old disk or on the snapshots). Despite the full disk having been copied yesterday, rsync sees all files as changed: the copy is in progress but will take multiple hours (~12 hours). ci.jenkins.io will remain down until 5 July.
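For context on "rsync sees all files as changed": by default rsync uses a quick check and retransfers any file whose size or modification time differs between source and destination. The sketch below mimics that check to estimate how many files a re-sync would touch. It is only an illustration; it does not diagnose the inode issue mentioned above, and the function name is hypothetical.

```python
import os
from pathlib import Path


def quick_check_changed(src_root: Path, dst_root: Path) -> list[str]:
    """Return relative paths that a size + mtime quick check would flag as changed."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = Path(dirpath) / name
            rel = src.relative_to(src_root)
            dst = dst_root / rel
            try:
                s, d = src.stat(), dst.stat()
            except FileNotFoundError:
                # Missing on the destination (or a broken link): needs a transfer.
                changed.append(str(rel))
                continue
            # Compare size and whole-second modification time.
            if s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime):
                changed.append(str(rel))
    return sorted(changed)
```

If a previous copy did not preserve modification times, every file fails this check even when the contents are identical, which is one common way to end up with a full retransfer.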

dduportal commented 1 year ago

Update (4th of July):

dduportal commented 1 year ago

Update (5th of July):

dduportal commented 1 year ago

Update (5th of July)

dduportal commented 1 year ago

Todo list to close this issue:

dduportal commented 1 year ago

Closing the issue as the work is finished