Closed: dduportal closed this issue 1 year ago
Update:
Puppet Server installation:

- Added `20.12.27.65 puppet.jenkins.io` to `/etc/hosts` to avoid unwanted connections to the former VM `puppet-*`
- `puppet` commands available in `PATH`
- Set the hostname:

  ```shell
  hostnamectl set-hostname puppet.jenkins.io && hostname -f # puppet.jenkins.io
  ```

- Checked `/etc/puppetlabs/puppet/puppet.conf` for the proper hostname
- Restored `/var/lib/puppet/keys`: the `pe-puppet` user should be the owner, read-only for user (`chmod 0400`, matching the listing below):

  ```shell
  $ ls -l /var/lib/puppet/keys
  total 8
  -r-------- 1 pe-puppet root 1679 Jun 1 10:52 private_key.pkcs7.pem
  -r-------- 1 pe-puppet root 1050 Jun 1 10:52 public_key.pkcs7.pem
  ```
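The expected key permissions can be re-applied and checked with a short script. This is a sketch demonstrated on a temporary directory, since touching the real `/var/lib/puppet/keys` requires root and the `pe-puppet` user:

```shell
# Demo of the expected key permissions on a throwaway directory
# (the real path is /var/lib/puppet/keys, owned by pe-puppet; assumption:
# mode 0400 = read-only for the owner, matching the listing above).
dir="$(mktemp -d)"
touch "$dir/private_key.pkcs7.pem" "$dir/public_key.pkcs7.pem"
chmod 0400 "$dir"/*.pkcs7.pem
stat -c '%a %n' "$dir"/*.pkcs7.pem   # prints "400 <path>" for each key
rm -rf "$dir"
```

On the real server, the same `chmod 0400` would be preceded by a `chown pe-puppet` of both files.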
Ran the restore:

```shell
puppet-backup restore /root/pe_backup-2023-05-31_14.55.46_UTC.tgz
# Check the Master hostname IS "puppet.jenkins.io"
###
Step 1 of 10: Stopping PE related services
# ...
#
# Stuck at 10 of 10, because:
# - Service pe-puppetdb stuck during its startup: https://tickets.puppetlabs.com/browse/PDB-4785
# - Logs in /var/log/puppetlabs/puppetdb/puppetdb.log show postgres is started, but the connection puppetdb <-> postgres fails during the TLS handshake (confirmed with tcpdump)
# - https://tickets.puppetlabs.com/browse/PDB-4625
```

(Ref. https://www.puppet.com/docs/pe/2019.8/backing_up_and_restoring_pe.html)
- Looked at https://www.puppet.com/docs/puppetdb/7/postgres_ssl.html#using-a-custom-java-keystore (yes, version 7, but the keystore is the same)
- Tried disabling SSL for the puppetdb PostgreSQL connection: still the "connection timeout" error. I was led down a bad path by the Puppet issues above.
- Running `curl -v puppet.jenkins.io:5432` helps reproduce the "connection timeout" error: using the public IP forces TCP packets to exit the VM to the `dmz` subnet, where the security groups forbid inbound requests on port 5432 => that is the real reason
- Solution: update `/etc/hosts` with the private IP instead. Solved the problem!
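The difference between a dropped packet (timeout) and a reachable-but-closed port can be probed quickly. A sketch, assuming port 5432 and bash's `/dev/tcp` redirection; the private IP shown is hypothetical:

```shell
# Probe a TCP port: "open" if the connection succeeds, otherwise
# "closed or filtered". A security group silently dropping packets
# shows up as a slow timeout, while a refused connection fails instantly.
probe() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/5432" 2>/dev/null; then
    echo "$1: open"
  else
    echo "$1: closed or filtered"
  fi
}
probe 20.12.27.65   # public IP: filtered by the dmz security group => timeout
probe 10.0.0.4      # hypothetical private IP: would reach postgres directly
```

This is how the public-IP `/etc/hosts` entry was identified as the culprit: same hostname, different route, different firewall rules.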
A new cycle of uninstall and reinstall, following the same steps as the comment above.
Restore went well:
```
Log messages will be saved to /var/log/puppetlabs/pe-backup-tools/pe_restore-2023-06-01_13.24.25_UTC.log
Step 1 of 10: Stopping PE related services
Step 2 of 10: Cleaning the agent certificates from previous PE install
Step 3 of 10: Restoring PE file system components
Step 4 of 10: Restoring the pe-orchestrator database
Step 5 of 10: Restoring the pe-rbac database
Step 6 of 10: Restoring the pe-classifier database
Step 7 of 10: Restoring the pe-activity database
Step 8 of 10: Restoring the pe-inventory database
Step 9 of 10: Restoring the pe-puppetdb database
Step 10 of 10: Configuring PE on newly restored master
Backup restored.
Time to restore: 4 min, 6 sec
Size: 2.26 GB, Scope: code, puppetdb, config, certs
To finish restoring your primary server from backup, run the following commands:
puppet agent --test
```
```shell
$ ls -l /root/.ssh/config /root/.ssh/deploy_key
-rw-r--r-- 1 root root   55 Jun 1 10:57 /root/.ssh/config
-r-------- 1 root root 1679 Jun 1 10:57 /root/.ssh/deploy_key

$ cat /root/.ssh/config
Host github.com
  IdentityFile /root/.ssh/deploy_key

$ ssh -T git@github.com
Hi jenkins-infra/jenkins-keys! You've successfully authenticated, but GitHub does not provide shell access.
```
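The `Host github.com` stanza is what makes a plain `ssh git@github.com` pick up the deploy key; `ssh -G` prints the resolved client configuration without connecting. A sketch against a throwaway config file (the real one is `/root/.ssh/config`):

```shell
# Build a scratch ssh config equivalent to the one above, then ask ssh
# which identity file it would use for github.com (no connection is made).
cfg="$(mktemp)"
printf 'Host github.com\n  IdentityFile /root/.ssh/deploy_key\n' > "$cfg"
ssh -G -F "$cfg" github.com | grep -i '^identityfile'
rm -f "$cfg"
```

`ssh -G` is a quick way to confirm the config matches before testing against GitHub for real.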
```shell
$ r10k deploy environment --color --verbose --puppetfile
# No errors, WARN accepted

$ puppet agent --test
# ...
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find class pe_console_prune for puppet.jenkins.io on node puppet.jenkins.io
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/datadog_reports: Datadog report config file /etc/datadog-agent/datadog-reports.yaml not readable
```
After a lot of dies and retries, solved it with a hack:

- Copied the `pe_console_prune` module from the radish VM (`/opt/puppetlabs/puppet/modules/pe_console_prune`) to the new machine
- Ran `puppetserver gem cleanup`

Initial agent run was successful \o/
Notes:

- Still unsure where the `pe_console_prune` requirement comes from. Found an occurrence in `/opt/puppetlabs/server/data/puppetserver/yaml/node/puppet.jenkins.io.yaml` (restored from the backup) and removed it, but this one might be cached somewhere, as the agent run still showed the error.
- `puppet module list` shows a LOT of incompatible dependencies, logged as `WARN` messages. Not blocking, but it looks like a lot of the modules (example: datadog and apt) are not really updated in step with each other.

Update:
- Cleaned up `/etc/hosts`, which had unused entries (including one for `puppet.jenkins.io`)
- Searching for `puppet.jenkins.io` in the code (https://github.com/search?q=org%3Ajenkins-infra%20puppet.jenkins.io&type=code) led to:
- Searching for the old public IP (`140.211.9.94`) in the code (https://github.com/search?q=org%3Ajenkins-infra+140.211.9.94&type=code) does not show another occurrence

Closing as it works as expected.
Service(s)
Azure, Other
Summary
Upgrade of the `puppet.jenkins.io` VM to Ubuntu 22.04 broke the Puppet Enterprise server in https://github.com/jenkins-infra/helpdesk/issues/2982#issuecomment-1570715518, as Jammy is not supported by PE 🤦

This issue tracks the work to migrate the VM to an Azure Terraform-managed VM to restore the service (as we have backups taken before the Ubuntu migration).
Pros:
Cons:
Reproduction steps
No response