(mirrors.jenkins.io / mirrors.jenkins-ci.org) Sunset the legacy "mirrorbrain" service in favor of get.jenkins.io

dduportal commented 2 years ago

Service(s)

Update center, Other

Summary

What Happened

For the past 4 weeks, the infra team has been receiving the following PagerDuty alert multiple times a day: Weird Response time https://updates.jenkins-ci.org.


The alerts are triggered by a threshold in the Datadog metrics collection for this service: https://github.com/jenkins-infra/docker-datadog/blob/main/conf.d/http_check.d/jenkins.yaml#L137-L148. As shown in the screenshot, the average HTTP response time stays above 10s most of the time while the alert is firing. (Screenshot: 2022-04-15 at 11:19:59)
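The same symptom can be checked by hand with a sketch like the one below. It is only an illustration and not the Datadog check itself: the helper names `probe` and `is_slow` are hypothetical, and the 10-second threshold is simply the value quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the total time (in seconds) of one HTTP request.
# --max-time keeps a hung server from blocking the check forever.
probe() {
  curl --silent --output /dev/null --max-time 30 --write-out '%{time_total}' "$1"
}

# Hypothetical helper: exit 0 when the measured time ($1) is above the
# threshold ($2). awk does the floating-point comparison.
is_slow() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Example (needs network access):
#   t="$(probe https://updates.jenkins.io)"
#   is_slow "$t" 10 && echo "slow response: ${t}s"
```

A single request is a much noisier signal than the averaged metric used by the alert, so this is only useful as a quick sanity check while on call.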

Most of the time, the alert resolves itself as the response time decreases. Sometimes, the person on duty (@MarkEWaite or I) has to SSH to the machine pkg.origin.jenkins.io and restart the Apache server (rebooting the machine would be the last resort).

Root cause

The (legacy) service referred to as mirrorbrain (hosting the services mirrors.jenkins.io and mirrors.jenkins-ci.org), which is also hosted on this VM, is causing peaks of CPU usage that slow down the other service, updates.jenkins.io.

Details on the configuration as code

Puppet configuration audit trail:
- VM definition: https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122
- This VM has the role `mirrorbrain`: https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/role/manifests/mirrorbrain.pp#L4-L7
- This role is composed of 4 profiles:
  - `base` (as all VMs managed by Puppet): https://github.com/jenkins-infra/jenkins-infra/blob/7c6d6609b650f1ef209cd590dd4568bcc676514c/dist/profile/manifests/base.pp
  - `mirrorbrain` (which defines the `mirrors.jenkins*` services that we want to sunset): https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/mirrorbrain.pp
  - `updatesite` (which defines the update center site, causing alerts because it is slowed down): https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/updatesite.pp
  - `pkgrepo` (used to build and host the Jenkins packages for debian/centos/etc., to be replaced later but not part of this issue: keep it for now): https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122

Proposal

Let's sunset the legacy service mirrorbrain in favor of the current get.jenkins.io modern mirror service based on mirrorbits!

Rationale:

mirrorbits defaults to HTTPS, while mirrorbrain only supports plain old HTTP
Why maintain 2 different mirror systems? End users do not benefit from this
mirrorbits can scale horizontally and efficiently (redis database, hosted in Kubernetes) and is updated regularly and automatically

Details about the mirrorbits service

- Mirrorbits Helm Chart: https://github.com/jenkins-infra/helm-charts/tree/main/charts/mirrorbits
- Configuration of get.jenkins.io (installation of the mirrorbits chart):
  - Helmfile manifest: https://github.com/jenkins-infra/kubernetes-management/blob/main/clusters/prodpublick8s.yaml#L180-L189
  - Custom values: https://github.com/jenkins-infra/kubernetes-management/blob/main/config/mirrorbits.yaml

In order NOT to break end-user installations, the domains mirrors.jenkins.io and mirrors.jenkins-ci.org should become CNAMEs pointing to the new mirrorbits system.

Known usages of the legacy mirror system

Jenkins users that are not able to use HTTPS
Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code:
Jenkins.io documentation:
- https://github.com/jenkins-infra/jenkins.io/search?q=mirrors.jenkins.io
- https://github.com/jenkins-infra/cn.jenkins.io/search?q=mirrors.jenkins.io
Jenkins Infra: https://github.com/jenkins-infra/jenkins.io/search?q=mirrors.jenkins.io

To Do List

[x] Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)
Communicate to end users:
- [x] Write a blog post on jenkins.io to communicate about the change
- [x] message on mailing lists jenkins-infra and jenkins-dev
- [x] message on the jenkinsci twitter account
- [x] message on IRC jenkins-infra and Gitter jenkins/jenkins
- [x] message on community.jenkins.io
[x] Once the deadline is reached: update the DNS (existing!) records in Azure (either manually or in jenkins-infra/azure if DNS records have been imported) to CNAME to the public DNS associated with the ingresses
[x] Update the Puppet repository to remove the mirrorbrain profiles
[x] Clean up the VM: remove the former Apache vhosts + postgresql (+ any resource from the mirrorbrain profile)

dduportal commented 2 years ago

Ping @daniel-beck @MarkEWaite @olblak @lemeurherve @timja @halkeye @jnord @jglick for info, review and advice (in case I forgot anything)

halkeye commented 2 years ago

> Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code:

evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

> Jenkins users that are not able to use HTTPS

Are they still able to? Or will we be killing that access path?

olblak commented 2 years ago

"Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)"

What do you think about just deprecating this DNS record? Officially it's not used anymore. Or use it as the k8s cluster fallback: you would cleanly deploy mirrorbits on the machine pkg.jenkins.io, so that if something goes wrong with the k8s cluster, you still have it working.

Btw, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime

dduportal commented 2 years ago

> evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

Thanks for the tip! It confirms that what we did in #2040 was correct. For information, https://github.com/jenkins-infra/evergreen is marked as an "archived" repository.

> Jenkins users that are not able to use HTTPS: are they still able to? Or will we be killing that access path?

They are still able to, and we'll kill this access path, as the change implies forcing a redirect to https.

If mirrors.jenkins.io or mirrors.jenkins-ci.org is used to download any file (war, plugin, or package), then it is HTTP only (there is no vhost for these domains at all, no certificates, and it defaults to https://pkg.origin.jenkins.io/, with an expected TLS security alert for domain mismatch).

> What do you think to just deprecated this DNS record.

Thanks for the tip! You know that I like deleting things ;) But it might be a bit too harsh to kill this domain. Using a CNAME to get.jenkins.io would allow a smooth transition. Once we have tracked down as many usages as we can (such as code in the jenkinsci GH org) and switched them to get.jenkins.io, we can monitor access for 2-3 months to see what usage remains, and maybe decide to kill it at that time.

> Btw you may have notice that but we have a mirrorbits binary in the /opt directory that we used multiple time in the past to mitigate cluster downtime

Good reminder! That will be the next subject. The current get.jenkins.io, which runs across the Kubernetes cluster, is still more available than mirrorbrain on its single VM. I don't know about response time though. So once mirrorbrain is killed, we'll check the fallback solution for DRS of the Kubernetes cluster.

dduportal commented 2 years ago

Opened the PR https://github.com/jenkins-infra/pipeline-library/pull/374 in the shared library + notified with an email on the dev mailing list https://groups.google.com/g/jenkinsci-dev/c/anTCx9Q6mLI

dduportal commented 2 years ago

Thanks @MarkEWaite and @timja for https://github.com/jenkinsci/jep/pull/386 on this area!

dduportal commented 2 years ago

Another PR on the PCT: https://github.com/jenkinsci/plugin-compat-tester/pull/363

dduportal commented 2 years ago

Other references found on the github.com/jenkinsci organization are not worth the changes (README or deprecated projects such as evergreen)

dduportal commented 2 years ago

As per @MarkEWaite's messages in the #jenkins-infra IRC channel:

> Been receiving alerts that updates.jenkins.io is slow to respond. The pkg.jenkins.io top output shows postgres heavily loaded. Stopping and restarting Apache in hopes that reduces load.
> Disc use on the /dev/xvda1 disc is at 87%. Vacuumed the logs from using 4 GB to using 1 GB and didn't change the disc use percentage at all. We may need to expand the disc on that machine or remove more services.

Opening maintenance window on status.jenkins.io: https://github.com/jenkins-infra/status/pull/157

dduportal commented 2 years ago

Resized the root volume from 1000 to 1200 GB:

Took a snapshot of the disk as an AMI with today's date (in case something goes wrong)
Stopped the instance
Increased the EBS root volume size to 1200 GB
Restarted the instance

The file system was automatically resized:

$ df -hT / # Right after reboot
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     ext4  1.2T  811G  323G  72% /
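To keep an eye on the same figure after the resize, the Use% column can be extracted with a small helper. This is a minimal sketch: the function name `use_pct` is hypothetical, and it only assumes that the mount point is the last column and Use% the one before it, which holds for both `df -P` and the `df -hT` output above.

```shell
#!/bin/sh
# Hypothetical helper: read df output on stdin and print the Use% value
# (without the % sign) for the mount point given as $1.
use_pct() {
  awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Example:
#   pct="$(df -P / | use_pct /)"
#   [ "$pct" -ge 85 ] && echo "root volume is ${pct}% full"
```

The 85% value in the example is an arbitrary illustration, not a threshold used by the infra team.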

dduportal commented 2 years ago

Failed to change the instance size:

Today, we are using an m4.2xlarge VM (ref. https://aws.amazon.com/ec2/instance-types/). This instance type features 8 vCPUs on 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) or 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors. Its rate is $0.40 per hour (~$295 per month).

$ cat /proc/cpuinfo  | grep Xeon | sort | uniq
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
$ grep -c processor /proc/cpuinfo
8

The idea was to try to migrate to a new instance size that would benefit from:

Better CPU: new generation of Xeon or AMD EPYC (increase peak and clock performances, new instruction set, better core management)
Increase network bandwidth
Decrease costs

Check the following table to compare instance types, with the following rules:

Same amount of vCPU
At least 16 GB of memory (currently 32 GB, but only 6 to 10 GB are used)
Only "General Purpose" or "Compute Optimized" families, as this VM is bound to network and CPU (I/O and memory are negligible)

Instance Type	CPU Family	vCPUs	Memory (GB)	Network Bandwidth	EBS Bandwidth	Hourly Rate (on-demand)
m4.2xlarge (Current)	Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz	8	32	Up to 10 Gbps	1,000 Mbps	$0.40
m5.2xlarge	3.1 GHz Intel Xeon® Platinum 8175M	8	32	Up to 10 Gbps	Up to 4,750 Mbps	$0.384
c6i.2xlarge	3.5 GHz 3rd generation Intel Xeon	8	16	Up to 12.5 Gbps	Up to 10,000 Mbps	$0.34
m5a.2xlarge	AMD EPYC 7000 series 2.5 GHz	8	32	Up to 10 Gbps	Up to 2,880 Mbps	$0.344
c6i.2xlarge	3.5 GHz 3rd generation Intel Xeon	8	16	Up to 12.5 Gbps	Up to 6,600 Mbps	$0.34
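To compare the hourly rates in the table on a monthly basis, a rough conversion can be sketched as below. The function name `monthly_cost` is hypothetical, and the 730 hours per month (8760 hours / 12) is an assumption; it ignores reserved pricing, storage and traffic.

```shell
#!/bin/sh
# Hypothetical helper: convert an on-demand hourly rate (USD) into an
# approximate monthly cost, ASSUMING 730 hours per month.
monthly_cost() {
  LC_ALL=C awk -v rate="$1" 'BEGIN { printf "%.2f\n", rate * 730 }'
}

# Example:
#   monthly_cost 0.40   # current m4.2xlarge rate
#   monthly_cost 0.34   # c6i.2xlarge rate
```

With this assumption the current instance comes out at about 292 USD per month, in line with the rough figure quoted earlier, and the cheapest candidates in the table at about 248 USD.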

Alas, each attempt to change the instance type ended with the error message "configuration not documented" when starting the instance.

Enabling enhanced networking (ENA support) did not change anything (but it is enabled now):

$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[]
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 --ena-support --region us-east-1
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[
  true
]

Let's keep this instance size for now: the AMI snapshot could be used to try creating a new instance, but our effort is better spent on https://github.com/jenkins-infra/helpdesk/issues/2649

dduportal commented 2 years ago

While looking for a short-term workaround for the high CPU usage on this machine, I stumbled across the following error message in the Apache error logs:

AH00632: failed to prepare SQL statements: ERROR:  relation "pfx2asn" does not exist\nLINE 1: ...EPARE asn_dbd_1 (varchar) AS SELECT pfx, asn FROM pfx2asn WH...\n

This error is related to the mirrorbrain installation:

Missing table in the PgSQL database
This table is related to the mod_asn
Should be created during the mirrorbrain installation, along with the Ubuntu APT package postgresql-*-ip4r

But this machine is a mess: there were 3 different postgresql server installations, each one on a different port:

postgresql-9.3, port 5432, used by mirrorbrain. The "production" one.
postgresql-9.5, port 5433, with a copy of the database from 2021. Smells like an attempted upgrade, or an incomplete puppet run from when @MarkEWaite and I ensured that this Ubuntu was fully 18.04.
postgresql-10, port 5433, but stopped (conflict with postgresql-9.3), which is the default version for Ubuntu 18.04, installed with the apt-get dist-upgrade operations.
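With several clusters on one host, a port that is configured twice is worth spotting before starting or upgrading anything. The sketch below reads `pg_lsclusters` output and prints every port used by more than one cluster. The function name `duplicate_ports` is hypothetical, and it assumes the usual column order of `pg_lsclusters` (Ver, Cluster, Port, ...), with a header on the first line.

```shell
#!/bin/sh
# Hypothetical helper: read `pg_lsclusters` output on stdin and print each
# port that appears for more than one cluster (third column, header skipped).
duplicate_ports() {
  awk 'NR > 1 { seen[$3]++ } END { for (p in seen) if (seen[p] > 1) print p }' | sort -n
}

# Example (on a Debian/Ubuntu host with postgresql-common installed):
#   pg_lsclusters | duplicate_ports
```

An empty output means that no two clusters share a port.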

Since this VM has not been managed by puppet for some time, the following operations were done manually:

Fully migrated the PostgreSQL instance to postgresql 10, to allow installation of the only available ip4r postgresql package, postgresql-10-ip4r

# Ensure postgresql 10 is installed properly
$ apt-get -y install postgresql-10
$ dpkg --get-selections | grep postgresql # Sanity check

# Migrate the actual 9.3 cluster named `main` to version 10 with the same name
$ pg_lsclusters
$ pg_renamecluster 10 main main_ver10
$ pg_lsclusters # Sanity check
$ systemctl stop postgresql@9.3-main.service 
$ pg_upgradecluster 9.3 main # Restarts the instance once done
$ pg_lsclusters # Sanity check

Cleaned up old postgresql versions

## Cleanup
$ pg_dropcluster --stop 9.3 main
$ pg_dropcluster --stop 10 main_ver10
$ pg_dropcluster --stop 9.5 main
$ apt-get remove --purge postgresql-9.3 postgresql-client-9.3 postgresql-9.5 postgresql-client-9.5
$ dpkg --get-selections | grep postgresql # Sanity check

Installed and configured ip4r in the database (as it was missing)

# Ensure ip4r is installed properly
$ apt-get -y install postgresql-contrib postgresql-10-ip4r

# Create extension in the pgsql instance, as Pg superuser
$ su - postgres
$ psql # Top-level
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
$ psql --dbname=jenkins_mirrorbrain_db # On the mirrorbrain database
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q

# Load the ASN script, now that the primitive type `iprange` is provided by the ip4r extension
$ psql --host=localhost --username=jenkins_mirrorbrain --password --dbname=jenkins_mirrorbrain_db --file=/usr/share/doc/libapache2-mod-asn/asn.sql
password: <redacted>

# Ensure everything is loaded and available
$ apt update && apt-get dist-upgrade && apt-get autoremove --purge && update-grub && reboot

Ensured that the error message no longer appears in the apache logs:

$ tail -f /var/log/apache2/*log

dduportal commented 2 years ago

Another error on the apache log, but no solution for now:

[Sat May 07 10:47:48.548369 2022] [mpm_event:error] [pid 1651:tid 140147096673216] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

Sounds related to https://www.claudiokuenzler.com/blog/948/apache-2.4-mpm-event-bug-freezing-up-scoreboard-full-after-reload (yes, we are using MPM event, and /server-status shows a lot of Apache threads in a G state for a loooong time).
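The number of workers stuck in the G (gracefully finishing) state can be tracked over time from the machine-readable status page. This is a minimal sketch: the function name `count_graceful` is hypothetical, and it assumes that mod_status is reachable locally at /server-status and that its `?auto` output contains a `Scoreboard:` line.

```shell
#!/bin/sh
# Hypothetical helper: read the `?auto` output of mod_status on stdin and
# print how many scoreboard slots are in the "G" state.
count_graceful() {
  awk '/^Scoreboard:/ {
    line = $0
    sub(/^Scoreboard: */, "", line)   # keep only the slot characters
    print gsub(/G/, "", line)         # gsub returns the number of matches
  }'
}

# Example (ASSUMES mod_status is enabled and reachable on localhost):
#   curl --silent 'http://localhost/server-status?auto' | count_graceful
```

Logging this value every few minutes next to the sysstat data would show whether the G-state threads pile up before the slowness starts.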

To help investigate this, installed sysstat to get finer-grained metrics:

$ apt-get update -q && apt-get install -y sysstat
$ vi /etc/default/sysstat # changed `ENABLED` to `true`
$ vi /etc/cron.d/sysstat # changed to collection every 2 min
$ systemctl enable sysstat
$ systemctl start sysstat

It appears that there are peaks of CPU on %system when the slowness appears:

10:00:01 AM     all      8.87      0.00      3.10      0.06      0.12     87.86
10:02:01 AM     all     28.04      0.00      4.75      0.27      0.14     66.79
10:04:01 AM     all     26.21      0.00      4.87      0.15      0.21     68.55
10:06:01 AM     all     33.32      0.00     12.11      0.13      1.68     52.77
10:08:01 AM     all     30.51      0.00     11.64      0.08      1.68     56.08
10:10:01 AM     all     27.46      0.00     13.96      0.05      1.72     56.81
10:12:01 AM     all     30.69      0.00     13.89      0.11      1.66     53.66
10:14:01 AM     all     30.90      0.00     11.48      0.11      1.69     55.82
10:16:01 AM     all     27.94      0.00     13.86      0.08      1.71     56.41
10:18:01 AM     all     29.40      0.00     14.48      0.07      1.66     54.39
10:20:01 AM     all     27.84      0.00     13.03      0.06      1.72     57.35
10:22:01 AM     all     23.33      0.00      4.35      0.14      0.23     71.96
10:24:01 AM     all     21.31      0.00      3.50      0.06      0.11     75.01
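The intervals with a high %system share can be picked out of such a report automatically. The sketch below is an illustration only: the function name `high_system` is hypothetical, and it assumes the default `sar -u` column order with a 12-hour clock, as in the excerpt above (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle).

```shell
#!/bin/sh
# Hypothetical helper: read `sar -u` lines on stdin and print the timestamp
# of every "all" row whose %system (6th field) is above the threshold $1.
high_system() {
  awk -v max="$1" '$3 == "all" && $6 + 0 > max + 0 { print $1, $2 }'
}

# Example (10 is an arbitrary threshold for illustration):
#   sar -u | high_system 10
```

On the excerpt above, a threshold of 10 selects the rows from 10:06:01 AM to 10:20:01 AM.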

We might check the configuration history:

When @olblak and I decreased the instance size from 16 to 8 vCPUs, we might have failed to update the MPM worker threads configuration
- Check and fine-tune the current configuration for 8 vCPUs?
- Go back to 16 vCPUs (but on a new instance generation)?
Maybe MPM event is not the best solution with Apache 2.4: consider switching to MPM prefork

dduportal commented 2 years ago

Let's see how the machine behaves with the postgresql + ip4r fix.

dduportal commented 2 years ago

Merged the PR on the pipeline library: let's monitor the upcoming Jenkins core, ATH and bom builds.

dduportal commented 2 years ago

DNS record mirrors.jenkins.io changed from IN A 52.202.51.185 to CNAME get.jenkins.io. (TTL 1 min) today at ~08:10am UTC

dduportal commented 2 years ago

Outage on updates.jenkins.io as a consequence of this change: the DNS record updates.jenkins.io was a CNAME to mirrors.jenkins.io (reported by a user in IRC / the gitter channel jenkins/jenkins around 08:47am)
- DNS record updates.jenkins.io changed to IN A 52.202.51.185 around 09:00am UTC, and its TTL was changed from 1 hour to 1 minute
- It was an unplanned side effect. We should have checked this DNS record beforehand. Expect 1 hour for DNS caches to update. Until then, updates.jenkins.io is considered in full outage (because it is redirected to the Kubernetes cluster until DNS caches are refreshed)
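This kind of side effect can be caught before changing a record by listing every other record that points at it. The sketch below works on a plain-text export of the zone. The function name `cname_dependents` and the one-record-per-line format `<name> <type> <value>` are hypothetical; a real export from the DNS provider would first need to be converted to that shape.

```shell
#!/bin/sh
# Hypothetical helper: read "<name> <type> <value>" records on stdin and
# print every name that is a CNAME to the target given as $1.
# A trailing dot on names and values is ignored.
cname_dependents() {
  awk -v target="$1" '
    {
      name = $1; rtype = toupper($2); value = $3
      sub(/\.$/, "", name); sub(/\.$/, "", value)
    }
    rtype == "CNAME" && value == target { print name }'
}

# Example (records.txt is a hypothetical export file):
#   cname_dependents mirrors.jenkins.io < records.txt
```

Any name printed by this check will follow the target to its new destination, so it needs its own decision before the change is applied.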

dduportal commented 2 years ago

Starting maintenance on the VM:

Checking the logs of the service mirrors.jenkins.io to be sure
Snapshotting the VM for backup
Stopping postgresql and the mirrorbrain service, waiting 1 hour, then cleaning them up if there is no error

In parallel, https://github.com/jenkins-infra/status/pull/166 was opened to prepare puppet so we can put this machine under automatic puppet management again.

dduportal commented 2 years ago

Ran the following command on the VM (after snapshotting + backing up the postgres data):

$ apt-get remove --purge postgresql-10 postgresql-10-ip4r postgresql-client-10 postgresql-client-common postgresql-common postgresql-contrib mirmon mirrorbrain mirrorbrain-scanner mirrorbrain-tools
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  formencode-i18n libalgorithm-c3-perl libauthen-sasl-perl libb-hooks-endofscope-perl libclass-c3-perl libclass-c3-xs-perl libclass-data-inheritable-perl
  libclass-inspector-perl libclass-method-modifiers-perl libclass-singleton-perl libconfig-inifiles-perl libdata-dump-perl libdata-optlist-perl
  libdatetime-locale-perl libdatetime-perl libdatetime-timezone-perl libdbd-pg-perl libdbi-perl libdevel-caller-perl libdevel-lexalias-perl
  libdevel-stacktrace-perl libdigest-md4-perl libencode-locale-perl libeval-closure-perl libexception-class-perl libfile-listing-perl libfile-sharedir-perl
  libfont-afm-perl libhtml-form-perl libhtml-format-perl libhtml-parser-perl libhtml-tagset-perl libhtml-tree-perl libhttp-cookies-perl libhttp-daemon-perl
  libhttp-date-perl libhttp-message-perl libhttp-negotiate-perl libio-html-perl libio-socket-inet6-perl libio-socket-ssl-perl liblwp-mediatypes-perl
  liblwp-protocol-https-perl libmailtools-perl libmodule-implementation-perl libmro-compat-perl libnamespace-autoclean-perl libnamespace-clean-perl
  libnet-http-perl libnet-smtp-ssl-perl libnet-ssleay-perl libpackage-stash-perl libpackage-stash-xs-perl libpadwalker-perl libparams-util-perl
  libparams-validationcompiler-perl libreadonly-perl libref-util-perl libref-util-xs-perl librole-tiny-perl libsocket6-perl libspecio-perl
  libsub-exporter-perl libsub-exporter-progressive-perl libsub-identify-perl libsub-install-perl libsub-quote-perl libtry-tiny-perl liburi-perl
  libvariable-magic-perl libwww-perl libwww-robotrules-perl perl-openssl-defaults python-cmdln python-dnspython python-formencode python-mb
  python-pkg-resources python-pydispatch python-sqlobject python3-dnspython python3-formencode python3-pydispatch python3-sqlobject sqlobject-admin
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  mirmon* mirrorbrain* mirrorbrain-scanner* mirrorbrain-tools* postgresql-10* postgresql-10-ip4r* postgresql-client-10* postgresql-client-common*
  postgresql-common* postgresql-contrib*
0 upgraded, 0 newly installed, 10 to remove and 3 not upgraded.
After this operation, 20.9 MB disk space will be freed.

followed by the autoremove.

Also, manually removed all the apache vhost configurations (after backing them up) related to the domains mirrors.jenkins or get.jenkins.io.

Still some apache config to clean up

dduportal commented 2 years ago

Cleaned up any remnant of mirrorbrain / mirrors on the VM
Enabled puppet management again with a new agent name pkg (with the whole puppet certificate regeneration)
Ran a dry run of the puppet agent, and backed up all files touched
Puppet apply ran successfully
VM rebooted with puppet enabled

dduportal commented 2 years ago

https://github.com/jenkins-infra/status/pull/167

dduportal commented 2 years ago

Just in case: backups of the apache2 etc and var directories are in /root if anything breaks, and there is a snapshot of the VM root volume in AWS

dduportal commented 2 years ago

Summary of the past days:

Enabling Puppet on the VM broke the pkg service in numerous ways, as shown in (at least) https://github.com/jenkins-infra/helpdesk/issues/2957, because a configuration that had not been updated for 2 years was deployed :'(
- No more http -> https redirection on the apache service (to be fixed), but a fix was applied by a contributor by switching all the repo URLs to https - Fixed in https://github.com/jenkins-infra/jenkins-infra/pull/2192
- vhost + certificates for pkg.origin.jenkins.io broken - Fixed in https://github.com/jenkins-infra/jenkins-infra/pull/2188
- the former repository public key in the puppet code was still Kohsuke's - Fixed in https://github.com/jenkins-infra/jenkins-infra/pull/2195

A lot of people helped, and I'm really grateful for it!

Next step:

Check whether the http to https redirection is needed and, if so, whether it is enforced for the pkg and update services
Ensure that nothing else is broken
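Whether the redirection is actually enforced can be verified from the response headers of a plain http request. This is a minimal sketch: the function name `redirects_to_https` is hypothetical, and the check only looks at the status line and the Location header of the first response.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and exit 0 when
# the response is a 3xx redirect whose Location starts with https://.
redirects_to_https() {
  awk '
    { sub(/\r$/, "") }                                  # drop the CR of CRLF line endings
    NR == 1 && $2 ~ /^3[0-9][0-9]$/ { redirect = 1 }    # status line, e.g. "HTTP/1.1 301 ..."
    tolower($1) == "location:" && $2 ~ /^https:\/\// { https = 1 }
    END { exit !(redirect && https) }'
}

# Example (needs network access):
#   curl --silent --head http://pkg.jenkins.io/ | redirects_to_https \
#     && echo "redirect enforced" || echo "no https redirect"
```

Running the same check against the update service host would cover both services named in the next steps.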

dduportal commented 2 years ago

Yet another incident due to this issue: https://github.com/jenkins-infra/helpdesk/issues/2960

dduportal commented 2 years ago

Closing as the incidents seem to be gone (all of them).
