jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

(mirrors.jenkins.io / mirrors.jenkins-ci.org) Sunset the legacy "mirrorbrain" service in favor of get.jenkins.io #2888

Closed dduportal closed 2 years ago

dduportal commented 2 years ago

Service(s)

Update center, Other

Summary

What Happened

For the past 4 weeks, the infra team has been receiving the following PagerDuty alert multiple times a day: Weird Response time https://updates.jenkins-ci.org.

Details: the alerts are triggered by a threshold in the Datadog metrics collection for this service: https://github.com/jenkins-infra/docker-datadog/blob/main/conf.d/http_check.d/jenkins.yaml#L137-L148. As shown in the screenshots, the average HTTP response time rises above 10s most of the time when the alert is triggered. [Screenshots: 2022-04-15 at 11:19:59 and 2022-04-15 at 11:19:35]

Most of the time, the alert resolves itself as the response time decreases. Sometimes, the person on duty (@MarkEWaite or I) has to SSH to the machine pkg.origin.jenkins.io and restart the Apache server (rebooting the machine would be the last option).
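For anyone who wants to reproduce the symptom by hand, here is a minimal sketch of the same kind of check. It assumes `curl` and `awk` are available; the function names and the 60 s timeout are illustrative, and the 10 s threshold is the one quoted above, not read from the Datadog configuration.

```shell
# Threshold in seconds, taken from the alert description above.
THRESHOLD_SECONDS=10

# Print the total time (in seconds) curl needs to fetch a URL.
# Requires network access; the 60 s timeout is an arbitrary choice.
measure_response_time() {
  curl --silent --output /dev/null --max-time 60 --write-out '%{time_total}' "$1"
}

# Succeed (exit status 0) when the given duration exceeds the threshold.
is_slow() {
  awk -v t="$1" -v max="$THRESHOLD_SECONDS" 'BEGIN { exit !(t > max) }'
}
```

Possible usage: `t=$(measure_response_time https://updates.jenkins-ci.org) && is_slow "$t" && echo "slow: ${t}s"`.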

Root cause

The (legacy) service referenced as mirrorbrain (hosting the services mirrors.jenkins.io and mirrors.jenkins-ci.org), which is also hosted on this VM, is causing peaks of CPU usage that slow down the other service, updates.jenkins.io.

Details on the configuration as code (Puppet configuration audit trail):

- VM definition: https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122
- This VM has the role `mirrorbrain`: https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/role/manifests/mirrorbrain.pp#L4-L7
- This role is composed of 4 profiles:
  - `base` (as all VMs managed by Puppet): https://github.com/jenkins-infra/jenkins-infra/blob/7c6d6609b650f1ef209cd590dd4568bcc676514c/dist/profile/manifests/base.pp
  - `mirrorbrain` (which defines `mirrors.jenkins*` services, that we want to sunset): [mirrorbrain](https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/mirrorbrain.pp)
  - `updatesite` (which defines the update center site, causing alerts because slowed down): https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/updatesite.pp
  - `pkgrepo` (used to build and host the Jenkins packages for debian/centos/etc. to be replaced later but not part of this issue: keep it for now): https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122

Proposal

Let's sunset the legacy service mirrorbrain in favor of the current get.jenkins.io modern mirror service based on mirrorbits!

Rationale:

Details about the mirrorbits service:

- Mirrorbits Helm Chart: https://github.com/jenkins-infra/helm-charts/tree/main/charts/mirrorbits
- Configuration of get.jenkins.io (installation of the mirrorbits chart):
  - Helmfile Manifest at https://github.com/jenkins-infra/kubernetes-management/blob/main/clusters/prodpublick8s.yaml#L180-L189
  - Custom values: https://github.com/jenkins-infra/kubernetes-management/blob/main/config/mirrorbits.yaml

In order to NOT break end users' installations, the domains mirrors.jenkins.io and mirrors.jenkins-ci.org should be CNAMEs to the new mirrorbits system.
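Once the DNS change is done, the CNAME records could be verified with something like the sketch below. It assumes `dig` is installed and that the CNAME target is get.jenkins.io (as suggested by the proposal); the function names are made up for illustration.

```shell
# Assumed CNAME target for the legacy domains.
EXPECTED_TARGET="get.jenkins.io"

# Normalize a DNS name: lowercase, no trailing dot.
normalize_name() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Succeed when the CNAME answer ($1) matches the expected target.
cname_matches() {
  [ "$(normalize_name "$1")" = "$(normalize_name "$EXPECTED_TARGET")" ]
}

# Look up the CNAME of a domain (needs network access and `dig`).
lookup_cname() {
  dig +short CNAME "$1"
}
```

Possible usage: `cname_matches "$(lookup_cname mirrors.jenkins.io)" && echo "mirrors.jenkins.io OK"`, and the same for mirrors.jenkins-ci.org.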

Known usages of the legacy mirror system

To Do List

dduportal commented 2 years ago

Ping @daniel-beck @MarkEWaite @olblak @lemeurherve @timja @halkeye @jnord @jglick for info, review and advice (in case I forgot anything).

halkeye commented 2 years ago

Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code:

evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

Jenkins users that are not able to use HTTPS

Are they still able to? Or will we be killing that access path?

olblak commented 2 years ago

"Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)"

What do you think about just deprecating this DNS record? Officially it's not used anymore. Or use it as the k8s cluster fallback: you would cleanly deploy mirrorbits on that machine (pkg.jenkins.io), so if something goes wrong with the k8s cluster, you still have it working.

By the way, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime.

dduportal commented 2 years ago

evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways

Thanks for the tip! It confirms that what we did in #2040 was correct. For information, https://github.com/jenkins-infra/evergreen is marked as an "archived" repository.

Jenkins users that are not able to use HTTPS: are they still able to? Or will we be killing that access path?

They are still able to, and yes, we'll kill this access path, since the change implies forcing a redirect to HTTPS.

If mirrors.jenkins.io or mirrors.jenkins-ci.org is used to download any file (war, plugin, or package), then it is only over HTTP: there is no vhost for these domains at all and no certificates, so HTTPS requests default to https://pkg.origin.jenkins.io/ (with an expected TLS security alert for domain mismatch).

What do you think about just deprecating this DNS record?

Thanks for the tip! You know that I like deleting things ;) But it might be a bit too harsh to kill this domain. Using a CNAME to get.jenkins.io would allow a smooth transition. Once we have tracked down as many usages as we can (such as code in the jenkinsci GH org) and switched them to get.jenkins.io, we can track access for 2-3 months to see what usage remains, and maybe decide to kill it at that time.

By the way, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime.

Good reminder! That will be the next subject. The current get.jenkins.io, which is Kubernetes cluster wide, is still more available than mirrorbrain on its lone VM. I don't know about response time though. So once mirrorbrain is killed, we'll check the fallback solution for DRS of the Kubernetes cluster.

dduportal commented 2 years ago

Opened the PR https://github.com/jenkins-infra/pipeline-library/pull/374 in the shared library + notified with an email on the dev mailing list https://groups.google.com/g/jenkinsci-dev/c/anTCx9Q6mLI

dduportal commented 2 years ago

Thanks @MarkEWaite and @timja for https://github.com/jenkinsci/jep/pull/386 on this area!

dduportal commented 2 years ago

Another PR on the PCT: https://github.com/jenkinsci/plugin-compat-tester/pull/363

dduportal commented 2 years ago

Other references found on the github.com/jenkinsci organization are not worth the changes (README or deprecated projects such as evergreen)

dduportal commented 2 years ago

As per @MarkEWaite's messages in the #jenkins-infra IRC channel:

Been receiving alerts that updates.jenkins.io is slow to respond. The pkg.jenkins.io top output shows postgres heavily loaded. Stopping and restarting Apache in hopes that reduces load. Disc use on the /dev/xvda1 disc is at 87%. Vacuumed the logs from using 4 GB to using 1 GB and it didn't change the disc use percentage at all. We may need to expand the disc on that machine or remove more services.

Opening maintenance window on status.jenkins.io: https://github.com/jenkins-infra/status/pull/157

dduportal commented 2 years ago

Resized the root volume from 1000 to 1200 GB:

The file system was automatically resized:

$ df -hT / # Right after reboot
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/xvda1     ext4  1.2T  811G  323G  72% /

dduportal commented 2 years ago

Failed to change the instance size:

Today, we are using an m4.2xlarge VM (ref. https://aws.amazon.com/ec2/instance-types/). This instance type features 8 vCPUs on 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) or 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors. Its rate is $0.40 per hour (~$295 per month).

$ cat /proc/cpuinfo  | grep Xeon | sort | uniq
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
$ grep -c processor /proc/cpuinfo
8

The idea was to try to migrate to a new instance size that would benefit from:

Check the following table to compare instance types, with the following rules:

| Instance Type | CPU Family | vCPUs | Memory (GiB) | Network Bandwidth | EBS Bandwidth | Hourly Rate (on-demand) |
|---|---|---|---|---|---|---|
| m4.2xlarge (Current) | Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz | 8 | 32 | Up to 10 Gbps | 1,000 Mbps | $0.40 |
| m5.2xlarge | 3.1 GHz Intel Xeon® Platinum 8175M | 8 | 32 | Up to 10 Gbps | Up to 4,750 Mbps | $0.384 |
| c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 10,000 Mbps | $0.34 |
| m5a.2xlarge | AMD EPYC 7000 series 2.5 GHz | 8 | 32 | Up to 10 Gbps | Up to 2,880 Mbps | $0.344 |
| c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 6,600 Mbps | $0.34 |
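To make the price comparison easier to read, the hourly rates can be turned into rough monthly figures. This is only a sketch: it assumes an average month of 730 hours (8760 / 12), which lands close to the ~$295 quoted above for the current instance, and the function name is illustrative.

```shell
# Hourly on-demand rates copied from the table above (USD).
# The first line is the current instance type and serves as the baseline.
rates='m4.2xlarge 0.40
m5.2xlarge 0.384
c6i.2xlarge 0.34
m5a.2xlarge 0.344'

# Print the approximate monthly cost of each instance type and the
# difference with the baseline (assumed 730 hours per month).
monthly_costs() {
  printf '%s\n' "$rates" | awk -v hours=730 '
    NR == 1 { base = $2 * hours }
    {
      cost = $2 * hours
      printf "%-12s %7.2f USD/month (%+.2f vs current)\n", $1, cost, cost - base
    }'
}

monthly_costs
```

The differences are small compared to the engineering time involved, so the CPU generation and EBS bandwidth columns matter more than the rate here.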

Alas, each attempt to change the instance type ended with the error message "configuration not documented" when starting the instance.

Enabling the "Enhanced Networking Adapter" did not change anything (but it is enabled now):

$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[]
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 --ena-support --region us-east-1
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[
  true
]

Let's keep this instance size for now: the AMI snapshot could be used to try creating a new instance, but we had better put our effort into https://github.com/jenkins-infra/helpdesk/issues/2649

dduportal commented 2 years ago

While looking for a short-term workaround for the high CPU usage on this machine, I stumbled across the following error message in the Apache error logs:

AH00632: failed to prepare SQL statements: ERROR:  relation "pfx2asn" does not exist\nLINE 1: ...EPARE asn_dbd_1 (varchar) AS SELECT pfx, asn FROM pfx2asn WH...\n

This error is related to the mirrorbrain installation:

But this machine is a mess: there were 3 different PostgreSQL server installations, each one on a different port:

Since this VM has not been managed by Puppet for some time, the following operations were done manually:

# Ensure postgresql 10 is installed properly
$ apt-get -y install postgresql-10
$ dpkg --get-selections | grep postgresql # Sanity check

# Migrate the actual 9.3 cluster named `main` to version 10 with the same name
$ pg_lsclusters
$ pg_renamecluster 10 main main_ver10
$ pg_lsclusters # Sanity check
$ systemctl stop postgresql@9.3-main.service 
$ pg_upgradecluster 9.3 main # Restarts the instance once done
$ pg_lsclusters # Sanity check
## Cleanup
$ pg_dropcluster --stop 9.3 main
$ pg_dropcluster --stop 10 main_ver10
$ pg_dropcluster --stop 9.5 main
$ apt-get remove --purge postgresql-9.3 postgresql-client-9.3 postgresql-9.5 postgresql-client-9.5
$ dpkg --get-selections | grep postgresql # Sanity check
# Ensure ip4r is installed properly
$ apt-get -y install postgresql-contrib postgresql-10-ip4r

# Create extension in the pgsql instance, as Pg superuser
$ su - postgres
$ psql # Top-level
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
$ psql --dbname=jenkins_mirrorbrain_db # On the mirrorbrain database
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q

# Load the ASN script, now that the primitive type `iprange` is provided by the ip4r extension
$ psql --host=localhost --username=jenkins_mirrorbrain --password --dbname=jenkins_mirrorbrain_db --file=/usr/share/doc/libapache2-mod-asn/asn.sql
password: <redacted>

# Ensure everything is loaded and available
$ apt update && apt-get dist-upgrade && apt-get autoremove --purge && update-grub && reboot
$ tail -f /var/log/apache2/*log

dduportal commented 2 years ago

Another error in the Apache log, but no solution for now:

[Sat May 07 10:47:48.548369 2022] [mpm_event:error] [pid 1651:tid 140147096673216] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

Sounds related to https://www.claudiokuenzler.com/blog/948/apache-2.4-mpm-event-bug-freezing-up-scoreboard-full-after-reload (yes, we are using MPM event, and /server-status shows a lot of Apache threads in the G state for a long time).

To help in this area, installed sysstat to provide finer-grained metrics:
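To keep an eye on the number of threads stuck in the G ("gracefully finishing") state, the scoreboard can be counted from the machine-readable status page. A minimal sketch, assuming mod_status is reachable at http://localhost/server-status?auto (that URL is an assumption, adjust to the actual vhost); the function name is illustrative.

```shell
# Count workers in a given scoreboard state (e.g. G, W, _) from the
# output of `server-status?auto` read on stdin.
count_workers_in_state() {
  sed -n 's/^Scoreboard: //p' | tr -cd "$1" | wc -c | tr -d ' '
}
```

Possible usage: `curl --silent 'http://localhost/server-status?auto' | count_workers_in_state G`.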

$ apt-get update -q && apt-get install -y sysstat
$ vi /etc/default/sysstat # changed `ENABLED` to `true`
$ vi /etc/cron.d/sysstat # changed to collection every 2 min
$ systemctl enable sysstat
$ systemctl start sysstat

It appears that CPU usage peaks on %system when the slowness occurs:

10:00:01 AM     all      8.87      0.00      3.10      0.06      0.12     87.86
10:02:01 AM     all     28.04      0.00      4.75      0.27      0.14     66.79
10:04:01 AM     all     26.21      0.00      4.87      0.15      0.21     68.55
10:06:01 AM     all     33.32      0.00     12.11      0.13      1.68     52.77
10:08:01 AM     all     30.51      0.00     11.64      0.08      1.68     56.08
10:10:01 AM     all     27.46      0.00     13.96      0.05      1.72     56.81
10:12:01 AM     all     30.69      0.00     13.89      0.11      1.66     53.66
10:14:01 AM     all     30.90      0.00     11.48      0.11      1.69     55.82
10:16:01 AM     all     27.94      0.00     13.86      0.08      1.71     56.41
10:18:01 AM     all     29.40      0.00     14.48      0.07      1.66     54.39
10:20:01 AM     all     27.84      0.00     13.03      0.06      1.72     57.35
10:22:01 AM     all     23.33      0.00      4.35      0.14      0.23     71.96
10:24:01 AM     all     21.31      0.00      3.50      0.06      0.11     75.01
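The samples above can be filtered automatically to spot the slow windows. A small sketch, assuming the default `sar -u` column order with a 12-hour clock (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle); the function name and the default threshold of 10% are illustrative.

```shell
# Print the timestamp and %system of every "all CPUs" sample whose
# %system value is above the given threshold (default: 10).
# Assumed column layout: time AM/PM CPU %user %nice %system %iowait %steal %idle
high_system_samples() {
  awk -v max="${1:-10}" '$3 == "all" && $6 + 0 > max { print $1, $2, $6 }'
}
```

Possible usage: `sar -u | high_system_samples 10`. On the data above this would select the samples from 10:06 to 10:20, which is also the window where the %steal column rises.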

We might check the configuration history:

dduportal commented 2 years ago

Let's see how the machine behaves with the postgresql + ip4r fix.

dduportal commented 2 years ago

Merged the PR on the pipeline library: let's monitor the upcoming Jenkins core, ATH and bom builds.

dduportal commented 2 years ago

Starting maintenance on the VM:

In parallel, https://github.com/jenkins-infra/status/pull/166 was opened to prepare puppet so we can put this machine under automatic puppet management again.

dduportal commented 2 years ago

Ran the following command on the VM (after taking a snapshot and backing up the PostgreSQL data):

apt-get remove --purge postgresql-10 postgresql-10-ip4r postgresql-client-10 postgresql-client-common postgresql-common postgresql-contrib mirmon mirrorbrain mirrorbrain-scanner mirrorbrain-tools
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  formencode-i18n libalgorithm-c3-perl libauthen-sasl-perl libb-hooks-endofscope-perl libclass-c3-perl libclass-c3-xs-perl libclass-data-inheritable-perl
  libclass-inspector-perl libclass-method-modifiers-perl libclass-singleton-perl libconfig-inifiles-perl libdata-dump-perl libdata-optlist-perl
  libdatetime-locale-perl libdatetime-perl libdatetime-timezone-perl libdbd-pg-perl libdbi-perl libdevel-caller-perl libdevel-lexalias-perl
  libdevel-stacktrace-perl libdigest-md4-perl libencode-locale-perl libeval-closure-perl libexception-class-perl libfile-listing-perl libfile-sharedir-perl
  libfont-afm-perl libhtml-form-perl libhtml-format-perl libhtml-parser-perl libhtml-tagset-perl libhtml-tree-perl libhttp-cookies-perl libhttp-daemon-perl
  libhttp-date-perl libhttp-message-perl libhttp-negotiate-perl libio-html-perl libio-socket-inet6-perl libio-socket-ssl-perl liblwp-mediatypes-perl
  liblwp-protocol-https-perl libmailtools-perl libmodule-implementation-perl libmro-compat-perl libnamespace-autoclean-perl libnamespace-clean-perl
  libnet-http-perl libnet-smtp-ssl-perl libnet-ssleay-perl libpackage-stash-perl libpackage-stash-xs-perl libpadwalker-perl libparams-util-perl
  libparams-validationcompiler-perl libreadonly-perl libref-util-perl libref-util-xs-perl librole-tiny-perl libsocket6-perl libspecio-perl
  libsub-exporter-perl libsub-exporter-progressive-perl libsub-identify-perl libsub-install-perl libsub-quote-perl libtry-tiny-perl liburi-perl
  libvariable-magic-perl libwww-perl libwww-robotrules-perl perl-openssl-defaults python-cmdln python-dnspython python-formencode python-mb
  python-pkg-resources python-pydispatch python-sqlobject python3-dnspython python3-formencode python3-pydispatch python3-sqlobject sqlobject-admin
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  mirmon* mirrorbrain* mirrorbrain-scanner* mirrorbrain-tools* postgresql-10* postgresql-10-ip4r* postgresql-client-10* postgresql-client-common*
  postgresql-common* postgresql-contrib*
0 upgraded, 0 newly installed, 10 to remove and 3 not upgraded.
After this operation, 20.9 MB disk space will be freed.

followed by the autoremove.

Also, manually removed all the Apache vhost configurations (after backing them up) related to the domains mirrors.jenkins or get.jenkins.io.

There is still some Apache config to clean up.

dduportal commented 2 years ago

https://github.com/jenkins-infra/status/pull/167

dduportal commented 2 years ago

Just in case: backups of the apache2 etc and var directories are in /root if anything breaks, and there is a snapshot of the VM root volume in AWS.

dduportal commented 2 years ago

Summary of the past days:

A lot of people helped, and I'm really grateful for it!

Next step:

dduportal commented 2 years ago

Yet another incident due to this issue: https://github.com/jenkins-infra/helpdesk/issues/2960

dduportal commented 2 years ago

Closing as the incidents seem to be gone (all of them).