Closed dduportal closed 2 years ago
Ping @daniel-beck @MarkEWaite @olblak @lemeurherve @timja @halkeye @jnord @jglick for info, review and advice (in case I forgot anything)
Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code
evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways
Jenkins users that are not able to use HTTPS
Are they still able to? Or will we be killing that access path?
"Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)"
What do you think about just deprecating this DNS record? Officially it's not used anymore. Alternatively, use it as the k8s cluster fallback: you would cleanly deploy mirrorbits on that machine (pkg.jenkins.io), so if something goes wrong with the k8s cluster, you still have it working.
By the way, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime.
evergreen plugin should be archived, the rest of the usages are pretty much documentation anyways
Thanks for the tip! It confirms that what we did in #2040 was correct. For information, https://github.com/jenkins-infra/evergreen is marked as an "archived" repository.
Jenkins users that are not able to use HTTPS: are they still able to? Or will we be killing that access path?
They are still able to, and we'll kill this access path, as it implies forcing a redirect to HTTPS.
If mirrors.jenkins.io or mirrors.jenkins-ci.org is used to download any file (war, plugin, or package), then it is HTTP only (there is no vhost for these domains at all, no certificates, and HTTPS defaults to https://pkg.origin.jenkins.io/, with an expected TLS security alert for domain mismatch).
What do you think about just deprecating this DNS record?
Thanks for the tip! You know that I like deleting things ;) But it might be a bit too harsh to kill this domain. Using a CNAME to get.jenkins.io would allow a smooth transition. Once we have tracked as many usages as we can (such as code in the jenkinsci GH org) and switched them to get.jenkins.io, we can track access for 2-3 months to see what usage remains, and maybe decide to kill it at that time.
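One way to confirm which kind of record a domain currently serves is to look at the first line returned by `dig +short`. The sketch below is only an illustration: the helper name `record_kind` is hypothetical, and it assumes that `dig +short` prints alias targets with a trailing dot and A record values as plain IPv4 addresses.

```shell
#!/usr/bin/env bash
# Classify one line of `dig +short` output (hypothetical helper).
# A name ending in "." is treated as an alias (CNAME) target,
# a string made only of digits and dots as an A record value.
record_kind() {
  case "$1" in
    *.) echo "CNAME" ;;
    *[!0-9.]*|"") echo "unknown" ;;
    *) echo "A" ;;
  esac
}

record_kind "get.jenkins.io."
record_kind "52.202.51.185"

# Against live DNS (needs network access and the `dig` tool):
#   dig +short mirrors.jenkins.io | head -n 1 | while read -r line; do record_kind "$line"; done
```

Before the switch, the first line for the legacy domain would be expected to classify as `A`, and after the switch as `CNAME`.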
By the way, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime.
Good reminder! That will be the next subject. The current get.jenkins.io, which is Kubernetes cluster wide, is still more available than the mirrorbrain on its lone VM. I don't know about response time though. So once mirrorbrain is killed, we'll check the fallback solution for DRS of the Kubernetes cluster.
Opened the PR https://github.com/jenkins-infra/pipeline-library/pull/374 in the shared library + notified with an email on the dev mailing list https://groups.google.com/g/jenkinsci-dev/c/anTCx9Q6mLI
Thanks @MarkEWaite and @timja for https://github.com/jenkinsci/jep/pull/386 on this area!
Another PR on the PCT: https://github.com/jenkinsci/plugin-compat-tester/pull/363
Other references found on the github.com/jenkinsci organization are not worth the changes (README or deprecated projects such as evergreen)
As per @MarkEWaite messages in the #jenkins-infra IRC channel:
Been receiving alerts that updates.jenkins.io is slow to respond. The pkg.jenkins.io top output shows postgres heavily loaded. Stopping and restarting Apache in hopes that reduces load. Disc use on the /dev/xvda1 disc is at 87%. Vacuumed the logs from using 4 GB to using 1 GB and didn't change the disc use percentage at all. We may need to expand the disc on that machine or remove more services.
Opening maintenance window on status.jenkins.io: https://github.com/jenkins-infra/status/pull/157
Resized the root volume from 1000 to 1200 GB:
The file system was automatically resized:
$ df -hT / # Right after reboot
Filesystem Type Size Used Avail Use% Mounted on
/dev/xvda1 ext4 1.2T 811G 323G 72% /
Failed to change the instance size:
Today, we are using an m4.2xlarge VM (ref. https://aws.amazon.com/ec2/instance-types/). This instance type features 8 vCPUs on 2.3 GHz Intel Xeon® E5-2686 v4 (Broadwell) processors or 2.4 GHz Intel Xeon® E5-2676 v3 (Haswell) processors. Its rate is $0.40 per hour (~$295 per month).
$ cat /proc/cpuinfo | grep Xeon | sort | uniq
model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
$ grep -c processor /proc/cpuinfo
8
The idea was to try to migrate to a new instance size that would benefit from:
Check the following table to compare instance types, with the following rules:
Instance Type | CPU Family | vCPUs | Memory (GiB) | Network Bandwidth | EBS Bandwidth | Hourly Rate (on-demand) |
---|---|---|---|---|---|---|
m4.2xlarge (Current) | Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz | 8 | 32 | Up to 10 Gbps | 1,000 Mbps | $0.40 |
m5.2xlarge | 3.1 GHz Intel Xeon® Platinum 8175M | 8 | 32 | Up to 10 Gbps | Up to 4,750 Mbps | $0.384 |
c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 10,000 Mbps | $0.34 |
m5a.2xlarge | AMD EPYC 7000 series 2.5 GHz | 8 | 32 | Up to 10 Gbps | Up to 2,880 Mbps | $0.344 |
c6i.2xlarge | 3.5 GHz 3rd generation Intel Xeon | 8 | 16 | Up to 12.5 Gbps | Up to 6,600 Mbps | $0.34 |
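To compare the hourly rates above on a monthly basis, a rough estimate can be computed as hourly rate × hours per month. The sketch below assumes an average month of 730 hours (an assumption, not a figure from this thread), and the helper name `monthly_cost` is hypothetical. The result for the current instance lands close to the ~$295 estimate quoted earlier.

```shell
#!/usr/bin/env bash
# Rough monthly on-demand cost: hourly rate x 730 hours (assumed average month).
monthly_cost() {
  # $1: hourly rate in USD
  LC_ALL=C awk -v rate="$1" 'BEGIN { printf "%.2f\n", rate * 730 }'
}

# Hourly rates copied from the comparison table.
for entry in "m4.2xlarge:0.40" "m5.2xlarge:0.384" "m5a.2xlarge:0.344" "c6i.2xlarge:0.34"; do
  name="${entry%%:*}"   # part before the colon
  rate="${entry##*:}"   # part after the colon
  printf '%-12s $%s/month\n' "$name" "$(monthly_cost "$rate")"
done
```

These figures ignore storage, data transfer, and any reserved-instance discount, so they only help to rank the candidates against each other.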
Alas, each attempt to change the instance type ended up in an error message "configuration not documented" when starting the instance.
Enabling the "Enhanced Networking Adapter" did not change anything (but it is enabled now):
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[]
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 --ena-support --region us-east-1
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[
true
]
Let's keep this instance size for now: the AMI snapshot could be used to try creating a new instance, but we had better put our effort into https://github.com/jenkins-infra/helpdesk/issues/2649
While looking for a short-term workaround for the high CPU usage on this machine, I stumbled across the following error message in the Apache error logs:
AH00632: failed to prepare SQL statements: ERROR: relation "pfx2asn" does not exist\nLINE 1: ...EPARE asn_dbd_1 (varchar) AS SELECT pfx, asn FROM pfx2asn WH...\n
This error is related to the mirrorbrain installation:
postgresql-*-ip4r
But this machine is a mess: there were 3 different postgresql server installations, each one on a different port:
apt-get dist-upgrade
operations.
Since this VM has not been managed by puppet for some time, the following operations were done manually:
postgresql-10-ip4r
# Ensure postgresql 10 is installed properly
$ apt-get -y install postgresql-10
$ dpkg --get-selections | grep postgresql # Sanity check
# Migrate the actual 9.3 cluster named `main` to version 10 with the same name
$ pg_lsclusters
$ pg_renamecluster 10 main main_ver10
$ pg_lsclusters # Sanity check
$ systemctl stop postgresql@9.3-main.service
$ pg_upgradecluster 9.3 main # Restarts the instance once done
$ pg_lsclusters # Sanity check
## Cleanup
$ pg_dropcluster --stop 9.3 main
$ pg_dropcluster --stop 10 main_ver10
$ pg_dropcluster --stop 9.5 main
$ apt-get remove --purge postgresql-9.3 postgresql-client-9.3 postgresql-9.5 postgresql-client-9.5
$ dpkg --get-selections | grep postgresql # Sanity check
# Ensure ip4r is installed properly
$ apt-get -y install postgresql-contrib postgresql-10-ip4r
# Create extension in the pgsql instance, as Pg superuser
$ su - postgres
$ psql # Top-level
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
$ psql --dbname=jenkins_mirrorbrain_db # On the mirrorbrain database
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
# Load the ASN script, now that the primitive type `iprange` is provided by the ip4r extension
$ psql --host=localhost --username=jenkins_mirrorbrain --password --dbname=jenkins_mirrorbrain_db --file=/usr/share/doc/libapache2-mod-asn/asn.sql
password: <redacted>
# Ensure everything is loaded and available
$ apt update && apt-get dist-upgrade && apt-get autoremove --purge && update-grub && reboot
$ tail -f /var/log/apache2/*log
Another error in the Apache log, but no solution for now:
[Sat May 07 10:47:48.548369 2022] [mpm_event:error] [pid 1651:tid 140147096673216] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
Sounds related to https://www.claudiokuenzler.com/blog/948/apache-2.4-mpm-event-bug-freezing-up-scoreboard-full-after-reload (yes, we are using MPM event, and the /server-status page shows a lot of Apache threads in a G state for a long time).
In order to help in this area, installed sysstat to provide finer-grained metrics:
$ apt-get update -q && apt-get install -y sysstat
$ vi /etc/default/sysstat # changed `ENABLED` to `true`
$ vi /etc/cron.d/sysstat # changed to collection every 2 min
$ systemctl enable sysstat
$ systemctl start sysstat
It appears that there are peaks of CPU on %system when the slowness appears:
10:00:01 AM all 8.87 0.00 3.10 0.06 0.12 87.86
10:02:01 AM all 28.04 0.00 4.75 0.27 0.14 66.79
10:04:01 AM all 26.21 0.00 4.87 0.15 0.21 68.55
10:06:01 AM all 33.32 0.00 12.11 0.13 1.68 52.77
10:08:01 AM all 30.51 0.00 11.64 0.08 1.68 56.08
10:10:01 AM all 27.46 0.00 13.96 0.05 1.72 56.81
10:12:01 AM all 30.69 0.00 13.89 0.11 1.66 53.66
10:14:01 AM all 30.90 0.00 11.48 0.11 1.69 55.82
10:16:01 AM all 27.94 0.00 13.86 0.08 1.71 56.41
10:18:01 AM all 29.40 0.00 14.48 0.07 1.66 54.39
10:20:01 AM all 27.84 0.00 13.03 0.06 1.72 57.35
10:22:01 AM all 23.33 0.00 4.35 0.14 0.23 71.96
10:24:01 AM all 21.31 0.00 3.50 0.06 0.11 75.01
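A small filter can pick out the intervals with elevated %system from output like the above. The sketch below assumes the default `sar -u` column order after the CPU field (%user, %nice, %system, %iowait, %steal, %idle). The helper name `high_system` and the 10% threshold are illustrative choices, not values from this thread.

```shell
#!/usr/bin/env bash
# Print the timestamp and %system value of every sar interval above a threshold.
# Assumed field layout: time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle
high_system() {
  # $1: threshold in percent; sar lines are read from stdin
  LC_ALL=C awk -v limit="$1" '$3 == "all" && $6 + 0 > limit + 0 { print $1, $2, $6 }'
}

# A few lines copied from the sample above.
high_system 10 <<'EOF'
10:00:01 AM all 8.87 0.00 3.10 0.06 0.12 87.86
10:04:01 AM all 26.21 0.00 4.87 0.15 0.21 68.55
10:06:01 AM all 33.32 0.00 12.11 0.13 1.68 52.77
10:22:01 AM all 23.33 0.00 4.35 0.14 0.23 71.96
EOF
```

On these four lines, only the 10:06:01 interval exceeds the threshold. On a live machine the same filter could be fed from `sar -u`, assuming the time format includes the AM/PM field as it does here.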
We might check the configuration history:
Let's see how the machine behaves with the postgresql + ip4r fix.
Merged the PR on the pipeline library: let's monitor the upcoming Jenkins core, ATH and bom builds.
- mirrors.jenkins.io changed from IN A 52.202.51.185 to CNAME get.jenkins.io. (TTL 1 min) today at ~08:10am UTC
- updates.jenkins.io was a CNAME to mirrors.jenkins.io (reported in IRC around ~08:47am in the gitter channel jenkins/jenkins by a user)
- updates.jenkins.io changed to IN A 52.202.51.185 around 09:00am UTC, and its TTL was changed from 1 hour to 1 minute

Starting maintenance on the VM:
In parallel, https://github.com/jenkins-infra/status/pull/166 was opened to prepare puppet so we can put this machine under automatic puppet management again.
Ran the following command on the VM (after taking a snapshot and backing up the postgres data):
apt-get remove --purge postgresql-10 postgresql-10-ip4r postgresql-client-10 postgresql-client-common postgresql-common postgresql-contrib mirmon mirrorbrain mirrorbrain-scanner mirrorbrain-tools
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
formencode-i18n libalgorithm-c3-perl libauthen-sasl-perl libb-hooks-endofscope-perl libclass-c3-perl libclass-c3-xs-perl libclass-data-inheritable-perl
libclass-inspector-perl libclass-method-modifiers-perl libclass-singleton-perl libconfig-inifiles-perl libdata-dump-perl libdata-optlist-perl
libdatetime-locale-perl libdatetime-perl libdatetime-timezone-perl libdbd-pg-perl libdbi-perl libdevel-caller-perl libdevel-lexalias-perl
libdevel-stacktrace-perl libdigest-md4-perl libencode-locale-perl libeval-closure-perl libexception-class-perl libfile-listing-perl libfile-sharedir-perl
libfont-afm-perl libhtml-form-perl libhtml-format-perl libhtml-parser-perl libhtml-tagset-perl libhtml-tree-perl libhttp-cookies-perl libhttp-daemon-perl
libhttp-date-perl libhttp-message-perl libhttp-negotiate-perl libio-html-perl libio-socket-inet6-perl libio-socket-ssl-perl liblwp-mediatypes-perl
liblwp-protocol-https-perl libmailtools-perl libmodule-implementation-perl libmro-compat-perl libnamespace-autoclean-perl libnamespace-clean-perl
libnet-http-perl libnet-smtp-ssl-perl libnet-ssleay-perl libpackage-stash-perl libpackage-stash-xs-perl libpadwalker-perl libparams-util-perl
libparams-validationcompiler-perl libreadonly-perl libref-util-perl libref-util-xs-perl librole-tiny-perl libsocket6-perl libspecio-perl
libsub-exporter-perl libsub-exporter-progressive-perl libsub-identify-perl libsub-install-perl libsub-quote-perl libtry-tiny-perl liburi-perl
libvariable-magic-perl libwww-perl libwww-robotrules-perl perl-openssl-defaults python-cmdln python-dnspython python-formencode python-mb
python-pkg-resources python-pydispatch python-sqlobject python3-dnspython python3-formencode python3-pydispatch python3-sqlobject sqlobject-admin
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
mirmon* mirrorbrain* mirrorbrain-scanner* mirrorbrain-tools* postgresql-10* postgresql-10-ip4r* postgresql-client-10* postgresql-client-common*
postgresql-common* postgresql-contrib*
0 upgraded, 0 newly installed, 10 to remove and 3 not upgraded.
After this operation, 20.9 MB disk space will be freed.
followed by the autoremove.
Also, manually removed all the Apache vhost configurations (after backing them up) related to the domains mirrors.jenkins or get.jenkins.io.
Still some Apache config to clean up
pkg
(with the whole puppet certificate regeneration)
Just in case: backups of apache2 etc and var are in /root if anything breaks, and there is a snapshot of the VM root volume in AWS.
Summary of the past days:
A lot of people helped, and I'm really glad for it!
Next step:
Yet another incident due to this issue: https://github.com/jenkins-infra/helpdesk/issues/2960
Closing as the incidents seem to be gone (all of them).
Service(s)
Update center, Other
Summary
What Happened
For the past 4 weeks, the infra team has been receiving the following PagerDuty alert multiple times a day:
Weird Response time https://updates.jenkins-ci.org
The alerts are triggered by a threshold in the datadog metrics collection for this service: https://github.com/jenkins-infra/docker-datadog/blob/main/conf.d/http_check.d/jenkins.yaml#L137-L148. As shown in the screenshots, it means that the average HTTP response time has increased past 10s when the alert is triggered.
Most of the time, the alert acknowledges itself as the response time decreases. Sometimes, the person on duty (@MarkEWaite or I) has to SSH to the machine pkg.origin.jenkins.io and restart the Apache server (rebooting the machine would be the last option).
Root cause
The (legacy) service referenced as mirrorbrain (hosting the services mirrors.jenkins.io and mirrors.jenkins-ci.org), also hosted on this VM, is causing a peak of CPU usage which slows down the other service updates.jenkins.io.
Details on the configuration as code:
Puppet configuration audit trail:
- VM definition: https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122
- This VM has the role `mirrorbrain`: https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/role/manifests/mirrorbrain.pp#L4-L7
- This role is composed of 4 profiles:
  - `base` (as all VMs managed by Puppet): https://github.com/jenkins-infra/jenkins-infra/blob/7c6d6609b650f1ef209cd590dd4568bcc676514c/dist/profile/manifests/base.pp
  - `mirrorbrain` (which defines `mirrors.jenkins*` services, that we want to sunset): [mirrorbrain](https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/mirrorbrain.pp)
  - `updatesite` (which defines the update center site, causing alerts because slowed down): https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/updatesite.pp
  - `pkgrepo` (used to build and host the Jenkins packages for debian/centos/etc. to be replaced later but not part of this issue: keep it for now) - https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122

Proposal
Let's sunset the legacy service mirrorbrain in favor of the current get.jenkins.io modern mirror service based on mirrorbits!
Rationale:
- mirrorbits defaults to HTTPS, while mirrorbrain only supports plain old HTTP
- mirrorbits can scale horizontally and efficiently (redis database, hosted in Kubernetes) and is updated regularly and automatically

Details about the mirrorbits service:
- Mirrorbits Helm Chart: https://github.com/jenkins-infra/helm-charts/tree/main/charts/mirrorbits
- Configuration of get.jenkins.io (installation of the mirrorbits chart):
  - Helmfile Manifest at https://github.com/jenkins-infra/kubernetes-management/blob/main/clusters/prodpublick8s.yaml#L180-L189
  - Custom values: https://github.com/jenkins-infra/kubernetes-management/blob/main/config/mirrorbits.yaml

In order to NOT break end-users' installations, the domains mirrors.jenkins.io and mirrors.jenkins-ci.org should be CNAMEs to the new mirrorbits system.
Known usages of the legacy mirror system
- Code in the GitHub organization jenkinsci (pipeline, scripts, docs) - https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code

To Do List
- Add ingresses for the domains mirror.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)