nodejs / build

Better build and test infra for Node.

Read-only file system on release CI #2626

Closed richardlau closed 2 years ago

richardlau commented 3 years ago

After landing https://github.com/nodejs/nodejs-dist-indexer/pull/15, I checked https://nodejs.org/download/nightly/ to see if I'd missed the nightly, and noticed that we haven't had a nightly build since 17 April even though I know the master branch of the core repo has been updated since then.

From https://ci-release.nodejs.org/log/all

Apr 19, 2021 1:00:08 AM WARNING jenkins.model.lazy.LazyBuildMixIn newBuild
A new build could not be created in job iojs+release
java.io.IOException: Read-only file system
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createTempFile(File.java:2024)
    at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:142)
Caused: java.io.IOException: Failed to create a temporary file in /var/lib/jenkins/jobs/iojs+release
    at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:144)
    at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:109)
    at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:84)
    at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:74)
    at hudson.util.TextFile.write(TextFile.java:116)
    at hudson.model.Job.saveNextBuildNumber(Job.java:283)
    at hudson.model.Job.assignBuildNumber(Job.java:342)
    at hudson.model.Run.<init>(Run.java:322)
    at hudson.model.AbstractBuild.<init>(AbstractBuild.java:166)
    at hudson.matrix.MatrixBuild.<init>(MatrixBuild.java:79)
Caused: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor193.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:181)
    at hudson.model.AbstractProject.newBuild(AbstractProject.java:963)
    at hudson.model.AbstractProject.createExecutable(AbstractProject.java:1139)
    at hudson.model.AbstractProject.createExecutable(AbstractProject.java:138)
    at hudson.model.Executor$1.call(Executor.java:365)
    at hudson.model.Executor$1.call(Executor.java:347)
    at hudson.model.Queue._withLock(Queue.java:1443)
    at hudson.model.Queue.withLock(Queue.java:1304)
    at hudson.model.Executor.run(Executor.java:347)
richardlau commented 3 years ago

cc @nodejs/build-infra

rvagg commented 3 years ago

on it

rvagg commented 3 years ago

hmm, this might be more complicated than just a dodgy filesystem; I'm having to resort to some rescue operations 🤞

rvagg commented 3 years ago

@richardlau do you happen to have the IBM Cloud VPN set up? I think we need to look at the console of the machine; it's in a half-booted state and not accepting SSH connections. I can use the IBM Cloud rescue mode, and everything looks fine from a superficial perspective; I don't see any problems in the logs, so I think watching a boot might be the next step if we can.

Machine is https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details, in the Actions menu there's a "KVM Console" but it needs the VPN.

Alternatively we just take this chance to set up an entirely new 20.04 machine and transfer what settings we can from a rescue boot of this one. I just haven't figured out how to access the additional disk we use for /var/lib/jenkins in rescue mode but maybe we just figure out how to transfer it to a new machine. I probably can't do this today but may be able to allocate a bit of time tomorrow to have a go (it doesn't have to be me if someone else with infra is brave enough).

richardlau commented 3 years ago

I don't currently have IBM Cloud VPN set up. Should probably do so anyway so I'll look into doing that.

I probably can't do this today but may be able to allocate a bit of time tomorrow to have a go (it doesn't have to be me if someone else with infra is brave enough).

This is one of those things I'm scared to touch 😅.

rvagg commented 3 years ago

they don't offer 20.04, but I'll start an 18.04 from scratch and try to transfer boot disk data from rescue mode to start with. I'll let you know here when I stop for the day (soon)

richardlau commented 3 years ago

FWIW I think this was the cause of the read only filesystem:

At 17 April 2021 20:16 UTC, customer back-end traffic in the DAL09 data-center may have started experiencing intermittent network connectivity. At 17 April 2021 20:47 UTC, this intermittent back-end network connectivity cleared. During this period, customer back-end traffic in the DAL09 data-center may have experienced degraded network connectivity in the form of latency and/or packet loss. VSI's hosts may have found that their file-systems went read-only requiring a reboot to restore read/write access. If you are still experiencing any issues please reach out to our support department and reference this event ID.
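
For reference, a generic way to confirm that failure mode from a shell (standard Linux tooling, not commands from this incident):

# Check whether the root filesystem has been remounted read-only and look
# for the I/O errors that usually trigger it (generic sketch, not from logs).
findmnt -no OPTIONS / | grep -Eq '^ro(,|$)' && echo "/ is mounted read-only"
dmesg | grep -iE 'remount.*read-only|i/o error' | tail
# As the IBM notice says, recovery needs a reboot (plus an fsck if prompted).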

rvagg commented 3 years ago

Sounds like it!

I've just made an 18.04 with basic setup, jenkins & java (I haven't bothered looking to see if we have any ansible scripts for this, it'll be way out of date and out of sync anyway!). I have the old machine in rescue mode and I've just done an rsync of / and /boot onto the new machine in /old-ci-release/. This is my inventory entry for the new machine infra-ibm-ubuntu1804-x64-1: {ip: 169.45.166.50}. It's got nodejs_build_infra in it.

I've also managed to get access to the /var/lib/jenkins disk but it's a complete mess. The superblock was borked and an fsck has moved everything to lost+found; we've lost most of the directory structure but retained most of the files. It's going to be impossible to rebuild this I think, but we have access to important pieces if we can find them (with lots of find and grep...).
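
Not the actual commands used, but the kind of sweep this means looks roughly like the following (the mount point of the damaged disk is an assumption):

# Hypothetical sketch: dig recognisable Jenkins pieces out of lost+found
# after fsck has flattened the directory structure. Mount point is assumed.
OLD=/mnt/jenkins-old/lost+found
# files that look like Jenkins job/node configuration
find "$OLD" -type f -name '*.xml' -exec grep -l '<project>\|<slave>\|<hudson>' {} + 2>/dev/null
# big files worth a second look (archives, plugin bundles, keys)
find "$OLD" -type f -size +1M -printf '%10s  %p\n' | sort -rn | head -n 20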

It looks like our backup is active though (thanks to whoever fixed that up when it overflowed last time!). /backup/periodic/daily.0/ci-release.nodejs.org has a most-recent file dated April 17th. I'm currently rsyncing that to the new server as /jenkins-backup/. Hopefully it contains enough of the key pieces to get this all back online properly.
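
For the record, that transfer would look roughly like this run from the backup server (the snapshot path, destination IP, and target directory are the ones mentioned above; the flags are an assumption):

# Sketch of the backup transfer: push the latest daily snapshot of the old
# ci-release Jenkins home onto the new machine as /jenkins-backup/.
rsync -aH --numeric-ids --info=progress2 \
  /backup/periodic/daily.0/ci-release.nodejs.org/ \
  root@169.45.166.50:/jenkins-backup/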

I have to head off for the day and have a busy day ahead of me tomorrow but will try and find some time to hop back in here and continue.

Here are the next steps I think we'll need to take (feel free to try and tackle any/all of them without me!):

rvagg commented 3 years ago

btw, if you want to log into the old machine, it's in rescue mode and will require a password which you can get from https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details#passwords, it's some weird IBM Cloud OS but I have all the disks mounted under /mnt/ (see df).
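
For anyone else hopping on, getting the old disks visible from the rescue OS is roughly the following (device names are assumptions; check lsblk/df for the real layout):

# Hypothetical sketch of mounting the old disks from the rescue OS.
lsblk -f                          # identify the old root and Jenkins disks
mkdir -p /mnt/root /mnt/jenkins
mount /dev/xvda2 /mnt/root        # old root filesystem (device assumed)
mount /dev/xvdc1 /mnt/jenkins     # old /var/lib/jenkins disk (device assumed)
df -h | grep /mnt                 # matches the "see df" note above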

richardlau commented 3 years ago

+1 Thanks for your time @rvagg. I'll see how far I can get through the list and keep this updated. FWIW I've been trying to set up IBM Cloud VPN but it's proving challenging, especially with the not-an-admin restrictions I have on my corporate laptops.

rvagg commented 3 years ago

rsync of /jenkins-backup/ is done now btw

richardlau commented 3 years ago

Allocate a new 300G disk in Dallas 9,

I've requested that, showing as jenkins-release under portable storage, but haven't figured out how to attach it to infra-ibm-ubuntu1804-x64-1.nodejs.cloud.

richardlau commented 3 years ago

I've run iptables-restore /old-ci-release/root/richard-20210311 to apply what I believe is the most recent backed up iptables edit we've done to the release server (it contains, for example, release-macstadium-macos11.0-arm64-1 which is the Apple Silicon release machine). I'm double checking the Joyent IP addresses as that was the other most recent change that I remember.

Update: The new Joyent IPs are in the firewall rules. I got slightly confused as we still have some of the old IP addresses in there, so there's potential for some tidying up later.
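
A minimal sketch of verifying and persisting the restored rules (the restore path is the one quoted above; the persistence mechanism is an assumption about how this box is set up):

# Restore the saved ruleset, spot-check the expected source IPs, and make the
# rules survive a reboot. iptables-persistent is an assumed choice here.
iptables-restore < /old-ci-release/root/richard-20210311
iptables -L -n -v | less                  # spot-check the release machine IPs
apt-get install -y iptables-persistent
iptables-save > /etc/iptables/rules.v4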

richardlau commented 3 years ago

I think I have nginx set up on the new server, copying across config from /old-ci-release/etc/nginx (and updating the apparently deprecated spdy to http2 in /etc/nginx/sites-available/jenkins-iojs). At least nginx.service started with no obvious errors, but without Jenkins started there's nothing to forward to 😄.
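
Roughly what that amounted to (paths are the ones mentioned above; the sed is an assumption about how the spdy-to-http2 edit was made):

# Sketch of the nginx migration: copy the old config over, replace the
# deprecated `spdy` listen parameter with `http2`, then check and restart.
cp -R /old-ci-release/etc/nginx/. /etc/nginx/
sed -i 's/\bspdy\b/http2/' /etc/nginx/sites-available/jenkins-iojs
nginx -t && systemctl restart nginx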

mhdawson commented 3 years ago

Been on with support for the last hour or so, seems like something still needs to be done so that the disk will show up.

Also got the VPN going on one of my machines, but the KVM console does not seem to give me anything, so that is the next thing to ask them about.

mhdawson commented 3 years ago

From the cloud UI, looks like the storage is now attached. Next we need to check it is accessible in the machine itself.

richardlau commented 3 years ago

Looks like it might be /dev/xvdc based on the timestamps:

root@infra-ibm-ubuntu1804-x64-1:~# ls -al /dev/xvd*
brw-rw---- 1 root disk 202,  0 Apr 19 11:08 /dev/xvda
brw-rw---- 1 root disk 202,  1 Apr 19 11:08 /dev/xvda1
brw-rw---- 1 root disk 202,  2 Apr 19 11:08 /dev/xvda2
brw-rw---- 1 root disk 202, 16 Apr 19 11:08 /dev/xvdb
brw-rw---- 1 root disk 202, 17 Apr 19 11:08 /dev/xvdb1
brw-rw---- 1 root disk 202, 32 Apr 19 18:19 /dev/xvdc
root@infra-ibm-ubuntu1804-x64-1:~#

I'll try mounting that via fstab and see what we get.

mhdawson commented 3 years ago

I do have KVM console access now. Looks like the rescue OS; not sure if we want to reboot yet.

richardlau commented 3 years ago

Partitioned /dev/xvdc and formatted /dev/xvdc1 as ext4. Mounted and now copying across /jenkins-backup to /var/lib/jenkins.
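
In terms of commands, that amounts to something like the following (the device is the one identified above; the partitioning tool and fstab details are assumptions):

# Sketch of the disk setup: partition /dev/xvdc, format, mount it as the new
# Jenkins home, and copy the restored backup in.
parted -s /dev/xvdc mklabel gpt mkpart primary ext4 0% 100%
mkfs.ext4 /dev/xvdc1
mkdir -p /var/lib/jenkins
echo '/dev/xvdc1 /var/lib/jenkins ext4 defaults,nofail 0 2' >> /etc/fstab
mount /var/lib/jenkins
rsync -aH /jenkins-backup/ /var/lib/jenkins/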

richardlau commented 3 years ago

Copy is complete. Jenkins started up (I had to restart it as I'd changed the owner/group for the files/directories under /var/lib/jenkins but forgot the /var/lib/jenkins dir itself). I don't think I can test access to it without switching the DNS entry in Cloudflare (going directly to the new server's IP address, 169.45.166.50, times out saying ci-release.nodejs.org is taking too long to respond). Will look at that next.
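
The fix for the ownership gotcha is just to apply the chown to the directory itself as well (the jenkins:jenkins user/group here is an assumption based on a standard package install):

# chown -R on the directory covers both the top-level dir and its contents.
chown -R jenkins:jenkins /var/lib/jenkins
systemctl restart jenkins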

richardlau commented 3 years ago

DNS has been switched. https://ci-release.nodejs.org/ loads (🎉). Executors are mostly offline -- will go around and see if restarting the agent on a few of them is enough to get them connected 🤞 .
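
A quick sanity check on the DNS switch (the expected IP is the one given earlier; whether Cloudflare proxies the record isn't stated, so the answer may be Cloudflare IPs instead):

# Confirm the record now resolves and the site answers over HTTPS.
dig +short ci-release.nodejs.org
curl -sSI https://ci-release.nodejs.org/ | head -n 1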

richardlau commented 3 years ago

The executors/nodes reconnected themselves after a few minutes 😀. Started a test build 🤞: https://ci-release.nodejs.org/job/iojs+release/6834/

richardlau commented 3 years ago

Test build is green. I think we'll be able to release Node.js 16 tomorrow 🤞 😌 (cc @BethGriggs ). Thanks @rvagg and @mhdawson for your help.

I did spot that we're missing a cross compiler for armv7l, but that's related to moving to gcc 8 and not to the server issue covered here. I've added the cross-compiler-ubuntu1804-armv7-gcc-8 label to iojs+release (https://github.com/nodejs/jenkins-config-release/commit/63b1a778614c15f4e1be29e051fffc011c788b59) and started a build to test that: https://ci-release.nodejs.org/job/iojs+release/6835/

Also, the AIX build in https://ci-release.nodejs.org/job/iojs+release/6834/nodes=aix72-ppc64/ seemed to take 49 minutes to scp the binary over to the download server, which seems slow (although it did complete successfully) but shouldn't have anything to do with the Jenkins server. I've restarted the agent on the AIX release machine in any case and will look at the nightlies in the morning to see if it's still an issue.

There's a remaining task on the list about checking the backup server can connect to the new server but I'm kind of beat for the day and am going to log out.

richardlau commented 3 years ago

Updated /etc/crontab to add the backups to https://github.com/nodejs/jenkins-config-release (this is separate from the backup item listed in https://github.com/nodejs/build/issues/2626#issuecomment-822404824).
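
Purely illustrative, since the actual crontab entry isn't shown here: a nightly backup job in /etc/crontab would look something like this (schedule, user, and script path are all assumptions):

# Hypothetical /etc/crontab entry: run a script that commits/pushes the
# Jenkins configuration to the jenkins-config-release repo every night.
# m h dom mon dow user  command
0 3 * * *   root  /usr/local/bin/backup-jenkins-config.sh >> /var/log/jenkins-config-backup.log 2>&1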

richardlau commented 3 years ago

I think we also need to update https://grafana.nodejs.org/ for the new ci-release server, but I have no idea how to do that.

rvagg commented 3 years ago

Wow, fantastic work @richardlau! And in the process we got an upgrade off 16.04 plus an OpenJDK JVM for Jenkins, so a value-added recovery.

ARMv7 is green and it's using GCC 8:

07:46:32 + ccache /opt/raspberrypi/rpi-newer-crosstools/x64-gcc-8.3.0/arm-rpi-linux-gnueabihf/bin/arm-rpi-linux-gnueabihf-gcc -march=armv7-a --version
07:46:32 arm-rpi-linux-gnueabihf-gcc (crosstool-NG 1.24.0-rc3) 8.3.0

and compiling with:

ccache /opt/raspberrypi/rpi-newer-crosstools/x64-gcc-8.3.0/arm-rpi-linux-gnueabihf/bin/arm-rpi-linux-gnueabihf-g++ -march=armv7-a -o 

The grafana setup was done by @jbergstroem; I saw a custom APT source in there for that, but there's probably also some config that needs to be in place as well.

Re Node 16, I don't know if there's time for another RC but it might be worth @BethGriggs running through the motions today if possible to test it all out.

richardlau commented 3 years ago

Another 16 rc was started in https://github.com/nodejs/node/pull/37678#issuecomment-822912152: https://ci-release.nodejs.org/job/iojs+release/6836/

AshCripps commented 3 years ago

I think we also need to update https://grafana.nodejs.org/ for the new ci-release server, but I have no idea how to do that.

I assume it just needs the telegraf agent redeployed/reconfigured?

richardlau commented 3 years ago

I've put the backup ssh key in the authorized_keys on the new server and checked I can ssh into ci-release.nodejs.org from the backup server. I reset the host key for ci-release.nodejs.org on the backup server.
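
For reference, the host-key reset and the connectivity check, run on the backup server (the key path and user are assumptions):

# Drop the stale host key for the rebuilt server and confirm key-based
# login still works from the backup host.
ssh-keygen -R ci-release.nodejs.org
ssh -i ~/.ssh/backup_key root@ci-release.nodejs.org 'hostname && df -h /var/lib/jenkins'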

richardlau commented 3 years ago

Backup looks to have worked. I'll look at the telegraf agent for grafana, which I think is the only remaining thing left.

richardlau commented 3 years ago

Added the telegraf agent to the server and copied the config over from the old ci-release disk:

$ curl -s https://repos.influxdata.com/influxdb.key | sudo apt-key add -
$ source /etc/lsb-release
$ echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
$ apt-get update
$ apt-get install telegraf
$ cp -R /old-ci-release/etc/telegraf/* /etc/telegraf/
$ systemctl restart telegraf.service

(The steps to install the telegraf agent were taken from the history on the CI server and follow https://docs.influxdata.com/telegraf/v1.18/introduction/installation/.)

Grafana now shows stats for ci-release (🎉).

I believe everything's now done. I'll follow up separately with some more docs based on what we had to do to get all of this back up and running.

rvagg commented 3 years ago

Let's leave this open until we've cleaned up old resources - we have the server plus the 300G disk that we need to spin down. Do we have confidence to do this now or should we wait?

mhdawson commented 3 years ago

Maybe we should wait until we've done a release for all of the release lines in case there is some release specific dependency?

richardlau commented 3 years ago

I don't believe there is any release-specific dependency for the release CI server outside of the Jenkins job configuration, as the individual release-* machines and the staging server were unaffected.

I'm reasonably confident that we can clear up the old resources but I can run test builds if desired (I don't think it's necessary).

FWIW in terms of storage I noticed we have an unattached 1000GB disk at Dallas 5 -- anyone have any idea what that is? I'm assuming that if it's unattached it can be deleted? For reference, jenkins is the old 300GB disk, and the unattached jenkins-release-new was @mhdawson's attempt to create portable storage when we initially struggled to attach the new storage (jenkins-release) to the replacement server.

I've also started a "disaster recovery plan" document over in https://github.com/nodejs/build/pull/2634 with pointers to where we have backups/alternative places to recover configuration.

richardlau commented 3 years ago

FWIW in terms of storage I noticed we have an unattached 1000GB at Dallas 5 -- anyone have any idea what that is? I'm assuming if it's unattached it can be deleted?

We appear to only have one server at Dallas 5, but that isn't the current test-softlayer-centos6-x64-2, which is at Washington 7 (the IP matches the one in the inventory).

It appears the one at Dallas 5 was replaced at some point in the past https://github.com/nodejs/build/issues/2480 / https://github.com/nodejs/build/issues/1074. So we can probably get rid of the Dallas 5 server and the unattached storage?

FWIW I also don't think it's a great idea to have both of the test centos6 x64 hosts at the same datacenter with the same cloud provider (i.e. an outage could potentially take out all of our centos6 x64 test hosts) but maybe we can retire centos 6 entirely when Node.js 10 goes End-of-Life at the end of the month (only three days left!).

rvagg commented 3 years ago

I'm pretty confident the unattached 1000GB disk can go, nothing lost there. Mostly these cases of unattached disks are a result of a failure to cleanup after some kind of migration (to a new DC or to resize the disk). We don't use unattached disks anywhere in our storage strategy so if nobody claims it for very-recent use (like the unattached 300GB old jenkins disk) then it can go. rm all the things.

sxa commented 3 years ago

@richardlau Can this be closed now? It's still sitting as a pinned issue so shows up and gives me a minor panic each time I go to the issue list in this repository :-)

richardlau commented 3 years ago

We still need to clear up the old resources, but this definitely doesn't need to remain pinned.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.