nodejs / build

Better build and test infra for Node.

infra: spring software updates #1222

Closed refack closed 4 years ago

refack commented 6 years ago

I'm not sure how to coordinate this, but IMHO we should do some systematic updates to the software on our infra. I'm referring to peripheral software such as Java, slave.jar and git (not OS or compilers). Besides minimizing potential bit-rot, and making us feel better in general, I have an intuition that it is already causing failures and blocking process improvements. For example:

  1. https://github.com/nodejs/build/issues/173#issuecomment-379858648 Jenkins failing to communicate with workers — might be related to a stale slave.jar: on failing machines an old agent was running (see screenshot); after a restart it bumps to a newer version, and the "Log" shows a warning (screenshots).

  2. Old git (I mean 1.8 when the latest is 2.17) doesn't handle sparse checkouts, which degrades the overall performance of the cluster (see screenshot).

  3. My estimate is that on some platforms we also have an outdated sshd with potential security issues (we should also disable plain-text password login where possible, RE: https://github.com/nodejs/build/issues/866)


Since I now have time for such tasks, I'm seeking feedback / pitfalls / warnings, and ideas on how to coordinate such efforts (RE @gibfahn and the Java8 project).
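
For reference, a rough sketch of the kind of version audit I have in mind (hostnames are just examples from our fleet, and this assumes plain SSH access to each worker):

```bash
#!/usr/bin/env bash
# Quick-and-dirty version audit across a couple of workers.
# Hostnames are illustrative; assumes SSH access and that git/java are on PATH.
for host in test-digitalocean-ubuntu1604-x86-1 test-joyent-smartos16-x64-2; do
  echo "== ${host} =="
  ssh "${host}" '
    git --version
    java -version 2>&1 | head -n1
    ssh -V 2>&1   # client version, which usually matches the installed OpenSSH server package
  '
done
```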

refack commented 6 years ago

A machine matching my example above just popped up: test-digitalocean-ubuntu1604-x86-1 went into a remoting perma-fail (see screenshot). Solved by updating slave.jar.
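
For anyone hitting the same thing, the manual fix was roughly the following (the jar path and service name vary per machine, so treat those as assumptions; /jnlpJars/slave.jar is the standard Jenkins URL for the agent jar):

```bash
# On the affected worker; service name and jar location are per-machine assumptions.
sudo systemctl stop jenkins
curl -fsSL -o /home/iojs/slave.jar https://ci.nodejs.org/jnlpJars/slave.jar
sudo systemctl start jenkins
```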

gibfahn commented 6 years ago

I'm not sure how to coordinate this, but IMHO we should do some systematic updates to the software on our infra.

Sounds like a great idea to me.

Besides minimizing potential bit-rot, and making us feel better in general, I have an intuition that it is already causing failures and blocking process improvements. For example: Since I now have time for such tasks, I'm seeking feedback / pitfalls / warnings, and ideas on how to coordinate such efforts (RE @gibfahn and the Java8 project).

In my opinion the biggest source of bitrot is that we don't run our Ansible scripts on the machines regularly, so we can't trust that they'll work on the machines (so we just update things manually because we don't have time, etc.).

My ideal update scenario is a weekly job that runs the scripts against all the machines.

If that is how we want to progress, the first step is to document the list of machines we can't use Ansible on (@rvagg has more info here), either because the scripts haven't been implemented/ported yet, or because the machines can't be updated as it will break custom things we've done to them.
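
As a rough illustration of what I mean by a weekly job (the schedule, inventory path and playbook are placeholders, not something that exists today):

```bash
# Crontab entry on the Ansible control host (or the equivalent as a scheduled Jenkins job).
# Runs every Monday at 03:00; inventory and playbook paths are placeholders.
0 3 * * 1  cd /opt/nodejs-build/ansible && ansible-playbook -i inventory.yml playbooks/jenkins/worker/create.yml >> /var/log/ansible-weekly.log 2>&1
```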

rvagg commented 6 years ago

As per today's meeting, "Error fetching remote repo" errors seem to be fixed by upgrading git on the machines. I did a bunch of that yesterday in https://github.com/nodejs/build/pull/1224, CentOS5 was done about a month ago by manually compiling git (the doc for that is in this repo) and CentOS6 was done yesterday @ https://github.com/nodejs/build/pull/1223.

The other error relates to git, but the stacktrace suggests it's more to do with the remote call mechanism of Jenkins. We've had these errors for a long time and they seem to have been solved variously by: restarting Jenkins, restarting machines, clearing workspaces, upgrading slave.jar and upgrading Java.

As I've already mentioned, I can't solve the error on one of the two smartos16 machines, so I took it offline this week: https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/. The only thing I haven't tried is changing the Java version it uses, but I'm not sure I can even do that on SmartOS.

rvagg commented 6 years ago

Oh and re updating slave.jar, I'd be happy to see that done as part of the init/upstart/systemd scripting. It used to be built into start.sh on the Raspberry Pis and a bunch of other machines, but we've stripped that out of most builds. That requires a bit of work of course, but it wouldn't be hard to deploy.
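
A minimal sketch of how that could look on a systemd machine (the unit name, user and jar path are assumptions and would need to match each machine's existing setup):

```bash
# Add an ExecStartPre drop-in so the agent jar is refreshed from the master on every
# service start, much like start.sh used to do. Unit name and jar path are assumptions.
sudo mkdir -p /etc/systemd/system/jenkins.service.d
sudo tee /etc/systemd/system/jenkins.service.d/update-agent.conf >/dev/null <<'EOF'
[Service]
ExecStartPre=/usr/bin/curl -fsSL -o /home/iojs/slave.jar https://ci.nodejs.org/jnlpJars/slave.jar
EOF
sudo systemctl daemon-reload
sudo systemctl restart jenkins
```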

btw, there is also ansible/playbooks/jenkins/worker/upgrade-jar.yml that you could try using. I haven't used it myself, but it's worth playing with because it could be run across most of our infra.
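
Something along these lines, presumably (untested; the inventory path and host pattern are assumptions about our Ansible layout):

```bash
# Run the existing upgrade-jar playbook against the test workers only.
# Inventory file and --limit pattern are assumptions, not verified.
cd ansible
ansible-playbook -i inventory.yml playbooks/jenkins/worker/upgrade-jar.yml --limit 'test-*'
```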

refack commented 6 years ago

https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/ fixed by restarting slave.jar (the SmartOS incantation is svcadm restart jenkins). Before doing that I checked https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/systemInfo and it still showed Unix slave, version 2.67 🤷‍♂️

Another assumption I had as to the cause of the failures was the ownership of slave.jar: whether it should be root:root or iojs:iojs. For now, on test-joyent-smartos16-x64-2 I didn't chown it, so it's still owned by root; I want to see if it makes any difference.
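
If anyone wants to poke at the same thing, the commands are roughly these (the jar path is an assumption about where the SmartOS workers keep it):

```bash
# On the SmartOS worker: check who owns the agent jar, then bounce the service.
ls -l /home/iojs/slave.jar                 # jar path is an assumption
# chown iojs:iojs /home/iojs/slave.jar     # the variant I'm deliberately not doing yet
svcadm restart jenkins
svcs -l jenkins                            # confirm the service came back online
```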

joaocgreis commented 6 years ago

Oh and re updating slave.jar, I'd be happy to see that done as part of the init/upstart/systemd scripting.

+1, this has been part of the Windows script for a few years now and works great. The only drawback is that this is not straightforward for ci-release because it is locked, but that shouldn't stop us for the test CI.

gdams commented 6 years ago

@joaocgreis I would recommend using the https://adoptopenjdk.net/ Java binaries if we plan to upgrade all of our machines. There is a nice API (https://api.adoptopenjdk.net/README) detailing how it can be used. I'd be happy to work through the playbooks and switch out the Java sections to use this, if everyone is happy with that.
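
Roughly along these lines (a sketch based on my reading of the API README; the exact endpoint and query parameters should be double-checked before we script anything):

```bash
# Fetch the latest OpenJDK 8 (HotSpot) build for Linux x64 via the AdoptOpenJDK API.
# Endpoint and parameters are assumptions taken from the API README, not verified here.
curl -fsSL -o openjdk8.tar.gz \
  'https://api.adoptopenjdk.net/v2/binary/releases/openjdk8?openjdk_impl=hotspot&os=linux&arch=x64&release=latest&type=jdk&heap_size=normal'
sudo mkdir -p /opt/java && sudo tar -xzf openjdk8.tar.gz -C /opt/java   # install path is an assumption
```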

joaocgreis commented 6 years ago

@gdams we started using Oracle Java at some point because it seemed to have better performance than the OpenJDK that was installed on the machines. This was noticeable in the Jenkins server, which is frequently under heavy load, and in the Raspberry Pis. However, this was only one of the things we did at the time and I'm not completely sure it was the cause of the improvement. If you feel sure about OpenJDK performance, I wouldn't object to trying it again (provided @rvagg is ok with that as well).

To be clear, when I mentioned updating slave.jar above, I did not mean updating Java, only the jar file that we run in the workers.

sxa commented 6 years ago

I wouldn't expect it to have different performance characteristics since it's fundamentally the same code. If there are scenarios in which the performance isn't the same, that would be useful for adoptopenjdk to be aware of, so I would be in favour of giving it another shot.

keithc-ca commented 6 years ago

The java code is mostly the same, but there are differences in the VM performance. Check out [1] for some more information about OpenJDK with OpenJ9, including some performance advantages that come with the OpenJ9 VM.

[1] https://www.eclipse.org/openj9/oj9_resources.html

gdams commented 6 years ago

Yes thanks @keithc-ca! It's worth pointing out that you can also fetch OpenJ9 binaries from AdoptOpenJDK! https://adoptopenjdk.net/releases.html?variant=openjdk8-openj9

sxa commented 6 years ago

Yes, OpenJDK+OpenJ9 will have different performance characteristics as @keithc-ca says, but the OpenJDK+HotSpot builds from AdoptOpenJDK should be pretty much the same as Oracle's current ones.

rvagg commented 6 years ago

I have no objections to switching to OpenJDK. I don't know if it buys us anything here, but being able to get on to Java 9 might be helpful, I suppose?

BridgeAR commented 6 years ago

What's the status here? Should this stay open or is this resolved?

github-actions[bot] commented 4 years ago

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.