ros-infrastructure / buildfarm_deployment


Jenkins slave swarm jar not downloading #149

Closed: gavanderhoorn closed this issue 6 years ago

gavanderhoorn commented 7 years ago

I consistently run into the situation where puppet seems to run to completion but none of the slaves come up.

It turned out that swarm-client-1.22-jar-with-dependencies.jar in the jenkins-slave home dir was always 0 bytes. puppet.log also showed problems fetching the file:

master: Error: wget -O /home/jenkins-slave/swarm-client-1.22-jar-with-dependencies.jar http://maven.jenkins-ci.org/content/repositories/releases/org/jenkins-ci/plugins/swarm-client/1.22//swarm-client-1.22-jar-with-dependencies.jar returned 4 instead of one of [0]
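For context, that message is Puppet's standard failure report for an Exec resource whose command exited non-zero; wget's exit status 4 specifically means a network failure. A minimal sketch of the kind of resource that would produce it, where the title and attributes are assumptions (the real resource lives in the upstream jenkins module's slave.pp):

$version    = '1.22'
$client_jar = "/home/jenkins-slave/swarm-client-${version}-jar-with-dependencies.jar"
$source_url = "http://maven.jenkins-ci.org/content/repositories/releases/org/jenkins-ci/plugins/swarm-client/${version}/swarm-client-${version}-jar-with-dependencies.jar"

exec { 'get_swarm_client':
  # wget exits 4 on network failure; Puppet then reports
  # "returned 4 instead of one of [0]".
  command => "wget -O ${client_jar} ${source_url}",
  path    => ['/usr/bin', '/bin'],
  # Caution: wget -O creates the (empty) target file even when the
  # download fails, so guarding on the file's existence can leave a
  # 0-byte jar in place.
  creates => $client_jar,
}

If the real resource guards on the file's existence rather than its contents, that would also explain why the 0-byte jar sticks around once a download has failed.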

Downloading it manually (but from here, not the original URL) made things work again.

This looks like it's related to jenkinsci/puppet-jenkins#507, but that issue is closed and marked (sort of) resolved, since the upstream issue was fixed. The fix does not seem to work for me, though.

gavanderhoorn commented 7 years ago

Just found jenkinsci/puppet-jenkins@cb458643fe9adebd3d8e7c8a54865c48db64fbf9:

Use a up-to-date swarm client URL

which is included in 1.7.0. The Puppetfile for master pins rtyler/jenkins to 1.6.1, which doesn't have it.

Adding

source => 'http://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/1.22',

to the jenkins_slave class stanza seems to work around this, but that only works for version 1.22 of the plugin. I'm uncertain where puppet gets the version it should download.
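For reference, a minimal sketch of what the full stanza might look like, assuming the parameters ultimately pass through to the upstream module's jenkins::slave class; only the source line comes from this workaround, the other values are placeholders:

class { 'jenkins::slave':
  # Placeholder master URL; use the real Jenkins master here.
  masterurl => 'http://jenkins-master.example.com:8080',
  # Must stay in sync with the version hard-coded in the source URL.
  version   => '1.22',
  source    => 'http://repo.jenkins-ci.org/releases/org/jenkins-ci/plugins/swarm-client/1.22',
}

The drawback noted above applies: because the URL hard-codes 1.22, any bump of version also requires updating source.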

jenkinsci/puppet-jenkins/manifests/slave.pp says:

The version of the swarm client code. Must match the plugin version on the master. Typically it's the latest available.

nuclearsandwich commented 7 years ago

I hit this same issue when trying to set up a buildfarm on Ubuntu Trusty. I didn't find a solution and instead forged ahead refactoring the puppet scripts and targeting Ubuntu Xenial (that work is in #146).

As far as I know, the swarm plugin version was determined by the upstream jenkins component module rather than by anything in this repository. I created a script to sync/pin all configured plugin versions to the versions on the current buildfarm, but the swarm plugin is ignored by that script, as it is intended to be managed by the jenkins component module.

I tried to deploy a working trusty buildfarm from these configs for reference before starting the refresh to xenial. While I was able to get something running, it was extremely brittle and required a lot of manual work (updating plugins, manually downloading the swarm jar, and hand-tuning configs) before it was serviceable.

The xenialize branch isn't quite in a recommendable state yet, but if you want to check it out, the corresponding changes to the public configuration are in this fork's xenialize branch.

If you wanted to avoid making so drastic a change, I'm pretty sure you can bump the jenkins puppet module version and see if that unblocks you. I can't recall any incompatible changes I had to deal with when updating it for xenial.

gavanderhoorn commented 7 years ago

@nuclearsandwich wrote:

If you wanted to avoid making so drastic a change, I'm pretty sure you can bump the jenkins puppet module version and see if that unblocks you. I can't recall any incompatible changes I had to deal with when updating it for xenial.

Well, there's #143. I didn't want to start experimenting with that yet.

With the set of changes I have, I can now quite easily deploy a farm, so as much as I've been intrigued/tempted by your Xenial-porting work, I think I'll first get this working :).

nuclearsandwich commented 7 years ago

so as much as I've been intrigued/tempted by your Xenial-porting work, I think I'll first get this working

As I mentioned, we don't yet have a complete farm up and running on Xenial, so I'd definitely recommend sticking with what you've got. That config is being used on a test buildfarm, but it's building new packages, not anything from the existing ROS ecosystem yet. Before it merges, I'm hoping to also add documentation around setting up a buildfarm, as I've been logging the process of spinning mine up for the first time.

gavanderhoorn commented 7 years ago

As I mentioned, we don't yet have a complete farm up and running on Xenial.

I might not have been entirely clear, but I'm sticking with Trusty for now :). I understand that your xenial work is experimental.

gavanderhoorn commented 7 years ago

Will all efforts be focused on the Xenial update btw, or will the Trusty version also still be maintained?

nuclearsandwich commented 7 years ago

But am I to understand that this issue will not be addressed anymore, now that work is underway to get a Xenial farm working?

In short, I don't have plans to implement and test an official solution, but if you're able to find an acceptable workaround, I'd create the new trusty legacy branch early and accept pull requests to it. If the xenial branch were a bit further along and better tested, I'd recommend adopting it straight away. Unfortunately, I don't think it's quite there yet.

I don't believe anyone besides myself at OSRF has used this code to set up a buildfarm since the current build.ros.org was originally provisioned some years ago. If everything worked as-is except for this slave jar URL, I'd be in a better position to offer or accept a concrete solution.

I don't see myself updating the current master branch until we move it to xenial, since getting it operational under trusty wasn't a slam dunk, and a major motivator for moving to xenial is access to up-to-date Java and Docker, both of which are impacting the current buildfarm.

It looks like you have two possible workarounds.

  1. Update the Jenkins puppet module to 1.7.0, which still supports trusty and which I believe still works with the current puppet code (see the Puppetfile sketch below).

  2. Specify an explicit source url for the swarm plugin. You could likely use version 2.0 or 3.4 and be fine, as I've seen both on the upstream buildfarm in recent weeks.

I think my recommendation would be updating the puppet module.
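A sketch of what option 1 would look like, assuming the librarian-puppet/r10k-style Puppetfile pinning mentioned earlier in this thread:

# Puppetfile: bump the pin from 1.6.1 so the module includes
# jenkinsci/puppet-jenkins@cb45864 ("Use a up-to-date swarm client URL").
mod 'rtyler/jenkins', '1.7.0'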

tfoote commented 7 years ago

@nuclearsandwich There are several other instances of the buildfarm being run. I can give you more details in person.

nuclearsandwich commented 7 years ago

@nuclearsandwich There are several other instances of the buildfarm being run. I can give you more details in person.

Sounds good! I'm definitely not trying to make official policy out of my words, just stating what I'm comfortable with and capable of maintaining.

gavanderhoorn commented 7 years ago

For now I've settled on the explicit source url workaround from my earlier comment.

This keeps the versions used the same without running into the issue that the jenkins module cannot download the jar.

@nuclearsandwich: you mention that versions 2.0 and 3.4 are also used on (one of) the buildfarm(s) you guys run: did you run into #143 there? How was that fixed?

nuclearsandwich commented 7 years ago

you mention that versions 2.0 and 3.4 are also used on (one of) the buildfarm(s) you guys run

build.ros.org is currently using 3.4. The executor nodes on that farm use AWS's auto scaling and, I think, just use an image based on a "golden" puppet-configured machine rather than re-running the puppet each time one is spun up. @tfoote and I had an offline conversation about it, and I can't recall whether we're currently using an AWS Jenkins plugin but wanted more from it, or whether we're using AWS's built-in scaling based on CPU load and haven't incorporated the plugin because it doesn't add the things we want.

did you run into #143 there? How was that fixed?

I don't have an explicit answer, as the current buildfarm predates me. What follows is conjecture:

Based on the node names on http://build.ros.org/, it looks like the names of the auto-scaling swarm clients are left at the default AWS hostname. If I had to guess, I'd say that the plugins have been updated on build.ros.org out of sync with the puppet code there, but I don't have console access to those machines at the moment to verify.

I noticed the unique suffix on node names when testing the new xenial code in clusters. I'll likely end up adding the -disableClientsUniqueId flag to the default agent behavior and using Puppet facts to generate a unique-to-host node name, but I haven't done so yet or fully validated that approach.
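A minimal sketch of that approach, assuming the values pass through to the upstream jenkins::slave class; slave_name is a real parameter there, but whether the pinned module version exposes a toggle for -disableClientsUniqueId is an assumption, so this may require patching the module:

class { 'jenkins::slave':
  # Derive a stable, unique-to-host node name from a Puppet fact.
  slave_name                => "agent-${::hostname}",
  # Assumed parameter mapping to the swarm client's -disableClientsUniqueId
  # flag; verify it exists in the module version in use.
  disable_clients_unique_id => true,
}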

tfoote commented 7 years ago

The executor nodes on that farm use AWS's auto scaling and, I think, just use an image based on a "golden" puppet-configured machine rather than re-running the puppet each time one is spun up. @tfoote and I had an offline conversation about it, and I can't recall whether we're currently using an AWS Jenkins plugin but wanted more from it, or whether we're using AWS's built-in scaling based on CPU load and haven't incorporated the plugin because it doesn't add the things we want.

We have a snapshot AMI that AWS autoscales, and it starts the jenkins swarm agent via a puppet update, so changes to puppet will propagate.


For the tags, I'd highly recommend not relying on the hostnames, but instead using labels related to the capabilities of the machines. For example, on_master or on_jenkins could be labels for the executors running on the same machine as the master. This will move us away from requiring multiple hosts, and we won't care what the names are.
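A sketch of label-based agent configuration, again assuming the values pass through to the upstream jenkins::slave class and that its labels parameter accepts a space-separated string in this module version; the label values are just examples:

class { 'jenkins::slave':
  # Placeholder master URL.
  masterurl => 'http://jenkins-master.example.com:8080',
  # Capability labels instead of hostname-derived names; jobs can then
  # be restricted to a label like on_master without caring which
  # machine the executor runs on.
  labels    => 'on_master on_jenkins',
}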

gavanderhoorn commented 6 years ago

The new xenial deployment scripts don't seem to have this problem, so closing.