It appears this may need to be done slightly differently. git-pbuilder defaults to base-DIST-ARCH.cow, and I haven't found a way around this.
Since we only save back to the cowbuilder during the update script, and looking back through my logs that's also the only point where we hit snags, perhaps the script should instead check whether the cowbuilder is already being updated, and then wait until it's done?
@mikepurvis you mentioned hitting at least one of these; can you confirm that it was only in the cowbuilder-update.py step?
I hit it in two places, both while doing an initial "force all"—
This is what it looks like on the cowbuilder-update side for me: https://gist.github.com/mikepurvis/8715769
Even with just a dozen or so jobs, there seem to be a lot of conflicts on startup—which happens every time you make a release, since you have to bounce the buildbot master.
@mikepurvis https://github.com/mikeferguson/buildbot-ros/commit/d319f61a37eb6dbc56b4ec20861a20f8a37b7a18 should avoid the conflicts in the cowbuilder updates. There is a potential problem here: if too many builders wait too long, they may get cut off by the buildbot (though I believe the default timeout is around 20 minutes). If this works for you, I'll apply something similar to the reprepro update, which will avoid conflicts there.
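For reference, the idea is roughly the following (a minimal sketch, not the actual contents of that commit; the lock path, timeout, and function names are illustrative):

```python
import errno
import os
import time

LOCK_PATH = "/tmp/buildbot_precise_amd64_lock"  # illustrative; the real name depends on dist/arch
WAIT_TIMEOUT = 20 * 60   # roughly the buildbot default mentioned above
POLL_INTERVAL = 30       # seconds between checks while another builder holds the lock


def acquire_update_lock(path=LOCK_PATH, timeout=WAIT_TIMEOUT):
    """Create the lock file atomically; if another builder already holds it,
    poll until it disappears or we give up."""
    deadline = time.time() + timeout
    while True:
        try:
            # O_EXCL makes the create fail if the file already exists,
            # so only one builder "wins" at a time.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            if time.time() > deadline:
                return False  # gave up; the caller should fail the build loudly
            time.sleep(POLL_INTERVAL)


def release_update_lock(path=LOCK_PATH):
    try:
        os.remove(path)
    except OSError:
        pass
```

The important part is that the actual cowbuilder update runs between acquire and release, so concurrent builders for the same dist/arch serialize instead of stepping on each other.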
Merged and deployed now; so far it looks great. Will keep you posted.
Okay, still seeing some issues in the includedeb step, too—here's an example:
Yeah, I haven't addressed the includedeb yet.
Which is why the ticket is still open...
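The plan for that step is the same trick: wrap the reprepro call in a wait-for-lock helper so includedeb calls against the same repo serialize. A rough sketch, reusing the helper from the earlier comment (the lock path and repo layout are assumptions, not merged code):

```python
import subprocess

REPO_LOCK = "/tmp/buildbot_repo_lock"  # hypothetical per-repository lock file


def includedeb_with_lock(repo_dir, codename, deb_path):
    # Serialize access to the reprepro database; concurrent includedeb calls
    # against the same repo are what trip over each other today.
    if not acquire_update_lock(REPO_LOCK):
        raise RuntimeError("timed out waiting for the repository lock")
    try:
        subprocess.check_call(
            ["reprepro", "-b", repo_dir, "includedeb", codename, deb_path])
    finally:
        release_update_lock(REPO_LOCK)
```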
The lockfile has gotten stuck for me twice now; I haven't figured out exactly why it's happening yet, but will let you know as it becomes clearer.
When you say lockfile, you're talking about the includedeb, right? Not the patch made for the cowbuilder update?
No, this is /tmp/buildbot_precise_amd64_lock, which is created and destroyed by cowbuilder-update.py. When the file gets stuck there, all the amd64 builds fail, of course.
Not sure if it's an issue with the particulars of how builds are failing, or a race condition, or something to do with how the reconfig is happening.
For now, I've just had to go in and manually delete the lock to get it going again.
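If it helps narrow it down: with a plain create/delete lock file, anything that kills the update between creating and removing the file (a failed update, or the master bouncing during a reconfig) leaves the lock behind forever. Two ways around that would be a try/finally around the update, or an fcntl.flock that the kernel releases automatically when the process dies. A rough sketch of the flock variant (illustrative only, not what cowbuilder-update.py currently does):

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/buildbot_precise_amd64_lock"  # same illustrative path as above


def update_cowbuilder_with_lock(update_cmd):
    # Hold the flock for the duration of the update. The lock is tied to the
    # open file descriptor, so if this process dies or is killed mid-update,
    # the kernel releases it and nothing is left lying around in /tmp.
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until any other builder finishes
        try:
            subprocess.check_call(update_cmd)  # whatever command the update script runs
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```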
I'm seeing this same issue on trusty: sometimes the /tmp/buildbot_precise_amd64_lock gets left lying around and blocks other builds.
Oops, I mean /tmp/buildbot_trusty_amd64_lock in my case.
This is probably for a separate ticket, but I feel like there may be other tooling out there that would handle many parallel builds more gracefully: specifically Docker for better isolation between builds, and Aptly for better repo management (especially the snapshot feature).
I haven't spent much time debugging this, because long term I'm planning to move to Docker. The ROS buildfarm scripts are being rewritten right now using Docker, and Dirk and I are already planning to meet up and figure out how we can make buildbot-ros and the Jenkins farm use the same underlying Docker scripts.
The lock issue isn't a huge problem. If I can track it down and there's an easy short-term fix, I'll submit a PR. Long term, Docker sounds like a nice solution for keeping builds isolated.
No longer supported, about to archive