ros-infrastructure / buildfarm_deployment

Apache License 2.0

Some jobs on build.ros.org seem to be failing with no apparent reason #206

Closed clalancette closed 4 years ago

clalancette commented 6 years ago

A handful of jobs on build.ros.org have started failing in the last day. Investigation into the console logs shows no apparent cause; examples are:

All seem to have failed while installing packages through apt, but there are no (apparent) failures listed in the apt logs. @nuclearsandwich FYI

nuclearsandwich commented 6 years ago

We talked about this a bit offline and I don't have any leads. If it's observed in the future, we should try to do some forensics on the node before it gets culled. All of the instances @clalancette reported were on Debian Stretch, which would have suggested a correlation, but @mikaelarguedas's example (now added to the issue body) was for a doc job running on Bionic.

nuclearsandwich commented 6 years ago

I saw another one of these today and I think they're the result of a "graceful" death when a node is scaled in. In the past we've seen big nasty connection failure stack traces, but those were usually due to unplanned node losses, as opposed to a node being intentionally shut down by an over-eager scale-in metric.

tfoote commented 4 years ago

It's been a long time since we have seen this, and the links are for the previous generation of the Jenkins server, so I'm going to close this. We've also made our scale-in policy more conservative.