
[JENKINS-21930] Jobs fail due to "node went offline during the build" #11022

Open timja opened 10 years ago

timja commented 10 years ago

I keep getting failures on random jobs (see example below) when I take a slave offline.

I thought the purpose of "take slave offline" (versus disconnecting a slave) is that running jobs can continue to run, but no new jobs are started, and I can then disconnect the slave when all jobs are finished (we have a small script which does exactly that, to take a slave out of the cluster).
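The drain sequence described above (mark offline, let running jobs finish, then disconnect) can be sketched as a small self-contained model. This is illustrative only; `DrainableNode` and its methods are hypothetical stand-ins, not the Jenkins API:

```java
// Toy model of the drain workflow: mark the node offline, wait for running
// jobs to finish, then disconnect. Not Jenkins core code -- an illustration
// of the behavior the reporter expects.
class DrainableNode {
    private boolean temporarilyOffline = false;
    private boolean connected = true;
    private int runningJobs;

    DrainableNode(int runningJobs) { this.runningJobs = runningJobs; }

    // Step 1: stop accepting new jobs; running ones are unaffected.
    void markTemporarilyOffline() { temporarilyOffline = true; }

    boolean acceptsNewJobs() { return connected && !temporarilyOffline; }

    // Running builds finish one at a time.
    void finishOneJob() { if (runningJobs > 0) runningJobs--; }

    // Step 2: only once the node is idle is it safe to cut the connection.
    void disconnectIfIdle() {
        if (temporarilyOffline && runningJobs == 0) connected = false;
    }

    boolean isConnected() { return connected; }
    int runningJobs() { return runningJobs; }
}
```

In this model, marking the node offline never touches the connection; the connection is only dropped later, after the last running job has completed.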

With the current behaviour, it is impossible to cleanly shutdown a slave.

Expected: Taking a slave offline should NEVER have any impact on any of the jobs running on that slave. They should not even be aware of the fact.

Looks like the node went offline during the build. Check the slave log for the details.FATAL: /var/lib/jenkins/logs/slaves/null/slave.log (No such file or directory)
java.io.FileNotFoundException: /var/lib/jenkins/logs/slaves/null/slave.log (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.kohsuke.stapler.framework.io.LargeText$FileSession.<init>(LargeText.java:397)
    at org.kohsuke.stapler.framework.io.LargeText$2.open(LargeText.java:120)
    at org.kohsuke.stapler.framework.io.LargeText.writeLogTo(LargeText.java:210)
    at hudson.console.AnnotatedLargeText.writeHtmlTo(AnnotatedLargeText.java:159)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:605)
    at hudson.model.Run.execute(Run.java:1568)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:236)

Originally reported by marc_guenther, imported from: Jobs fail due to "node went offline during the build"
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 10 years ago

danielbeck:

By 'Take slave offline' you mean 'Mark this node temporarily offline', right? What kind of slave is this, JNLP, SSH, ...?

timja commented 10 years ago

danielbeck:

Also, what retention strategy did you select? 'Keep online as much as possible'?

timja commented 10 years ago

marc_guenther:

We have mostly Swarm slaves, and I usually use the jenkins-cli offline-node command, although I might have used the "Mark this node temporarily offline" button once in a while. I am not sure if it also happens on ssh slaves.

timja commented 10 years ago

danielbeck:

Both are the same feature, marking a node offline rather than disconnecting. This is not supposed to cut any connections. Is this reproducible on a pristine Jenkins instance, no plugins etc.?

timja commented 10 years ago

marc_guenther:

I didn't try that, but I just looked at the source code, and found this:

https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/AbstractBuild.java#L538

public Result run(@Nonnull BuildListener listener) throws Exception {
    ....
    Computer c = node.toComputer();
    if (c==null || c.isOffline()) {
        // As can be seen in HUDSON-5073, when a build fails because of the slave connectivity problem,
        // error message doesn't point users to the slave. So let's do it here.
        listener.hyperlink("/computer/"+builtOn+"/log","Looks like the node went offline during the build. Check the slave log for the details.");
    ...

And the isOffline() method also checks for the temporarilyOffline status:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Computer.java#L507

    public boolean isOffline() {
        return temporarilyOffline || getChannel()==null;
    }

So, you are correct, the connection is not broken, but the check in this case is wrong. It should check only for the channel, and not the temporarilyOffline status.
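The distinction being argued here can be shown with a minimal self-contained sketch. `ComputerState` and `connectionLost()` are hypothetical stand-ins, not the real `Computer` class; only the `isOffline()` body mirrors the snippet quoted above:

```java
// Illustrative stand-in for the two states discussed above. In Jenkins,
// isOffline() is true when the node is *either* administratively marked
// temporarily offline *or* actually disconnected (channel == null).
class ComputerState {
    boolean temporarilyOffline;
    Object channel; // non-null while the agent connection is alive

    // Mirrors the Computer.isOffline() body quoted above.
    boolean isOffline() { return temporarilyOffline || channel == null; }

    // The check Marc proposes for "did the node drop during the build":
    // only the channel matters; the administrative flag should not.
    boolean connectionLost() { return channel == null; }
}
```

For a node that is marked temporarily offline but still connected, `isOffline()` is true while `connectionLost()` is false, which is exactly why the build-time check misfires.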

timja commented 10 years ago

danielbeck:

Marc: Agreed. Am currently looking into it.

However, this will not fail a build!

timja commented 10 years ago

wdjonsson:

I am also experiencing this issue on some of our Windows Server 2008 nodes. I manually mark the node as offline, and some of them don't like it (even though the configuration for the nodes should be identical). Will update this issue if I find more detailed reproduction steps.

My stack trace is slightly different: instead of null I have a valid computer name, like so:
Looks like the node went offline during the build. Check the slave log for the details.FATAL: /var/lib/jenkins/logs/slaves/EC2-W8S-01/slave.log (No such file or directory)

EC2-W8S-01 being a node name.

timja commented 10 years ago

danielbeck:

wdjonsson: Could you please specify what issue you're experiencing? Bogus messages in the log, or build failures because the node went offline?

timja commented 9 years ago

damien_coraboeuf:

Hi,

We are experiencing this issue on Jenkins 1.584.

The scenario is the following:

  1. we have a job that puts a node offline in order to run some maintenance on it, using computer.setTemporarilyOffline(true, new OfflineCause.UserCause(User.current(), "Putting the slave offline for maintenance purpose"))
  2. some jobs are still running on this node
  3. some of those still-running jobs then fail because of the error mentioned in this issue - not always, but from time to time

Do you know if a workaround or solution has been found for this problem?

Thanks,
Damien.

timja commented 9 years ago

damien_coraboeuf:

It seems the code mentioned above no longer exists. Might it have been fixed as part of another ticket?

timja commented 9 years ago

damien_coraboeuf:

After analysis of the stack trace and of the code of 1.584, it happens after the run is complete, when the AbstractBuildRunner attempts to write the annotated log.

timja commented 2 years ago

[Originally related to: JENKINS-24123]