timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-25698] Archiving artifacts (or FilePath operatons) should properly time out and retry the copy #5651

Open timja opened 9 years ago

timja commented 9 years ago

I have two related requests regarding how the artifacts are copied, assuming that no proper syncing will be implemented any time soon:

1. The copying process when archiving artifacts should impose a timeout on network operations. If nothing has been sent for some period of time, it should break the connection. What happens now is that when the connection stalls, our builds run forever or until someone cancels it manually.

2. When copying fails, it should retry (perhaps after waiting a moment.) This plays well with the timeout because it means that when a timeout occurs, the copy fails and gets retried.

Remember, the first fallacy of distributed computing is "The network is reliable." So you can't assume a single attempt to copy the data will always work.


Originally reported by trejkaz, imported from: Archiving artifacts (or FilePath operatons) should properly time out and retry the copy
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 9 years ago

ikedam:

Newtwork operations in copyartifact-plugin depends on FilePath provided by Jenkins core (I added core to components).

Though I'm not sure how Jenkins core manages network connections between a master and slaves, I suppose that a file transfer hangup means disconnection between the master and the slave, and Jenkins can no longer continue the job.

Anyway, let us know how to reproduce the problem if you know.
Or, please get threaddumps (both for the master and the slave) if you face the same problem again.
That will be a great help for us to see what happens internally.

timja commented 9 years ago

trejkaz:

Ah, if it goes over the same connection as the other traffic then maybe this is a wider (core) issue. I'll make a note to get thread dumps the next time it happens (assuming I get to it before someone else).

Actually it looks we do have the master's stack because it appeared in the logs. Looking at this closer I now feel like I should have submitted this directly against core... For some reason I was under the impression that the copy artifacts plugin was responsible for the copy in both directions. :S

java.io.IOException: java.io.IOException: Failed to extract /opt/data/jenkins/workspace/trunk-compile/os/debian/transfer of 13213 files
    at hudson.FilePath.readFromTar(FilePath.java:2286)
    at hudson.FilePath.copyRecursiveTo(FilePath.java:2197)
    at jenkins.model.StandardArtifactManager.archive(StandardArtifactManager.java:61)
    at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:218)
    at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:74)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
    at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
    at hudson.model.Build$BuildExecution.post2(Build.java:183)
    at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
    at hudson.model.Run.execute(Run.java:1784)
    at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)
Caused by: java.io.IOException: java.lang.InterruptedException
    at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:177)
    at hudson.util.HeadBufferingStream.read(HeadBufferingStream.java:61)
    at com.jcraft.jzlib.InflaterInputStream.fill(InflaterInputStream.java:175)
    at com.jcraft.jzlib.InflaterInputStream.read(InflaterInputStream.java:106)
    at org.apache.tools.tar.TarBuffer.readBlock(TarBuffer.java:257)
    at org.apache.tools.tar.TarBuffer.readRecord(TarBuffer.java:223)
    at hudson.org.apache.tools.tar.TarInputStream.read(TarInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
    at hudson.util.IOUtils.copy(IOUtils.java:40)
    at hudson.FilePath.readFromTar(FilePath.java:2276)
    ... 13 more
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
    ... 25 more

    at hudson.FilePath.copyRecursiveTo(FilePath.java:2204)
    at jenkins.model.StandardArtifactManager.archive(StandardArtifactManager.java:61)
    at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:218)
    at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:74)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
    at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
    at hudson.model.Build$BuildExecution.post2(Build.java:183)
    at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
    at hudson.model.Run.execute(Run.java:1784)
    at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.Request$1.get(Request.java:274)
    at hudson.remoting.Request$1.get(Request.java:206)
    at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
    at hudson.FilePath.copyRecursiveTo(FilePath.java:2200)
    ... 12 more
Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.Request.abort(Request.java:295)
    at hudson.remoting.Channel.terminate(Channel.java:814)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
Caused by: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
    at java.io.ObjectInputStream.(ObjectInputStream.java:299)
    at hudson.remoting.ObjectInputStreamEx.(ObjectInputStreamEx.java:40)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
timja commented 9 years ago

danielbeck:

This may also help:
https://wiki.jenkins-ci.org/display/JENKINS/Remoting+issue

timja commented 9 years ago

trejkaz:

My memory of diagnosing the networking issues for some other issue previously is that it turned out our network was just unreliable. We have already gone to extreme lengths to do stuff about that - we replaced the entire Windows master with a Linux master and moved all the builds across to that, but the issues persist and it appears to be unrelated to which kind of slave is being contacted.

So at the moment, we're assuming it's just the nature of networks being unreliable.

The core problem from our point of view is Jenkins' reaction to this sort of thing.

What should happen:

What actually happens:

The first fallacy of distributed computing is "The network is reliable." Jenkins' behaviour today embraces this fallacy quite strongly and we're basically saying that we think it would be better if it didn't make that sort of assumption. The network is not reliable, and Jenkins is doing a pretty bad job of coping with that.

(As an aside, we also think that Jenkins causing failed builds just because jobs after a "successful" build failed is unreasonable, as it means fingers get pointed as us, when it's simply not us to blame. There should be a different status for "the build system fucked up" so that we don't get prodded via IM every time Jenkins screws something up. I'm sick of getting floods of "you broke the build" and "you fixed the build" notices which are simply not true because the build itself was never broken in the first place.)

timja commented 9 years ago

danielbeck:

There are plugins that can tolerate master/slave disconnects during job execution.

Note that there are already issues in this tracker requesting more network issue tolerance. If this is just going to be the same issue, this should be resolved as duplicate.

timja commented 9 years ago

trejkaz:

It was reported as a specific issue, just in case it's possible to fix this one issue without fixing the whole core of Jenkins. Well over 90% of our "build failures" (actually Jenkins-fails, not build failures) happen during the copying of files in one direction or the other, so even fixing this one issue would improve things enormously. (Hell will probably freeze over before Jenkins devs start to care about reliability of the architecture itself. Many years have passed with no improvements in that area whatsoever, so I have no reason to start believing in it at this point in time.)

If there are any plugins which can tolerate disconnects during job execution, they appear to be missing from the list. https://wiki.jenkins-ci.org/display/JENKINS/Plugins#Plugins-Slavelaunchersandcontrollers

timja commented 9 years ago

danielbeck:

Commercial Jenkins Enterprise has http://jenkins-enterprise.cloudbees.com/docs/user-guide-bundle/long-running-build.html
The upcoming https://github.com/jenkinsci/workflow-plugin (no wiki yet) also supports connection aborts.
https://wiki.jenkins-ci.org/display/JENKINS/Durable+Task+Plugin can be used by plugins to add support for more resilient builds

Admittedly, they're all far from mature, or transparent to users. It's a rather recent effort.

timja commented 9 years ago

d3matt:

Just ran into this on my cluster... Adding an attachment. Job "Probe-OAM" is running on "plat_builds" and is stuck in archiving artifacts. Let me know if there's anything other than threadDump you want me to grab

timja commented 9 years ago

d3matt:

Here's the traceback when it finally timed out:

Archiving artifacts
ERROR: Failed to archive artifacts: probe/install/IRIS-*
java.io.IOException: java.io.IOException: Failed to extract /buildarea/jenkins/workspace/Probe-OAM/transfer of 4 files
    at hudson.FilePath.readFromTar(FilePath.java:2103)
    at hudson.FilePath.copyRecursiveTo(FilePath.java:2015)
    at jenkins.model.StandardArtifactManager.archive(StandardArtifactManager.java:61)
    at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:217)
    at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:74)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
    at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
    at hudson.model.Build$BuildExecution.post2(Build.java:183)
    at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
    at hudson.model.Run.execute(Run.java:1784)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)
Caused by: java.io.IOException: java.lang.InterruptedException
    at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:177)
    at hudson.util.HeadBufferingStream.read(HeadBufferingStream.java:61)
    at com.jcraft.jzlib.InflaterInputStream.fill(InflaterInputStream.java:175)
    at com.jcraft.jzlib.InflaterInputStream.read(InflaterInputStream.java:106)
    at org.apache.tools.tar.TarBuffer.readBlock(TarBuffer.java:257)
    at org.apache.tools.tar.TarBuffer.readRecord(TarBuffer.java:223)
    at hudson.org.apache.tools.tar.TarInputStream.read(TarInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:90)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
    at hudson.util.IOUtils.copy(IOUtils.java:40)
    at hudson.FilePath.readFromTar(FilePath.java:2093)
    ... 13 more
Caused by: java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:175)
    ... 25 more

    at hudson.FilePath.copyRecursiveTo(FilePath.java:2022)
    at jenkins.model.StandardArtifactManager.archive(StandardArtifactManager.java:61)
    at hudson.tasks.ArtifactArchiver.perform(ArtifactArchiver.java:217)
    at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:74)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
    at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
    at hudson.model.Build$BuildExecution.post2(Build.java:183)
    at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
    at hudson.model.Run.execute(Run.java:1784)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown: java.util.concurrent.TimeoutException: Ping started on 1417484041756 hasn't completed at 1417484281756
    at hudson.remoting.Request$1.get(Request.java:278)
    at hudson.remoting.Request$1.get(Request.java:210)
    at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
    at hudson.FilePath.copyRecursiveTo(FilePath.java:2018)
    ... 12 more
Caused by: hudson.remoting.RequestAbortedException: hudson.remoting.Channel$OrderlyShutdown: java.util.concurrent.TimeoutException: Ping started on 1417484041756 hasn't completed at 1417484281756
    at hudson.remoting.Request.abort(Request.java:299)
    at hudson.remoting.Channel.terminate(Channel.java:805)
    at hudson.remoting.Channel$CloseCommand.execute(Channel.java:954)
    at hudson.remoting.Channel$2.handle(Channel.java:474)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:60)
Caused by: hudson.remoting.Channel$OrderlyShutdown: java.util.concurrent.TimeoutException: Ping started on 1417484041756 hasn't completed at 1417484281756
    ... 3 more
Caused by: Command close created at
    at hudson.remoting.Command.(Command.java:56)
    at hudson.remoting.Channel$CloseCommand.(Channel.java:948)
    at hudson.remoting.Channel$CloseCommand.(Channel.java:946)
    at hudson.remoting.Channel.close(Channel.java:1029)
    at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:114)
    at hudson.remoting.PingThread.ping(PingThread.java:120)
    at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1417484041756 hasn't completed at 1417484281756
    ... 2 more
timja commented 9 years ago

trejkaz:

Attaching my own stack dumps from a stall this morning, in case they're any different. This is before anyone had a chance to kill the build.

timja commented 9 years ago

ikedam:

@d3matt Let me know what do you mean with "timed out".
Did it automatically stop? (or someone clicked the x button?)

timja commented 9 years ago

d3matt:

It did that on it's own, no intervention.

timja commented 9 years ago

ikedam:

@trejkaz What's the hanged job name?

timja commented 9 years ago

trejkaz:

Executor #0 for U-Igglepiggle : executing trunk-compile/os=ubuntu #3556

(we only have one job per slave, so it is easy to deduce from the dump.)

timja commented 9 years ago

ikedam:

Both of stacktraces point here: https://github.com/jenkinsci/remoting/blob/remoting-2.48/src/main/java/hudson/remoting/FastPipedInputStream.java#L175
There is a timeout (default 10 secs) and there may be an infinite loop calling that.

timja commented 9 years ago

ikedam:

Oh, there's an infinite loop just before that.
It means the opposite output stream isn't closed. Who should close it?

timja commented 9 years ago

danielbeck:

Matthew: Your issue looks more like JENKINS-22914. The underlying exception causes are relevant.

This report is EOF, while yours is a forced abort due to ping timeout.

timja commented 7 years ago

danielbeck:

oleg_nenashev Are there changes in remoting since this was reported that make this issue obsolete?

timja commented 6 years ago

simonmartineaugv:

This is still a valid issue. I am using jenkins 2.1.30 and I still have this problem. 

I run very large build that produce many GB of artifacts. Those builds run for hours. It is very frustrating when it come to the end to see the job hanging in the middle of the transfer. I know I can get access the files directly on the builder but then, it means I am running a build that will never be archived and can't be released.

timja commented 2 years ago

[Originally related to: JENKINS-16629]