timja / jenkins-gh-issues-poc-06-18


[JENKINS-46067] Pipeline task scheduled on uninitialized node #9296

Open timja opened 7 years ago

timja commented 7 years ago

One of our nodes disconnected due to a channel disconnect. Meanwhile a pipeline job was running, and as soon as the node reconnected, Jenkins tried to schedule a task on it, which failed with the error:

ERROR: Issue with creating launcher for agent . The agent has not been fully initialized yet
ERROR: Issue with creating launcher for agent . The agent has not been fully initialized yet
I have attached the console output from the agent.


Originally reported by benh57, imported from: Pipeline task scheduled on uninitialized node
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 7 years ago

oleg_nenashev:

It will likely be fixed by JENKINS-45023. Or at least you will get a failure earlier.

timja commented 6 years ago

oleg_nenashev:

benh57 Please retest the issue on Remoting 3.14

timja commented 6 years ago

johnlengeling:

We upgraded to Jenkins LTS 2.89.1, which is supposed to include Remoting 3.14, and we are seeing some builds fail with the same error message. These are elastic OpenStack build nodes. We are also using a Slave Setup "Setup Script After Copy" script, if that matters.

ERROR: Issue with creating launcher for agent tph-build-c1.2xlarge-8428. The agent has not been fully initialized yet
ERROR: Issue with creating launcher for agent tph-build-c1.2xlarge-8428. The agent has not been fully initialized yet
remote file operation failed: /opt/jenkins/workspace/fh_guide_master-DAUIRXWDCG4RH3OVIMVYS3W2QDGGAIJIWFEPCAIQEFWE7YG5KZDQ at hudson.remoting.Channel@681ae622:tph-build-c1.2xlarge-8428: hudson.remoting.ChannelClosedException: Remote call on tph-build-c1.2xlarge-8428 failed. The channel is closing down or has closed down

timja commented 6 years ago

oleg_nenashev:

johnlengeling, would you be able to attach...

1) Jenkins master logs
2) Agent logs
3) OpenStack VM provisioning/termination logs
4) Agent configuration and the OpenStack plugin versions (if any)

timja commented 6 years ago

johnlengeling:

Oleg,

We are running Jenkins 2.89.1 with OpenStack Plugin 2.29. We are running a declarative pipeline job with a parallel section which runs 7 different build steps on 7 different nodes that are spun up elastically on OpenStack Ocata. We see this failure intermittently.

We see the following error in the console output:

[beautify none-noarch] Still waiting to schedule task
[beautify none-noarch] All nodes of label ‘openstack’ are offline
[beautify none-noarch] Running on team-ph-build-c1.2xlarge-738 in /opt/jenkins/workspace/h_foo_master-DAUIRXWDCG4RH3OVIMVYS3W2QDGGAIJIWFEPCAIQEFWE7YG5KZDQ
[Pipeline] [beautify none-noarch] {
[Pipeline] [beautify none-noarch] checkout
[beautify none-noarch] ERROR: Issue with creating launcher for agent team-ph-build-c1.2xlarge-738. The agent has not been fully initialized yet
[beautify none-noarch] ERROR: Issue with creating launcher for agent team-ph-build-c1.2xlarge-738. The agent has not been fully initialized yet

 

And the following stack trace at the end of the console log:

 

java.io.EOFException
 at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2675)
 at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3150)
 at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:859)
 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:355)
 at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
 at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused: hudson.remoting.ChannelClosedException: Remote call on team-ph-build-c1.2xlarge-738 failed. The channel is closing down or has closed down
 at hudson.remoting.Channel.call(Channel.java:901)
 at hudson.FilePath.act(FilePath.java:986)
Caused: java.io.IOException: remote file operation failed: /opt/jenkins/workspace/h_foo_master-DAUIRXWDCG4RH3OVIMVYS3W2QDGGAIJIWFEPCAIQEFWE7YG5KZDQ at hudson.remoting.Channel@2d334044:team-ph-build-c1.2xlarge-738
 at hudson.FilePath.act(FilePath.java:993)
 at hudson.FilePath.act(FilePath.java:975)
 at hudson.FilePath.mkdirs(FilePath.java:1158)
 at hudson.plugins.git.GitSCM.createClient(GitSCM.java:747)
 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1117)
 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113)
 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:85)
 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:75)
 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
 at hudson.security.ACL.impersonate(ACL.java:260)
 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Finished: FAILURE

 

Attached is an excerpt of the catalina.out and also the node log showing the Agent Setup script running.

timja commented 6 years ago

starfire_:

We have the same problem on Jenkins 2.60.3; the host system is Debian 9 with KVM/QEMU and the libvirt plugin. The OS on the VMs is Windows 7 with the Jenkins service installed. We have configured the VMs to shut down after each job and to revert to a certain snapshot when shutting down.

We have noticed that the machines do not shut down every time, especially if there is heavy load (lots of jobs in the queue) or if the job failed during compilation. And if they do not shut down, this error appears.

timja commented 6 years ago

sames:

As a workaround, does anyone know how to set up the retry so that it will try to use a different node? I can set up some retries around functions executing on the node (e.g.)

jenkinsObject.retry(3)
{
   jenkinsObject.doTheThing() // Do a clever thing on this node
}

However I'm not sure how to set up a retry around a failed node. Infrastructure fails sometimes; fact of life. I know there's gotta be a way to compensate for that so that my Jenkins jobs are a little more robust, but I'm not seeing it...
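
A minimal sketch of that idea (scripted Pipeline, hypothetical label 'linux'): wrapping the whole node block in retry(3) re-requests an executor on each attempt, so a subsequent attempt can land on a different agent if the first one fails.

retry(3) {
    node('linux') {
        // the work that previously failed with the launcher error
        sh 'make build'
    }
}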

timja commented 6 years ago

fnaum:

Somehow this is happening more often in our instance, Jenkins 2.116 with the latest pipeline plugins. (Note: we do not have any OpenStack* plugin installed.)

I have attached the master and slave logs:

jenkins.master.trimmed.log

slave.log.1

At this stage, I'm with sames: I'm happy with any workaround.

timja commented 6 years ago

ajferrigno:

We just got this exact error for the first time in one of our pipeline jobs running many things in parallel. It stinks because there is no recovery whatsoever, and even the "post { always {}}" block did not execute. This worries me because it means there is potential for jobs not cleaning up after themselves. 

I am trying to determine whether this is supposed to be fixed, but I can't tell. Can't the underlying code just wait a couple seconds for the slave to become fully available, rather than crash the whole job? The referenced ticket (45023) seems to somewhat address that, but it seems to not take effect in my installation.

We are using Jenkins version 2.130 with relatively recent pipeline plugins (pipeline:api is v2.28, pipeline:job is v2.22, pipeline:multibranch is v2.19, pipeline:utility is v2.1.0). Nothing was upgraded recently, and this same job has worked many times in the past.

One possible cause I can see is that our Jenkins nodes are too overloaded, causing them to shut down in the first place. So maybe we should add more nodes?

timja commented 6 years ago

jthompson:

ajferrigno, it would be wise to make sure none of the nodes are overloaded. All sorts of unusual symptoms can occur when systems are overloaded. An overloaded system can cause a node or a connection to fail. Perhaps something needs to be more robust to these conditions, but reducing the load is a good step.

timja commented 6 years ago

jthompson:

It sounds like there are possibly a few different, though interacting problems going on here.

Agents / nodes fail or otherwise lose their connection at times. I don't see enough information here to suggest any causes, other than system issues such as overload or other lack of resources.

Some portions of the system could be more resilient in the face of such failures. Pipeline, for one example, could better handle these failures or provide more tools for handling them.

There may be some issue with sequencing of nodes starting up and other things interacting with them, such as pipeline. Again, there isn't sufficient information to determine what the problematic sequencing might be.

Possibly other issues, also.

timja commented 5 years ago

fnaum:

Just to mention that in the past I did monitor the CPU, memory, I/O, and network for the nodes to check if there was an overload, and found that this issue happens even when there is almost no load on the node.

timja commented 5 years ago

hazok:

Noticed this issue started occurring after upgrading to 2.138.1 and the latest plugins a couple weeks ago. Reliability has gone down significantly after the update. Voting for this issue.

 

timja commented 5 years ago

jglick:

johnlengeling’s issue sounds like a garden-variety Remoting channel outage, which could have any of a host of root causes. Other reports sound a bit like JENKINS-41854, though there is not enough detail to tell. I was not familiar with that particular error message; from full-text search I can find this (plus an apparently copied usage in a much more obscure plugin).

timja commented 5 years ago

jglick:

If reporter/commenters are running the Docker Slaves plugin, that may be to blame. I do not think this plugin is maintained. CC ndeloof

timja commented 5 years ago

wreggyl:

We also saw this kind of issue with the AWS EC2 plugin creating a new EC2 instance/slave on the fly. Most of the time we get this error on Windows machines when the slave is started via WinRM; on Unix machines where the slave is started via SSH we never saw this issue.

timja commented 5 years ago

rg:

We saw this but are not using the Docker Slaves or AWS EC2 plugin. We are using Windows machines connected via JNLP.

timja commented 5 years ago

wreggyl:

rg How did you start the slave? Every time we saw this issue it was with automatically and newly created slaves, never with the slaves we create manually and start via Java Web or via cmd. Also, the slaves newly created via the Swarm Plugin didn't cause this issue, only the slaves created on the fly by the EC2 plugin.

timja commented 5 years ago

rg:

I don't know for sure whether it was Java Web or some other way, but definitely a permanent node. Is it possible that it happens more with automatically created agents as they are added/removed more frequently? It's not something we see frequently so we could have just been unlucky!

timja commented 5 years ago

jglick:

Never mind, I missed a usage of this message pattern in Jenkins core in Slave.reportLauncherCreateError.

timja commented 5 years ago

jglick:

Tracking to oleg_nenashev’s changes for JENKINS-38527, I found the older JENKINS-23305 which sounds like it is tracking a similar problem. I have no clue as to the root cause here. It should go without saying that if anyone knows how to reproduce the error from scratch they should speak up. Otherwise I am not sure whether there is any way to proceed.

timja commented 5 years ago

jglick:

The stack traces of the form

java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
    at hudson.model.Slave.reportLauncherCreateError(Slave.java:524)
    at hudson.model.Slave.createLauncher(Slave.java:496)
    at org.jenkinsci.plugins.workflow.support.DefaultStepContext.makeLauncher(DefaultStepContext.java:112)
    at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:68)
    at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:258)
    at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:206)
    at …

suggest that the problem arises when starting a sh, checkout, or similar step inside a node block. That would imply that this is a duplicate of JENKINS-41854. No idea why the frequency of that issue might have changed recently.
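
A minimal sketch of the pipeline shape that stack trace corresponds to (scripted Pipeline, hypothetical label): the launcher is only created when the first step inside the node block actually needs the agent, which is where the error surfaces if the agent is not fully initialized yet.

node('some-label') {
    // the first step that needs a launcher on the agent fails here
    sh 'make build'
}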

timja commented 5 years ago

caley:

I am seeing this issue consistently when trying to use the post.aborted.container['slave'] below.

  post {
    aborted {
      echo "testaborted"
      container('slave') {
        sh "ls"
      }
    }
    failure {
      echo "testfail"
      container('slave') {
        sh "ls"
      }
    }
  }

What is interesting is I can use the post.failure.container['slave'] above all day long and I can spin up a container and exit cleanly.

Jenkins 2.147, Kubernetes Plugin 1.13.6

[Pipeline] withCredentials
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
[Pipeline] {
[Pipeline] sh
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
[Pipeline] }
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
ERROR: Issue with creating launcher for agent multi-image-1l98r-vc3rk. The agent has not been fully initialized yet
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // waitUntil
[Pipeline] }
[Pipeline] // timeout
Error when executing aborted post condition:
hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.255.1.162/10.255.1.162:11312 failed. The channel is closing down or has closed down
    at hudson.remoting.Channel.call(Channel.java:948)
    at hudson.FilePath.act(FilePath.java:1070)
    at hudson.FilePath.act(FilePath.java:1059)
    at hudson.FilePath.mkdirs(FilePath.java:1244)
    at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.<init>(FileMonitoringTask.java:171)
    at org.jenkinsci.plugins.durabletask.BourneShellScript$ShellController.<init>(BourneShellScript.java:197)
    at org.jenkinsci.plugins.durabletask.BourneShellScript$ShellController.<init>(BourneShellScript.java:189)
    at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:110)
    at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:98)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:264)
    at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:270)
    at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:178)
    at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
    at sun.reflect.GeneratedMethodAccessor969.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
    at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
    at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
    at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:155)
    at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
    at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
    at WorkflowScript.run(WorkflowScript:198)
    at ___cps.transform___(Native Method)
    at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
    at sun.reflect.GeneratedMethodAccessor516.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
    at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
    at com.cloudbees.groovy.cps.Next.step(Next.java:83)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
    at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
    at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
    at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
    at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$101(SandboxContinuable.java:34)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.lambda$run0$0(SandboxContinuable.java:59)
    at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:58)
    at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:182)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
    at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
    at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
    ... 3 more 
timja commented 5 years ago

benh57:

We are relatively updated now, and hit this again when restarting our agents. Jenkins 2.154 / Remoting 3.27. The node had just connected via JNLP.

For us, I think it would be pretty easy to reproduce if we made a task that continually restarts agents while a job is scheduled via pipeline to run on them.
Cannot contact : hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from /:56650 failed. The channel is closing down or has closed down
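
A rough sketch of that reproduction idea (scripted Pipeline, hypothetical label 'restartable'): run this in a loop while something else repeatedly restarts the agents carrying that label.

for (int i = 0; i < 100; i++) {
    node('restartable') {
        // any step that needs a launcher on the freshly (re)connected agent
        sh 'true'
    }
}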

timja commented 2 years ago

[Originally related to: JENKINS-45023]