Open james-powis opened 6 years ago
@james-powis Can you provide details of the docker cloud configuration you're using? Is it possible that you've got a docker template defined with the same image? If that's the case then this issue is the same as #655, and it would be very useful to confirm that.
Actually yes, we are using cloud (global) templates because job-specific templates had all sorts of issues (I never could get them to work, nor could I figure out what useful detail to provide in an issue)...
My gut feeling is that there are deep conflicts between the global cloud provider config (and its templates) and the build job ones... Leaving everything except "Restrict where this project can be run - Label Expression" empty and relying on the global config and its templates was the only way I could get it to work with any measure of reliability.
Off topic... but would you like to join in my campaign of eradicating XML from the face of the earth?
OK here is the relevant section of config.xml
clouds block (sanitized for my comfort)
<clouds>
<com.nirima.jenkins.plugins.docker.DockerCloud plugin="docker-plugin@1.1.4">
<name>apollo00</name>
<templates>
<com.nirima.jenkins.plugins.docker.DockerTemplate>
<configVersion>2</configVersion>
<labelString></labelString>
<connector class="io.jenkins.docker.connector.DockerComputerSSHConnector">
<sshKeyStrategy class="io.jenkins.docker.connector.DockerComputerSSHConnector$InjectSSHKey">
<user>jenkins</user>
</sshKeyStrategy>
<port>22</port>
<jvmOptions></jvmOptions>
<javaPath></javaPath>
<prefixStartSlaveCmd></prefixStartSlaveCmd>
<suffixStartSlaveCmd></suffixStartSlaveCmd>
<maxNumRetries>2</maxNumRetries>
<retryWaitTime>10</retryWaitTime>
</connector>
<remoteFs>/home/jenkins</remoteFs>
<instanceCap>2147483647</instanceCap>
<mode>NORMAL</mode>
<retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
<idleMinutes>10</idleMinutes>
</retentionStrategy>
<dockerTemplateBase>
<image>1.2.3.4:4567/docker/jenkins-docker-slave:latest</image>
<pullCredentialsId>scrubbed</pullCredentialsId>
<dockerCommand></dockerCommand>
<hostname></hostname>
<dnsHosts/>
<network></network>
<volumes/>
<volumesFrom2/>
<environment/>
<bindPorts></bindPorts>
<bindAllPorts>false</bindAllPorts>
<privileged>false</privileged>
<tty>false</tty>
<extraHosts class="empty-list"/>
</dockerTemplateBase>
<removeVolumes>true</removeVolumes>
<pullStrategy>PULL_ALWAYS</pullStrategy>
<pullTimeout>0</pullTimeout>
<nodeProperties class="empty-list"/>
<disabled>
<disabledByChoice>false</disabledByChoice>
</disabled>
</com.nirima.jenkins.plugins.docker.DockerTemplate>
<com.nirima.jenkins.plugins.docker.DockerTemplate>
<configVersion>2</configVersion>
<labelString>robot-framework</labelString>
<connector class="io.jenkins.docker.connector.DockerComputerAttachConnector">
<user>root</user>
</connector>
<remoteFs>/home/jenkins</remoteFs>
<instanceCap>2147483647</instanceCap>
<mode>EXCLUSIVE</mode>
<retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
<idleMinutes>10</idleMinutes>
</retentionStrategy>
<dockerTemplateBase>
<image>1.2.3.4:4567/docker/robot_framework_alpine:latest</image>
<pullCredentialsId>scrubbed</pullCredentialsId>
<dockerCommand></dockerCommand>
<hostname></hostname>
<dnsHosts/>
<network>ac79a17f4e54</network>
<volumes>
<string>/var/run/docker.sock:/var/run/docker.sock</string>
</volumes>
<volumesFrom2/>
<environment/>
<bindPorts></bindPorts>
<bindAllPorts>true</bindAllPorts>
<privileged>false</privileged>
<tty>false</tty>
<extraHosts class="empty-list"/>
</dockerTemplateBase>
<removeVolumes>true</removeVolumes>
<pullStrategy>PULL_ALWAYS</pullStrategy>
<pullTimeout>0</pullTimeout>
<nodeProperties class="empty-list"/>
<disabled>
<disabledByChoice>false</disabledByChoice>
</disabled>
</com.nirima.jenkins.plugins.docker.DockerTemplate>
</templates>
<dockerApi>
<dockerHost plugin="docker-commons@1.13">
<uri>unix:///var/run/docker.sock</uri>
</dockerHost>
<connectTimeout>60</connectTimeout>
<readTimeout>0</readTimeout>
<hostname>1.2.3.4</hostname>
</dockerApi>
<containerCap>2147483647</containerCap>
<exposeDockerHost>true</exposeDockerHost>
<disabled>
<disabledByChoice>false</disabledByChoice>
</disabled>
</com.nirima.jenkins.plugins.docker.DockerCloud>
</clouds>
So that config.xml has its own images, and none of those are the same as the image in the job.xml - is that correct? Or does your current version of your job.xml contain the same image names?
FYI I suspect that the code is getting confused over usage counts because I think it's using the image name as a unique ID to map containers to templates, which won't work when there are job-specific templates with the same image name as a cloud template's image name...
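To illustrate the suspicion above, here is a minimal sketch of what goes wrong when containers are matched to templates by image name alone. The `Template` and `Container` classes, the `countByImage` helper, and the image name `example/agent:latest` are all hypothetical and are not taken from the plugin's source; the point is only that two distinct templates sharing an image end up sharing a usage count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; NOT the docker-plugin's real classes.
final class ImageKeyedCount {
    // A template is identified here by a label plus the image it runs.
    static final class Template {
        final String label;
        final String image;
        Template(String label, String image) { this.label = label; this.image = image; }
    }

    // A running container only remembers the image it was started from.
    static final class Container {
        final String id;
        final String image;
        Container(String id, String image) { this.id = id; this.image = image; }
    }

    // Counts the containers "belonging" to a template by comparing image names only.
    static long countByImage(Template t, List<Container> running) {
        return running.stream().filter(c -> c.image.equals(t.image)).count();
    }

    public static void main(String[] args) {
        Template cloudTemplate = new Template("robot-framework", "example/agent:latest");
        Template jobTemplate = new Template("job-1234", "example/agent:latest");

        List<Container> running = new ArrayList<>();
        running.add(new Container("c1", cloudTemplate.image)); // started for the cloud template

        // The job template has started nothing, yet it is charged for c1.
        System.out.println(countByImage(jobTemplate, running));
    }
}
```

If the real code behaves like this, a unique per-template key (rather than the image name) would be needed to keep the counts apart.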
@pjdarton If I were you I would think the same thing... However, in my case I started by defining the templates in the job; only when that did not work did I switch to the global config.
To demonstrate this I created a new job which uses a public image (jenkinsci/jnlp-slave) which is not defined anywhere in the global config. This job sleeps for 60 seconds before exiting. Here is the resulting log:
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave
INFO: Provisioning 'jenkinsci/jnlp-slave' number 1 (of 1) on 'apollo00'
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Will provision 'jenkinsci/jnlp-slave', for label: 'cf61da4a-7771-4574-8e75-c49025586731', in cloud: 'apollo00'
Aug 08, 2018 10:35:26 AM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
INFO: Started provisioning Image of jenkinsci/jnlp-slave from apollo00 with 1 executors. Remaining excess workload: 0
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
INFO: Pulling image 'jenkinsci/jnlp-slave:latest'. This may take awhile...
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
INFO: Finished pulling image 'jenkinsci/jnlp-slave:latest', took 759 ms
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Trying to run container for jenkinsci/jnlp-slave
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Trying to run container for node 2a5808028124ef from image: jenkinsci/jnlp-slave
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Not provisioning additional slaves for cf61da4a-7771-4574-8e75-c49025586731; we have 1 executors being started already
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Not provisioning additional slaves for cf61da4a-7771-4574-8e75-c49025586731; we have 1 executors being started already
Aug 08, 2018 10:35:31 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Started container ID 86a0f0e203d2d17cd7faa5aaa37e19084243340901181cc83056daf9483fe8e0 for node 2a5808028124ef from image: jenkinsci/jnlp-slave
Aug 08, 2018 10:35:34 AM hudson.model.AbstractCIBase updateComputer
WARNING: Node apollo00-vm has no executors. Cannot update the Computer instance of it
channel started
<<<<<<< HERE IS WHERE I TRIGGERED THE SECOND RUN WHILE THE FIRST WAS RUNNING >>>>>>
Aug 08, 2018 10:35:42 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: aa2a73b4-cb94-45ca-b016-4d2fa62cd81e
Aug 08, 2018 10:35:42 AM com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave
INFO: Not Provisioning 'jenkinsci/jnlp-slave'. Template instance limit of '1' reached on cloud 'apollo00'
and the associated config for the job
10:39 /srv/docker/jenkins/jobs/whalesay$ cat config.xml
<?xml version='1.1' encoding='UTF-8'?>
<project>
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty plugin="docker-plugin@1.1.4">
<cloudname>apollo00</cloudname>
<template>
<configVersion>2</configVersion>
<labelString></labelString>
<connector class="io.jenkins.docker.connector.DockerComputerAttachConnector">
<user></user>
</connector>
<remoteFs></remoteFs>
<instanceCap>2147483647</instanceCap>
<mode>NORMAL</mode>
<retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
<idleMinutes>10</idleMinutes>
</retentionStrategy>
<dockerTemplateBase>
<image>jenkinsci/jnlp-slave</image>
<pullCredentialsId></pullCredentialsId>
<dockerCommand></dockerCommand>
<hostname></hostname>
<dnsHosts/>
<network></network>
<volumes/>
<volumesFrom2/>
<environment/>
<bindPorts></bindPorts>
<bindAllPorts>false</bindAllPorts>
<privileged>false</privileged>
<tty>false</tty>
<extraHosts class="empty-list"/>
</dockerTemplateBase>
<removeVolumes>false</removeVolumes>
<pullStrategy>PULL_ALWAYS</pullStrategy>
<pullTimeout>300</pullTimeout>
<nodeProperties class="empty-list"/>
<disabled>
<disabledByChoice>false</disabledByChoice>
</disabled>
</template>
</com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty>
<com.dabsquared.gitlabjenkins.connection.GitLabConnectionProperty plugin="gitlab-plugin@1.5.6">
<gitLabConnection>apollo00</gitLabConnection>
</com.dabsquared.gitlabjenkins.connection.GitLabConnectionProperty>
<com.sonyericsson.rebuild.RebuildSettings plugin="rebuild@1.28">
<autoRebuild>false</autoRebuild>
<rebuildDisabled>false</rebuildDisabled>
</com.sonyericsson.rebuild.RebuildSettings>
</properties>
<scm class="hudson.scm.NullSCM"/>
<canRoam>true</canRoam>
<disabled>false</disabled>
<blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
<blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
<triggers/>
<concurrentBuild>true</concurrentBuild>
<builders>
<hudson.tasks.Shell>
<command>echo "Hello Beautiful World"
sleep 60
echo "Goodbye Cruel World"
</command>
</hudson.tasks.Shell>
</builders>
<publishers/>
<buildWrappers/>
</project>
Seeing a similar issue.
I have only global templates, and sometimes it "hangs" and I see this in the logs: Not provisioning additional slaves for docker-XXXX; we have 1 executors being started already
This happens although instanceCap is set to 100.
I've run into this one too.
My guess is that this comes from the default constructor setting instanceCap to 1: src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java#L105
And later, in the instanceCap algorithm (/src/main/java/com/nirima/jenkins/plugins/docker/DockerCloud.java#L617), it matches on image name and then runs into this limit.
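A rough sketch of that guess, assuming a job-level template whose short constructor falls back to a cap of 1 and a cap check that counts running containers by image name. `JobTemplate` and `canProvision` are hypothetical stand-ins, not the plugin's actual code at the lines referenced above.

```java
import java.util.List;

// Hypothetical stand-in for a job-level template; not the plugin's real code.
final class CapCheckSketch {
    static final class JobTemplate {
        final String image;
        final int instanceCap;

        // Assumed behaviour from the thread: the short constructor falls back to a cap of 1.
        JobTemplate(String image) { this(image, 1); }

        JobTemplate(String image, int instanceCap) {
            this.image = image;
            this.instanceCap = instanceCap;
        }
    }

    // Returns true when another container may be started for this template.
    // Running containers are matched to the template by image name only.
    static boolean canProvision(JobTemplate t, List<String> runningImages) {
        long running = runningImages.stream().filter(t.image::equals).count();
        return running < t.instanceCap;
    }

    public static void main(String[] args) {
        JobTemplate t = new JobTemplate("jenkinsci/jnlp-slave");
        // First build: nothing is running yet.
        System.out.println(canProvision(t, List.of()));
        // Second, concurrent build: one container with the same image already runs.
        System.out.println(canProvision(t, List.of("jenkinsci/jnlp-slave")));
    }
}
```

Under these assumptions the second concurrent build is refused, which would be consistent with the "Template instance limit of '1' reached" line in the log earlier in the thread.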
I've "fixed" this by hacking into the job templates and changing that value via a groovy system script: https://gist.github.com/glance-/0137aad1e04f9e69b589de63e85206dd
It was a bit tricky due to the final modifier there, but with the reflection hammer it was fixable.
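The gist itself is a Groovy system script and is not reproduced here. As a general illustration of the technique, this Java sketch overwrites a `final` instance field via reflection on a hypothetical `Template` class; the class, the field name, and the `overrideCap` helper are assumptions made for the example.

```java
import java.lang.reflect.Field;

// Hypothetical classes for illustration; not the plugin's DockerTemplate.
final class ReflectionCapOverride {
    static final class Template {
        private final int instanceCap;
        Template(int instanceCap) { this.instanceCap = instanceCap; }
        int getInstanceCap() { return instanceCap; }
    }

    // Overwrites the final instance field on an existing object.
    static void overrideCap(Template t, int newCap) {
        try {
            Field f = Template.class.getDeclaredField("instanceCap");
            // setAccessible(true) grants write access to a final, non-static field
            // of an ordinary (non-record, non-hidden) class.
            f.setAccessible(true);
            f.setInt(t, newCap);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not override instanceCap", e);
        }
    }

    public static void main(String[] args) {
        Template t = new Template(1);
        overrideCap(t, Integer.MAX_VALUE);
        System.out.println(t.getInstanceCap());
    }
}
```

This is a workaround rather than a fix: it bypasses the class's own invariants, and changes made this way would presumably need to be reapplied whenever the object is recreated, for example after a restart or a config reload.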
OK, this one is a bit weird, and I would submit a PR if I even remotely understood why this behavior is different. Long story short: when the instanceCap setting is defined in an individual build job, it always defaults to 1 and cannot be overridden, whereas when it is defined globally in the main config, it defaults to no cap and can be overridden.
Relevant config used in both job and main configs: