jenkinsci / docker-plugin

Jenkins cloud plugin that uses Docker
https://plugins.jenkins.io/docker-plugin/
MIT License

Template instanceCap is ignored when defined in job config #629

Open james-powis opened 6 years ago

james-powis commented 6 years ago

OK, this one is a bit weird, and I would submit a PR if I even remotely understood why the behavior differs. Long story short: when instanceCap is defined in an individual build job, it always defaults to 1 and cannot be overridden, whereas when it is defined globally in the main config, it defaults to no cap and can be overridden.

Relevant config used in both job and main configs:

    <com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty plugin="docker-plugin@1.1.3">
      <cloudname>test</cloudname>
      <template>
        <configVersion>2</configVersion>
        <labelString>robot-framework</labelString>
        <connector class="io.jenkins.docker.connector.DockerComputerAttachConnector">
          <user></user>
        </connector>
        <remoteFs>/home/jenkins</remoteFs>
        <instanceCap></instanceCap>
        <mode>EXCLUSIVE</mode>
        <retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
          <timeout>10</timeout>
        </retentionStrategy>
        <dockerTemplateBase>
          <image>1.2.3.4:4567/docker/robot_framework:jenkins</image>
          <pullCredentialsId></pullCredentialsId>
          <dockerCommand></dockerCommand>
          <hostname></hostname>
          <dnsHosts/>
          <network></network>
          <volumes/>
          <volumesFrom2/>
          <environment/>
          <bindPorts></bindPorts>
          <bindAllPorts>false</bindAllPorts>
          <privileged>false</privileged>
          <tty>false</tty>
          <extraHosts/>
        </dockerTemplateBase>
        <removeVolumes>true</removeVolumes>
        <pullStrategy>PULL_ALWAYS</pullStrategy>
        <nodeProperties class="empty-list"/>
      </template>
    </com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty>

pjdarton commented 6 years ago

@james-powis Can you provide details of the docker cloud configuration you're using? Is it possible that you've got a docker template defined with the same image? If that's the case then this issue is the same as #655, and it would be very useful to confirm that.

james-powis commented 6 years ago

Actually yes, we are using cloud (global) templates because job-specific templates had all sorts of issues (I never could get them to work, nor could I figure out what useful detail to provide in an issue)...

My gut feeling is that there are deep conflicts between the global cloud provider config (and its templates) and the build job ones... Leaving everything but the "Restrict where this project can be run" Label Expression empty, and relying on the global config and its templates, was the only way I could get it to work with any measure of reliability.

Off topic... but would you like to join in my campaign of eradicating XML from the face of the earth?

james-powis commented 6 years ago

OK, here is the relevant section of the config.xml clouds block (sanitized for my comfort):

  <clouds>
    <com.nirima.jenkins.plugins.docker.DockerCloud plugin="docker-plugin@1.1.4">
      <name>apollo00</name>
      <templates>
        <com.nirima.jenkins.plugins.docker.DockerTemplate>
          <configVersion>2</configVersion>
          <labelString></labelString>
          <connector class="io.jenkins.docker.connector.DockerComputerSSHConnector">
            <sshKeyStrategy class="io.jenkins.docker.connector.DockerComputerSSHConnector$InjectSSHKey">
              <user>jenkins</user>
            </sshKeyStrategy>
            <port>22</port>
            <jvmOptions></jvmOptions>
            <javaPath></javaPath>
            <prefixStartSlaveCmd></prefixStartSlaveCmd>
            <suffixStartSlaveCmd></suffixStartSlaveCmd>
            <maxNumRetries>2</maxNumRetries>
            <retryWaitTime>10</retryWaitTime>
          </connector>
          <remoteFs>/home/jenkins</remoteFs>
          <instanceCap>2147483647</instanceCap>
          <mode>NORMAL</mode>
          <retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
            <idleMinutes>10</idleMinutes>
          </retentionStrategy>
          <dockerTemplateBase>
            <image>1.2.3.4:4567/docker/jenkins-docker-slave:latest</image>
            <pullCredentialsId>scrubbed</pullCredentialsId>
            <dockerCommand></dockerCommand>
            <hostname></hostname>
            <dnsHosts/>
            <network></network>
            <volumes/>
            <volumesFrom2/>
            <environment/>
            <bindPorts></bindPorts>
            <bindAllPorts>false</bindAllPorts>
            <privileged>false</privileged>
            <tty>false</tty>
            <extraHosts class="empty-list"/>
          </dockerTemplateBase>
          <removeVolumes>true</removeVolumes>
          <pullStrategy>PULL_ALWAYS</pullStrategy>
          <pullTimeout>0</pullTimeout>
          <nodeProperties class="empty-list"/>
          <disabled>
            <disabledByChoice>false</disabledByChoice>
          </disabled>
        </com.nirima.jenkins.plugins.docker.DockerTemplate>
        <com.nirima.jenkins.plugins.docker.DockerTemplate>
          <configVersion>2</configVersion>
          <labelString>robot-framework</labelString>
          <connector class="io.jenkins.docker.connector.DockerComputerAttachConnector">
            <user>root</user>
          </connector>
          <remoteFs>/home/jenkins</remoteFs>
          <instanceCap>2147483647</instanceCap>
          <mode>EXCLUSIVE</mode>
          <retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
            <idleMinutes>10</idleMinutes>
          </retentionStrategy>
          <dockerTemplateBase>
            <image>1.2.3.4:4567/docker/robot_framework_alpine:latest</image>
            <pullCredentialsId>scrubbed</pullCredentialsId>
            <dockerCommand></dockerCommand>
            <hostname></hostname>
            <dnsHosts/>
            <network>ac79a17f4e54</network>
            <volumes>
              <string>/var/run/docker.sock:/var/run/docker.sock</string>
            </volumes>
            <volumesFrom2/>
            <environment/>
            <bindPorts></bindPorts>
            <bindAllPorts>true</bindAllPorts>
            <privileged>false</privileged>
            <tty>false</tty>
            <extraHosts class="empty-list"/>
          </dockerTemplateBase>
          <removeVolumes>true</removeVolumes>
          <pullStrategy>PULL_ALWAYS</pullStrategy>
          <pullTimeout>0</pullTimeout>
          <nodeProperties class="empty-list"/>
          <disabled>
            <disabledByChoice>false</disabledByChoice>
          </disabled>
        </com.nirima.jenkins.plugins.docker.DockerTemplate>
      </templates>
      <dockerApi>
        <dockerHost plugin="docker-commons@1.13">
          <uri>unix:///var/run/docker.sock</uri>
        </dockerHost>
        <connectTimeout>60</connectTimeout>
        <readTimeout>0</readTimeout>
        <hostname>1.2.3.4</hostname>
      </dockerApi>
      <containerCap>2147483647</containerCap>
      <exposeDockerHost>true</exposeDockerHost>
      <disabled>
        <disabledByChoice>false</disabledByChoice>
      </disabled>
    </com.nirima.jenkins.plugins.docker.DockerCloud>
  </clouds>

pjdarton commented 6 years ago

So that config.xml has its own images, and none of those are the same as the image in the job.xml - is that correct? Or does your current version of your job.xml contain the same image names?

FYI, I suspect that the code is getting confused over usage counts because I think it's using the image name as a unique ID to map containers to templates, which won't work when there are job-specific templates with the same image name as a cloud template's image name...
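To make the suspicion above concrete, here is a minimal Java sketch of what goes wrong when running containers are counted by image name rather than by the template that created them. Every class and method name here (`ImageKeyedCountSketch`, `Template`, `Container`, `countByImage`, `countByTemplate`) is hypothetical and invented for illustration; this is not the plugin's code, only a model of the hypothesis.

```java
import java.util.Arrays;
import java.util.List;

public class ImageKeyedCountSketch {
    /** Hypothetical stand-in for a template; NOT the plugin's DockerTemplate. */
    static final class Template {
        final String name;
        final String image;
        final int instanceCap;

        Template(String name, String image, int instanceCap) {
            this.name = name;
            this.image = image;
            this.instanceCap = instanceCap;
        }
    }

    /** Hypothetical running container that remembers which template created it. */
    static final class Container {
        final String image;
        final String templateName;

        Container(String image, String templateName) {
            this.image = image;
            this.templateName = templateName;
        }
    }

    /** Counts every running container that shares the template's image name. */
    static long countByImage(List<Container> running, Template t) {
        return running.stream().filter(c -> c.image.equals(t.image)).count();
    }

    /** Counts only the containers that were created from this exact template. */
    static long countByTemplate(List<Container> running, Template t) {
        return running.stream().filter(c -> c.templateName.equals(t.name)).count();
    }

    public static void main(String[] args) {
        // Two templates that happen to use the same image (assumed scenario).
        Template cloudTemplate = new Template("cloud", "example/agent", 100);
        Template jobTemplate = new Template("job", "example/agent", 1);

        // One container is running, and it came from the cloud template.
        List<Container> running = Arrays.asList(new Container("example/agent", "cloud"));

        // Keyed by image, the job template appears to be at its cap of 1 already.
        System.out.println(countByImage(running, jobTemplate) >= jobTemplate.instanceCap);    // true
        // Keyed by template identity, the job template has nothing running.
        System.out.println(countByTemplate(running, jobTemplate) >= jobTemplate.instanceCap); // false
    }
}
```

If the real counting works anything like `countByImage`, a container started from one template would count against any other template that uses the same image.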

james-powis commented 6 years ago

@pjdarton If I were you I would think the same thing... However, in my case I started by defining the templates in the job; only when that did not work did I configure things the global config way.

To demonstrate this I created a new job which uses a public image (jenkinsci/jnlp-slave) which is not defined anywhere in the global config. This job sleeps for 60 seconds before exiting. Here is the resulting log:

Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave
INFO: Provisioning 'jenkinsci/jnlp-slave' number 1 (of 1) on 'apollo00'
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Will provision 'jenkinsci/jnlp-slave', for label: 'cf61da4a-7771-4574-8e75-c49025586731', in cloud: 'apollo00'
Aug 08, 2018 10:35:26 AM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
INFO: Started provisioning Image of jenkinsci/jnlp-slave from apollo00 with 1 executors. Remaining excess workload: 0
Aug 08, 2018 10:35:26 AM com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
INFO: Pulling image 'jenkinsci/jnlp-slave:latest'. This may take awhile...
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate pullImage
INFO: Finished pulling image 'jenkinsci/jnlp-slave:latest', took 759 ms
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Trying to run container for jenkinsci/jnlp-slave
Aug 08, 2018 10:35:27 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Trying to run container for node 2a5808028124ef from image: jenkinsci/jnlp-slave
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Not provisioning additional slaves for cf61da4a-7771-4574-8e75-c49025586731; we have 1 executors being started already
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: cf61da4a-7771-4574-8e75-c49025586731
Aug 08, 2018 10:35:29 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Not provisioning additional slaves for cf61da4a-7771-4574-8e75-c49025586731; we have 1 executors being started already
Aug 08, 2018 10:35:31 AM com.nirima.jenkins.plugins.docker.DockerTemplate doProvisionNode
INFO: Started container ID 86a0f0e203d2d17cd7faa5aaa37e19084243340901181cc83056daf9483fe8e0 for node 2a5808028124ef from image: jenkinsci/jnlp-slave
Aug 08, 2018 10:35:34 AM hudson.model.AbstractCIBase updateComputer
WARNING: Node apollo00-vm has no executors. Cannot update the Computer instance of it
channel started
<<<<<<< HERE IS WHERE I TRIGGERED THE SECOND RUN WHILE THE FIRST WAS RUNNING >>>>>>
Aug 08, 2018 10:35:42 AM com.nirima.jenkins.plugins.docker.DockerCloud provision
INFO: Asked to provision 1 slave(s) for: aa2a73b4-cb94-45ca-b016-4d2fa62cd81e
Aug 08, 2018 10:35:42 AM com.nirima.jenkins.plugins.docker.DockerCloud canAddProvisionedSlave
INFO: Not Provisioning 'jenkinsci/jnlp-slave'. Template instance limit of '1' reached on cloud 'apollo00'

and the associated config for the job:

10:39  /srv/docker/jenkins/jobs/whalesay$ cat config.xml 
<?xml version='1.1' encoding='UTF-8'?>
<project>
  <actions/>
  <description></description>
  <keepDependencies>false</keepDependencies>
  <properties>
    <com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty plugin="docker-plugin@1.1.4">
      <cloudname>apollo00</cloudname>
      <template>
        <configVersion>2</configVersion>
        <labelString></labelString>
        <connector class="io.jenkins.docker.connector.DockerComputerAttachConnector">
          <user></user>
        </connector>
        <remoteFs></remoteFs>
        <instanceCap>2147483647</instanceCap>
        <mode>NORMAL</mode>
        <retentionStrategy class="com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy">
          <idleMinutes>10</idleMinutes>
        </retentionStrategy>
        <dockerTemplateBase>
          <image>jenkinsci/jnlp-slave</image>
          <pullCredentialsId></pullCredentialsId>
          <dockerCommand></dockerCommand>
          <hostname></hostname>
          <dnsHosts/>
          <network></network>
          <volumes/>
          <volumesFrom2/>
          <environment/>
          <bindPorts></bindPorts>
          <bindAllPorts>false</bindAllPorts>
          <privileged>false</privileged>
          <tty>false</tty>
          <extraHosts class="empty-list"/>
        </dockerTemplateBase>
        <removeVolumes>false</removeVolumes>
        <pullStrategy>PULL_ALWAYS</pullStrategy>
        <pullTimeout>300</pullTimeout>
        <nodeProperties class="empty-list"/>
        <disabled>
          <disabledByChoice>false</disabledByChoice>
        </disabled>
      </template>
    </com.nirima.jenkins.plugins.docker.DockerJobTemplateProperty>
    <com.dabsquared.gitlabjenkins.connection.GitLabConnectionProperty plugin="gitlab-plugin@1.5.6">
      <gitLabConnection>apollo00</gitLabConnection>
    </com.dabsquared.gitlabjenkins.connection.GitLabConnectionProperty>
    <com.sonyericsson.rebuild.RebuildSettings plugin="rebuild@1.28">
      <autoRebuild>false</autoRebuild>
      <rebuildDisabled>false</rebuildDisabled>
    </com.sonyericsson.rebuild.RebuildSettings>
  </properties>
  <scm class="hudson.scm.NullSCM"/>
  <canRoam>true</canRoam>
  <disabled>false</disabled>
  <blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
  <blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
  <triggers/>
  <concurrentBuild>true</concurrentBuild>
  <builders>
    <hudson.tasks.Shell>
      <command>echo &quot;Hello Beautiful World&quot;
sleep 60
echo &quot;Goodbye Cruel World&quot;
</command>
    </hudson.tasks.Shell>
  </builders>
  <publishers/>
  <buildWrappers/>
</project>

alexindigo commented 6 years ago

Seeing a similar issue. I have only global templates, and sometimes it "hangs" and I see this in the logs: Not provisioning additional slaves for docker-XXXX; we have 1 executors being started already

Although instanceCap is set to 100.

glance- commented 5 years ago

I've run into this one too.

My guess is that this comes from the default constructor setting instanceCap to 1: src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java#L105

And later, in the instanceCap algorithm (/src/main/java/com/nirima/jenkins/plugins/docker/DockerCloud.java#L617), it matches on image name and then runs into this limit.
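The guess above can be modeled with a small Java sketch. It assumes (without having confirmed it against the plugin source) that a template built through a default construction path keeps a cap of 1 because the configured value never reaches it. `DefaultCapSketch`, `Template`, and `canProvision` are hypothetical names used only for illustration.

```java
public class DefaultCapSketch {
    /** Hypothetical stand-in for a template; NOT the plugin's DockerTemplate. */
    static final class Template {
        final String image;
        final int instanceCap;

        /** Assumed default path: the configured cap is never passed in, so it stays at 1. */
        Template(String image) {
            this(image, 1);
        }

        Template(String image, int instanceCap) {
            this.image = image;
            this.instanceCap = instanceCap;
        }
    }

    /** A new container may start only while the running count is below the cap. */
    static boolean canProvision(Template t, int runningWithSameImage) {
        return runningWithSameImage < t.instanceCap;
    }

    public static void main(String[] args) {
        Template jobTemplate = new Template("jenkinsci/jnlp-slave");
        System.out.println(canProvision(jobTemplate, 0)); // true: the first build gets a container
        System.out.println(canProvision(jobTemplate, 1)); // false: the second concurrent build is refused

        // With the configured value actually applied, the second build would be allowed.
        Template configured = new Template("jenkinsci/jnlp-slave", Integer.MAX_VALUE);
        System.out.println(canProvision(configured, 1)); // true
    }
}
```

Under that assumption the sketch matches the log earlier in the thread: the first run is provisioned as "number 1 (of 1)", and the second run is rejected with a template instance limit of '1', even though the job's config.xml sets instanceCap to 2147483647.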

I've "fixed" this by hacking into the job templates and changing that value via a groovy system script: https://gist.github.com/glance-/0137aad1e04f9e69b589de63e85206dd

It was a bit tricky due to the final modifier there, but with the reflections hammer it was fixable.
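For readers unfamiliar with the "reflection hammer", here is a self-contained Java sketch of the general technique: writing to a non-static final field through `java.lang.reflect.Field`. It is not the linked gist and does not touch Jenkins; `FinalFieldSketch`, `Template`, and `overrideCap` are hypothetical names, and the field is deliberately assigned in the constructor rather than from a compile-time constant.

```java
import java.lang.reflect.Field;

public class FinalFieldSketch {
    /** Hypothetical class with a final field; NOT the plugin's DockerTemplate. */
    static final class Template {
        private final int instanceCap;

        Template(int instanceCap) {
            this.instanceCap = instanceCap;
        }

        int getInstanceCap() {
            return instanceCap;
        }
    }

    /** Overwrites the final instance field via reflection, wrapping checked exceptions. */
    static void overrideCap(Template t, int newCap) {
        try {
            Field field = Template.class.getDeclaredField("instanceCap");
            // setAccessible(true) is what grants write access to a non-static final field.
            field.setAccessible(true);
            field.setInt(t, newCap);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not override instanceCap", e);
        }
    }

    public static void main(String[] args) {
        Template t = new Template(1);
        overrideCap(t, Integer.MAX_VALUE);
        System.out.println(t.getInstanceCap());
    }
}
```

This only works for instance fields of ordinary classes; static final fields and record components cannot be rewritten this way, and newer Java releases are moving toward restricting it further, so it is best treated as a stopgap rather than a fix.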