jenkinsci / openstack-cloud-plugin

Provision nodes from OpenStack on demand
https://plugins.jenkins.io/openstack-cloud
MIT License
47 stars 83 forks source link

Instance cap is reached although the instances are not allocated in OS #74

Open winklerm opened 8 years ago

winklerm commented 8 years ago

It seems that either the instance cap is not reset properly, or the message printed to the log is misleading. The following log excerpt corresponds to the state when there were no nodes provisioned by OpenStackCloud plugin visible in QEOS. However, it starts with complaining about reaching the instance cap:

May 18, 2016 6:59:04 AM jenkins.plugins.openstack.compute.JCloudsCloud getAvailableTemplateProvider
INFO: Global instance cap (60) reached while adding capacity for label: RHEL
May 18, 2016 6:59:04 AM jenkins.plugins.openstack.compute.JCloudsSlaveTemplate provision
INFO: Provisioning new openstack node os-rhel7-715 with options jenkins.plugins.openstack.compute.SlaveOptions@44b9cef5[imageId=bxms-rhel7.2-snapshot,hardwareId=4,networkId=<null>,userDataId=jenkins.plugins.openstack.compute.UserDataConfig.1454073256508,instanceCap=60,floatingIpPool=public,securityGroups=default,availabilityZone=nova,startTimeout=1800000,keyPairName=bxms-qe-jenkins,numExecutors=1,jvmOptions=-Xmx700m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Djavax.net.ssl.trustStore=/qa/tools/keystore/mw.truststore -Djavax.net.ssl.trustStorePassword=changeit -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9110 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false,fsRoot=/home/hudson/hudson_workspace,credentialsId=c7c5fd01-0bc7-48dc-bb50-c0e46d2b2464,slaveType=SSH,retentionTime=30]
...
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:9.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:8.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:7.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:6.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:5.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:4.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:3.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:2.738923
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:1.7389231
May 18, 2016 6:59:04 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel7 from qeos with 1 executors. Remaining excess workload:0.7389231

I still have from time to time clean the OpenStack instances - they are usually not in ERROR state, but in ACTIVE, although they often do not have IP address associated (not even the internal IP, i.e. not the Floating-IP address). I then have to shut them down using QEOS API - could this mean that the plugin is still tracking them somewhere and believes that the instance count is higher than it actually is?

olivergondza commented 8 years ago

The plugin calculates number of running nodes based on machines running in openstack (with plugin fingerprint to exclude machines provisioned otherwise) as that is the only reliable way to ensure we will not spin more machines. If it ware tracking Jenkins node number, it will miss machines being provisioned/deleted, leaked machines, incorrectly provisioned but still running machines. So in case you have machines running with no Jenkins node counterparts, that is likely the cause of the problem.

Please make sure, before you delete such "dangling machine" to grep jenkins log and report it so we can identify and fix the cause. Or use Cloud Statistics plugin to highlight these problems. Btw, I fixed something similar yesterday: https://github.com/jenkinsci/openstack-cloud-plugin/commit/641328806517ee694e22dc75b5ca921c0a95171e

winklerm commented 8 years ago

Great, thanks! I will keep an eye on this and try to grep the log in the case of a leaked machine.

Btw. the Cloud Statistics is awesome!

winklerm commented 8 years ago

I encountered the following log - it seems that the machine is attempted to be provisioned even though the instance cap has been reached?

INFO: Started provisioning os-rhel6 from qeos with 1 executors. Remaining excess workload:0.85063934
May 21, 2016 6:37:56 AM hudson.slaves.NodeProvisioner update
INFO: Started provisioning os-rhel6 from qeos with 1 executors. Remaining excess workload:-0.14936066
May 21, 2016 6:38:04 AM jenkins.plugins.openstack.compute.JCloudsCloud getAvailableTemplateProvider
INFO: Global instance cap (60) reached while adding capacity for label: wls12
May 21, 2016 6:38:04 AM jenkins.plugins.openstack.compute.JCloudsSlaveTemplate provision
INFO: Provisioning new openstack node os-rhel7-wls-570 with options jenkins.plugins.openstack.compute.SlaveOptions@611be1f5[imageId=bxms-rhel7-wls-snapshot,hardwareId=4,networkId=<null>,userDataId=jenkins.plugins.openstack.compute.UserDataConfig.1454073256508,instanceCap=60,floatingIpPool=public,securityGroups=default,availabilityZone=nova,startTimeout=1800000,keyPairName=bxms-qe-jenkins,numExecutors=1,jvmOptions=-Xmx700m -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Djavax.net.ssl.trustStore=/qa/tools/keystore/mw.truststore -Djavax.net.ssl.trustStorePassword=changeit -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9110 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false,fsRoot=/home/hudson/hudson_workspace,credentialsId=c7c5fd01-0bc7-48dc-bb50-c0e46d2b2464,slaveType=SSH,retentionTime=30]
olivergondza commented 8 years ago

@winklerm, I am not quite satisfied by the way the instance cap is reported (https://github.com/jenkinsci/openstack-cloud-plugin/blob/master/src/main/java/jenkins/plugins/openstack/compute/JCloudsCloud.java#L195-L249). It can mean both it was reached while provisioning (so it can be the 60th machine starting in your case) or provisioning with the cloud was aborted as instance cap was reached by previous provisioning attemps.

winklerm commented 8 years ago

@olivergondza, I see, thanks!