Not adding nodes: at or above minimum nodes and no queued jobs...
Cluster was modified less than 180 seconds ago
Waiting for cluster to stabilize...
Sleeping...(looping again in 60 secs)
Not adding nodes: at or above minimum nodes and no queued jobs...
Looking for nodes to remove...
Idle node node007 (i-a2876381) has been up for 55 minutes past the hour
Idle node node008 (i-7884605b) has been up for 49 minutes past the hour
Idle node node010 (i-05fc1826) has been up for 58 minutes past the hour
Idle node node012 (i-81fb1fa2) has been up for 45 minutes past the hour
*** WARNING - Removing node007: i-a2876381 (ec2-54-81-87-184.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node007 from SGE
Updating SGE parallel environment 'orte'
6/6 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Adding parallel environment 'orte' to queue 'all.q'
Running plugin starcluster.clustersetup.DefaultClusterSetup
Removing node node007 (i-a2876381)...
Removing node007 from known_hosts files
Removing node007 from /etc/hosts
Removing node007 from NFS
Terminating node: node007 (i-a2876381)
*** WARNING - Removing node008: i-7884605b (ec2-54-81-38-187.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node008 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node008
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
self._cluster.remove_node(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
force=force)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
reverse=True)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
self.run_plugin(plug, method_name=method_name, node=node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
func(*args)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
self._remove_from_sge(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
master.ssh.execute('qconf -de %s' % node.alias)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node008' failed with status 1:
Host object "node008" is still referenced in cluster queue "all.q".
*** WARNING - Removing node010: i-05fc1826 (ec2-54-80-69-227.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node010 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node010
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
self._cluster.remove_node(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
force=force)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
reverse=True)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
self.run_plugin(plug, method_name=method_name, node=node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
func(*args)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
self._remove_from_sge(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
master.ssh.execute('qconf -de %s' % node.alias)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node010' failed with status 1:
Host object "node010" is still referenced in cluster queue "all.q".
*** WARNING - Removing node012: i-81fb1fa2 (ec2-54-80-193-84.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node012 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node012
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
self._cluster.remove_node(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
force=force)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
reverse=True)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
self.run_plugin(plug, method_name=method_name, node=node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
func(*args)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
self._remove_from_sge(node)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
master.ssh.execute('qconf -de %s' % node.alias)
File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node012' failed with status 1:
Host object "node012" is still referenced in cluster queue "all.q".
Sleeping...(looping again in 60 secs)
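Hi! I would like to report a load balancer error. When the balancer removes idle nodes, the SGE plugin fails with 'Host object "nodeXXX" is still referenced in cluster queue "all.q"', so some nodes are kept alive without jobs.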
Thank you!
Stack trace:
Execution hosts: 7 Queued jobs: 0 Avg job duration: 345 secs Avg job wait time: 570 secs Last cluster modification time: 2014-02-28 08:53:05
Execution hosts: 7 Queued jobs: 0 Avg job duration: 345 secs Avg job wait time: 570 secs Last cluster modification time: 2014-02-28 08:53:05
Execution hosts: 6 Queued jobs: 139 Oldest queued job: 2014-02-28 08:57:32+00:00 Avg job duration: 345 secs Avg job wait time: 570 secs Last cluster modification time: 2014-02-28 08:59:03
Execution hosts: 6 Queued jobs: 137 Oldest queued job: 2014-02-28 08:57:32+00:00 Avg job duration: 345 secs Avg job wait time: 570 secs Last cluster modification time: 2014-02-28 08:59:03
Execution hosts: 6 Queued jobs: 134 Oldest queued job: 2014-02-28 08:57:32+00:00 Avg job duration: 345 secs Avg job wait time: 570 secs Last cluster modification time: 2014-02-28 08:59:03
!!! ERROR - ERROR: Number of slots not consistent across cluster