jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

Load Balancing Error: Number of slots not consistent across cluster #374

Open wcolen opened 10 years ago

wcolen commented 10 years ago

Hi! I would like to report this load balancing error. It causes some nodes to be kept alive without jobs.

Thank you!

Stack trace:

Execution hosts: 7
Queued jobs: 0
Avg job duration: 345 secs
Avg job wait time: 570 secs
Last cluster modification time: 2014-02-28 08:53:05

Not adding nodes: at or above minimum nodes and no queued jobs...
Cluster was modified less than 180 seconds ago
Waiting for cluster to stabilize...
Sleeping...(looping again in 60 secs)

Execution hosts: 7
Queued jobs: 0
Avg job duration: 345 secs
Avg job wait time: 570 secs
Last cluster modification time: 2014-02-28 08:53:05

Not adding nodes: at or above minimum nodes and no queued jobs...
Looking for nodes to remove...
Idle node node007 (i-a2876381) has been up for 55 minutes past the hour
Idle node node008 (i-7884605b) has been up for 49 minutes past the hour
Idle node node010 (i-05fc1826) has been up for 58 minutes past the hour
Idle node node012 (i-81fb1fa2) has been up for 45 minutes past the hour
*** WARNING - Removing node007: i-a2876381 (ec2-54-81-87-184.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node007 from SGE
Updating SGE parallel environment 'orte'
6/6 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Adding parallel environment 'orte' to queue 'all.q'
Running plugin starcluster.clustersetup.DefaultClusterSetup
Removing node node007 (i-a2876381)...
Removing node007 from known_hosts files
Removing node007 from /etc/hosts
Removing node007 from NFS
Terminating node: node007 (i-a2876381)
*** WARNING - Removing node008: i-7884605b (ec2-54-81-38-187.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node008 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node008
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
    force=force)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
    reverse=True)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
    func(*args)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
    self._remove_from_sge(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node008' failed with status 1:
Host object "node008" is still referenced in cluster queue "all.q".
*** WARNING - Removing node010: i-05fc1826 (ec2-54-80-69-227.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node010 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node010
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
    force=force)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
    reverse=True)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
    func(*args)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
    self._remove_from_sge(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node010' failed with status 1:
Host object "node010" is still referenced in cluster queue "all.q".
*** WARNING - Removing node012: i-81fb1fa2 (ec2-54-80-193-84.compute-1.amazonaws.com)
Running plugin starcluster.plugins.sge.SGEPlugin
Removing node012 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node012
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/balancers/sge/__init__.py", line 745, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1049, in remove_node
    force=force)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1075, in remove_nodes
    reverse=True)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1694, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/cluster.py", line 1719, in run_plugin
    func(*args)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 172, in on_remove_node
    self._remove_from_sge(node)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/plugins/sge.py", line 134, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/Library/Python/2.7/site-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node012' failed with status 1:
Host object "node012" is still referenced in cluster queue "all.q".
Sleeping...(looping again in 60 secs)

Execution hosts: 6
Queued jobs: 139
Oldest queued job: 2014-02-28 08:57:32+00:00
Avg job duration: 345 secs
Avg job wait time: 570 secs
Last cluster modification time: 2014-02-28 08:59:03

Cluster was modified less than 180 seconds ago
Waiting for cluster to stabilize...
Sleeping...(looping again in 60 secs)

Execution hosts: 6
Queued jobs: 137
Oldest queued job: 2014-02-28 08:57:32+00:00
Avg job duration: 345 secs
Avg job wait time: 570 secs
Last cluster modification time: 2014-02-28 08:59:03

Cluster was modified less than 180 seconds ago
Waiting for cluster to stabilize...
Sleeping...(looping again in 60 secs)

Execution hosts: 6
Queued jobs: 134
Oldest queued job: 2014-02-28 08:57:32+00:00
Avg job duration: 345 secs
Avg job wait time: 570 secs
Last cluster modification time: 2014-02-28 08:59:03
!!! ERROR - ERROR: Number of slots not consistent across cluster
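
For anyone trying to clean up by hand: the `qconf -de` failures in the log say the host is still referenced in `all.q`. Below is a minimal sketch, assuming the usual Grid Engine ordering in which a host is detached from its host group and queue before the execution-host object is deleted. The helper name `removal_commands` and the `@allhosts` host group default are illustrative assumptions, not StarCluster's actual code, and this sequence has not been tested against the cluster in this report. The function only builds command strings; it does not run anything.

```python
def removal_commands(alias, queue="all.q", hostgroup="@allhosts"):
    """Build, in order, the qconf commands for detaching one execution host.

    Assumption: the queue references the host through the given host group
    and through per-host slot overrides. If the queue lists the host
    directly in its hostlist, a different -dattr call would be needed.
    """
    return [
        # 1. Drop the host from the host group used by the queue.
        "qconf -dattr hostgroup hostlist %s %s" % (alias, hostgroup),
        # 2. Purge per-host slot overrides for this queue instance.
        "qconf -purge queue slots %s@%s" % (queue, alias),
        # 3. Only now delete the execution-host object itself.
        "qconf -de %s" % alias,
    ]


if __name__ == "__main__":
    for cmd in removal_commands("node008"):
        print(cmd)
```

The point of the ordering is that `qconf -de` comes last, after nothing in the queue configuration refers to the host any more.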

nathanieldchu commented 6 years ago

I've also run into this. Oddly, it happened with the same config file and settings as a cluster that worked perfectly fine a few weeks ago.
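
One way to narrow this down is to compare the hosts the queue has slots for against the current execution hosts. The sketch below is a hypothetical diagnostic, not the check StarCluster performs: `check_slots` and its inputs are made-up names, and the two inputs are assumed to be collected by hand (for example from `qconf -sq all.q` and `qconf -sel` on the master).

```python
def check_slots(slots_by_host, exec_hosts):
    """Compare per-host queue slots with the list of execution hosts.

    slots_by_host: dict mapping host alias -> slots configured in the queue
    exec_hosts:    iterable of aliases currently registered as exec hosts
    """
    configured = set(slots_by_host)
    registered = set(exec_hosts)
    # Hosts the queue still refers to although they are no longer exec hosts.
    stale = sorted(configured - registered)
    # Exec hosts for which the queue has no slot entry at all.
    missing = sorted(registered - configured)
    # Distinct slot counts among hosts that appear in both inputs.
    counts = sorted({slots_by_host[h] for h in registered & configured})
    return {
        "stale": stale,
        "missing": missing,
        "slot_counts": counts,
        "consistent": not stale and not missing and len(counts) <= 1,
    }


if __name__ == "__main__":
    report = check_slots({"node001": 2, "node008": 2}, ["node001"])
    print(report)
```

A non-empty `stale` list would match the situation in the original report, where node removal failed partway and the queue kept references to hosts the balancer no longer counted.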