jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

Removenode Stalls (version 0.95.6 with SGE plug-in) #615

Open tmsturtz opened 6 years ago

tmsturtz commented 6 years ago

We have been running into an issue where the loadbalancer removes nodes from SGE but does not terminate them. Additional testing showed that 'removenode' does not work either: it stalls once it reaches the point of removing nodes from the known_hosts files.

We are running version 0.95.6 with the SGE and tagger plug-ins, and we have been having this issue on a 173-node cluster. The only way we can terminate nodes is through the AWS interface. Any suggestions or thoughts on how best to troubleshoot this problem are welcome.