jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

Default behaviour of SGE inhibits the load balancer from shutting down nodes #158

Open scrappythekangaroo opened 11 years ago

scrappythekangaroo commented 11 years ago

It seems that SGE's default scheduler configuration uses "load_formula = np_load_avg" (see qconf -ssconf), which balances jobs across nodes by sending each new job to the least-loaded host.

For example:

  1. My cluster has three nodes up and the queue is empty.
  2. Three new jobs come in. These will most likely be spread across the three nodes, one job per node.
  3. Since all three nodes now have processes running on them, the load balancer cannot shut any of them down, even though the cluster is under-utilised.
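The scenario above can be reproduced with a small simulation. This is only a sketch of the two placement policies, not SGE's actual scheduler: the function names, the "spread" and "fill_up" labels, and the figure of 4 slots per node are all assumptions made for illustration.

```python
def place_jobs(num_jobs, num_nodes, slots_per_node, strategy):
    """Return the number of running jobs on each node after placement."""
    running = [0] * num_nodes
    for _ in range(num_jobs):
        # Only nodes with at least one free slot can accept a job.
        candidates = [i for i in range(num_nodes) if running[i] < slots_per_node]
        if not candidates:
            break  # cluster is full; remaining jobs would wait in the queue
        if strategy == "spread":
            # Least-loaded node first (similar in effect to sorting hosts by load).
            target = min(candidates, key=lambda i: (running[i], i))
        elif strategy == "fill_up":
            # Busiest node that still has a free slot first.
            target = max(candidates, key=lambda i: (running[i], -i))
        else:
            raise ValueError("unknown strategy: %s" % strategy)
        running[target] += 1
    return running


def idle_nodes(running):
    """Indices of nodes with no jobs, i.e. candidates for shutdown."""
    return [i for i, n in enumerate(running) if n == 0]


# Three jobs, three nodes, 4 slots per node (assumed value).
spread = place_jobs(3, 3, 4, "spread")
packed = place_jobs(3, 3, 4, "fill_up")
print("spread: ", spread, "idle nodes:", idle_nodes(spread))
print("fill_up:", packed, "idle nodes:", idle_nodes(packed))
```

With spreading, every node ends up with one job and nothing can be removed. With fill-up placement, all three jobs land on the first node and the other two stay idle, so the load balancer would be free to terminate them.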

I'd suggest modifying the SGE setup to use the "fill up host" configuration described at: http://wiki.gridengine.info/wiki/index.php/StephansBlog

Even better would be to configure SGE to send jobs to the most recently booted node first, so that older nodes drain and can be shut down first (hopefully before their billed hour is up). I'm not yet sure if this is possible.
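To make the idea concrete, here is a rough sketch of the ordering I have in mind. It does not touch SGE at all and says nothing about whether SGE can express this preference. The node names and launch times are made up, and hourly billing measured from launch time is assumed.

```python
from datetime import datetime


def minutes_left_in_hour(launch_time, now):
    """Minutes remaining in the node's current billed hour (assumes hourly billing from launch)."""
    elapsed_seconds = (now - launch_time).total_seconds()
    return 60 - (elapsed_seconds % 3600) / 60


def placement_order(launch_times):
    """Node names ordered newest first, i.e. the order in which jobs would be offered."""
    return sorted(launch_times, key=lambda name: launch_times[name], reverse=True)


# Hypothetical nodes and launch times, for illustration only.
now = datetime(2013, 1, 1, 12, 0)
launch_times = {
    "node001": datetime(2013, 1, 1, 10, 10),
    "node002": datetime(2013, 1, 1, 11, 30),
    "node003": datetime(2013, 1, 1, 11, 55),
}

order = placement_order(launch_times)
print("send jobs to:", order)
for name in reversed(order):  # oldest first: preferred shutdown order
    print(name, "minutes left in billed hour:", minutes_left_in_hour(launch_times[name], now))
```

Jobs would go to node003 first, leaving node001 (the oldest) most likely to be empty when its billed hour runs out.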

scrappythekangaroo commented 11 years ago

Example code that applies the "fill up host" change is here: https://github.com/scrappythekangaroo/StarCluster/commit/fb545951667d4d413305ef8b61c93ce28d9b062f