hpcugent / hanythingondemand

hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs
https://hod.readthedocs.org
GNU General Public License v2.0

Support restarting the cluster slaves #4

Open ehiggs opened 10 years ago

ehiggs commented 10 years ago

Currently, we don't allow users to set the Hadoop configuration options on startup. This is covered by issue #1. Because we don't allow that, users should be able to log in to the nodes, take down the cluster, change settings, and start it back up. This currently fails for two reasons:

  1. The `$HADOOP_CONF_DIR/slaves` file is missing. This is merely a list of hostnames of the nodes running slave tasks, so it should probably exist.
  2. The way that `stop-mapred.sh` and `start-mapred.sh` work is by sshing to each of the slaves and taking down the MapReduce daemons (the JobTracker on the master, the TaskTrackers on the slaves). However, when we ssh into a node we get a fresh login environment, so the job loses track of where the Hadoop scripts are (`$HADOOP_HOME` isn't set).
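For the first point, a minimal sketch of generating the missing `slaves` file from the PBS nodefile (the function name and the "drop the master" policy are assumptions for illustration; `$PBS_NODEFILE` lists one hostname per allocated core, so duplicates need collapsing):

```python
import os

def write_slaves_file(nodefile, conf_dir, master):
    """Write $HADOOP_CONF_DIR/slaves from the PBS nodefile.

    The PBS nodefile lists one hostname per core, so deduplicate while
    preserving order; dropping the master node is a hypothetical policy
    (adjust if the master should also run a TaskTracker).
    """
    seen = []
    with open(nodefile) as fh:
        for line in fh:
            host = line.strip()
            if host and host != master and host not in seen:
                seen.append(host)
    with open(os.path.join(conf_dir, 'slaves'), 'w') as fh:
        fh.write('\n'.join(seen) + '\n')
    return seen
```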

Either we need to find a way to set up the environment so these scripts work, or we should provide our own scripts which do the same thing and let users bounce tasks on their own.
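If we go the "our own scripts" route, one way around the lost environment is to re-export it explicitly in the remote command rather than relying on the login shell. A sketch (the function name and paths are illustrative, not existing hod code):

```python
def remote_hadoop_cmd(host, script, hadoop_home, conf_dir):
    """Build an ssh command list that re-exports the Hadoop environment
    before running a daemon control script, since ssh starts a fresh
    shell in which $HADOOP_HOME is not set."""
    remote = (
        'export HADOOP_HOME=%s; export HADOOP_CONF_DIR=%s; '
        '%s/bin/%s' % (hadoop_home, conf_dir, hadoop_home, script)
    )
    # suitable for subprocess.call(...) on the master node
    return ['ssh', host, remote]
```

Running this for every host in the `slaves` file would mimic what `stop-mapred.sh` does, but with the environment passed along explicitly.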

stdweird commented 10 years ago

The mpi service should be "callable" from the outside somehow, to coordinate the stopping/restarting of some or all Hadoop services. The same mechanism can then be reused to restart said services.

Forced stopping can already be done via `force_fn = os.path.join(self.controldir, 'force_stop')`; similar "primitive" control files can be used to stop-and-wait and to restart without reconfiguring.
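A minimal sketch of that control-file primitive, generalised to several actions (the helper names and the `restart` action are assumptions; only `force_stop` is mentioned above):

```python
import os
import time

def request(controldir, action):
    """Client side: touch a control file, e.g. 'force_stop' or the
    hypothetical 'restart', inside the shared control directory."""
    open(os.path.join(controldir, action), 'w').close()

def poll_control(controldir, actions, interval=1.0, timeout=None):
    """Service side: wait until one of the control files appears,
    consume it, and return the action name (None on timeout)."""
    start = time.time()
    while True:
        for action in actions:
            path = os.path.join(controldir, action)
            if os.path.exists(path):
                os.remove(path)  # consume so the action fires once
                return action
        if timeout is not None and time.time() - start > timeout:
            return None
        time.sleep(interval)
```

Because the control directory is on a shared filesystem, any node (or the user from a login node) can trigger an action without needing to reach the service over the network.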