CLUES is an energy management system for High Performance Computing (HPC) clusters and Cloud infrastructures. The main function of the system is to power off internal cluster nodes when they are not being used, and conversely to power them on when they are needed. The CLUES system integrates with the cluster management middleware, such as a batch-queuing system or a cloud infrastructure management system, by means of different connectors.
CLUES also integrates with the physical infrastructure by means of different plug-ins, so that nodes can be powered on/off using the techniques which best suit each particular infrastructure (e.g. using wake-on-LAN, Intelligent Platform Management Interface (IPMI) or Power Device Units, PDU).
Although there exist some batch-queuing systems that provide energy saving mechanisms, some of the most popular choices, such as Torque/PBS, lack this possibility. As far as cloud infrastructure management middleware is concerned, none of the most usual options for scientific environments provide similar features. The additional advantage of the approach taken by CLUES is that it can be integrated with virtually any resource manager, whether or not the manager provides energy saving features.
In order to install CLUES, follow these steps:
You need a Python interpreter and the easy_install command-line tool. In Ubuntu, you can install them with:
$ apt-get -y install python python-setuptools
Git is also needed in order to get the source code:
$ apt-get -y install git
Now you need to install cpyutils:
$ git clone https://github.com/grycap/cpyutils
$ mv cpyutils /opt
$ cd /opt/cpyutils
$ python setup.py install --record installed-files.txt
If you want, you can safely remove the /opt/cpyutils folder afterwards, but it is recommended to keep the installed-files.txt file so that cpyutils can be uninstalled later.
Finally, you need to install two Python modules, ply and web.py:
$ easy_install ply web.py
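You can quickly check that both modules were installed correctly by importing them (ply and web are the import names of these two packages):
$ python -c "import ply, web; print('OK')"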
First of all, you need to get the CLUES source code and then install it:
$ git clone https://github.com/grycap/clues
$ mv clues /opt
$ cd /opt/clues
$ python setup.py install --record installed-files.txt
If you want, you can safely remove the /opt/clues folder afterwards, but it is recommended to keep the installed-files.txt file so that CLUES can be uninstalled later.
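For example, assuming you kept the /opt/clues folder (or at least the installed-files.txt file generated during the installation), CLUES could later be uninstalled by removing the recorded files; a minimal sketch:
$ cd /opt/clues
$ xargs rm -rf < installed-files.txt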
Now you must configure CLUES, as it will not work without a valid configuration.
You need a /etc/clues2/clues2.cfg file. You can start from the provided template:
$ cd /etc/clues2
$ cp clues2.cfg-example clues2.cfg
Now you can edit /etc/clues2/clues2.cfg and adjust its parameters for your specific deployment. The most important parameters that you MUST adjust are LRMS_CLASS, POWERMANAGER_CLASS and SCHEDULER_CLASSES.
For the LRMS_CLASS you have different options available (you MUST state one and only one of them):
For the POWERMANAGER_CLASS you have different options available (you MUST state one and only one of them):
Finally, you should state the CLUES schedulers that you want to use. It is a comma-separated, ordered list in which the schedulers are called in the same order in which they are stated. For the SCHEDULER_CLASSES parameter you have the following options available:
Each of the LRMS, POWERMANAGER and SCHEDULER plugins has its own options that should be properly configured.
In this example we are integrating CLUES into a working SLURM 16.05.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the next steps we configure CLUES to monitor the SLURM deployment and to intercept the requests for new jobs submitted using sbatch.
First, we must set the proper values in /etc/clues2/clues2.cfg. The most important values are:
[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.slurm
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60
Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the SLURM and IPMI plugins, so we create the proper files:
$ cd /etc/clues2/conf.d/
$ cp plugin-slurm.cfg-example plugin-slurm.cfg
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg
You should check that the variables in the /etc/clues2/conf.d/plugin-slurm.cfg file match your platform, although the default values may be suitable for you. These variables cover how the information about nodes, queues, jobs, etc. is obtained from SLURM.
In the /etc/clues2/conf.d/plugin-ipmi.cfg file we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.
[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off
The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the names of the hosts as they appear in SLURM, using the well-known /etc/hosts file format. An example of this file is shown below, where the first column is the IPMI IP address and the second column is the name of the host as it appears in SLURM.
192.168.1.100 niebla01
192.168.1.102 niebla02
192.168.1.103 niebla03
192.168.1.104 niebla04
Then you should adjust the command lines for powering the working nodes on and off using IPMI. In the default configuration we use the common ipmitool tool and a passwordless connection to the IPMI interface. To adjust the command line you can use %%a to substitute the IP address and %%h to substitute the hostname.
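Before letting CLUES switch nodes on and off, it may be worth testing the resulting command line manually against one of the addresses listed in ipmi.hosts (power status is used here so that no node is actually switched off; 192.168.1.100 is taken from the example above):
$ /usr/bin/ipmitool -I lan -H 192.168.1.100 -P "" power status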
The SLURM addon is based on substituting the sbatch command with the CLUES sbatch wrapper, which checks whether new nodes are needed and then calls the original SLURM sbatch command to queue the jobs. To set it up, rename the original sbatch command to sbatch.o and then copy the CLUES wrapper in its place:
# In the case of debian based distributions (e.g. ubuntu)
mv /usr/local/bin/sbatch /usr/local/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/local/bin/sbatch
# In the case of red-hat based distributions (e.g. fedora, scientific linux)
mv /usr/bin/sbatch /usr/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/bin/sbatch
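To verify that the wrapper is in place, you can check the binaries and submit a trivial test job (this assumes a default partition is configured in SLURM; if no node is available, CLUES should request a power-on):
$ ls -l $(which sbatch) $(which sbatch.o)
$ sbatch --wrap="hostname"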
Take into account that the users who are allowed to use sbatch must also be able to read the CLUES configuration.
In this example we are integrating CLUES into an OpenNebula 4.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the next steps we configure CLUES to monitor the ONE deployment and to intercept the requests for new VMs.
First, we must set the proper values in /etc/clues2/clues2.cfg. The most important values are:
[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.one
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60
Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the ONE and IPMI plugins, so we create the proper files:
$ cd /etc/clues2/conf.d/
$ cp plugin-one.cfg-example plugin-one.cfg
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg
In the /etc/clues2/conf.d/plugin-one.cfg file we should check the variables ONE_XMLRPC and ONE_AUTH, and set them to the proper values for your deployment. The credentials in the ONE_AUTH variable should belong to a user in the oneadmin group (you can use the oneadmin user or create a new one in ONE).
[ONE LRMS]
ONE_XMLRPC=http://localhost:2633/RPC2
ONE_AUTH=clues:cluespass
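If you prefer not to use the oneadmin credentials, a dedicated user for CLUES can be created with the standard OpenNebula CLI (the clues/cluespass credentials below simply match the example above and are placeholders; adapt them to your deployment):
$ oneuser create clues cluespass
$ oneuser chgrp clues oneadmin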
In the /etc/clues2/conf.d/plugin-ipmi.cfg file we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.
[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off
The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the names of the hosts that appear in ONE, using the well-known /etc/hosts file format. An example of this file is shown below, where the first column is the IPMI IP address and the second column is the name of the host as it appears in ONE.
192.168.1.100 niebla01
192.168.1.102 niebla02
192.168.1.103 niebla03
192.168.1.104 niebla04
Then you should adjust the command lines for powering the working nodes on and off using IPMI. In the default configuration we use the common ipmitool tool and a passwordless connection to the IPMI interface. To adjust the command line you can use %%a to substitute the IP address and %%h to substitute the hostname.
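Since the names in ipmi.hosts must match the host names registered in ONE, it may be useful to cross-check both lists (assuming the standard OpenNebula CLI is available on the front-end):
$ onehost list
$ cat /etc/clues2/ipmi.hosts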
The hooks mechanism of CLUES makes it possible to call specific applications when different events happen in the system (e.g. when a node is powered on or off). One immediate application of this mechanism is to send an e-mail to the administrator when a node has failed to power on.
Hooks are custom external scripts (or applications) that are executed when certain events happen. CLUES allows the following hooks to be defined:
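As an illustrative sketch of what such a script could look like (the argument convention and the administrator address below are assumptions, not part of CLUES), a hook that e-mails the administrator when powering on a node fails might be:
#!/bin/bash
# Hypothetical power-on failure hook: the node name is assumed to be passed
# as the first argument by CLUES.
NODE="$1"
echo "CLUES could not power on node ${NODE} at $(date)" | \
  mail -s "[CLUES] power-on failure on ${NODE}" admin@example.com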
CLUES has a report generator that has been created to help monitor your infrastructure from the point of view of CLUES.
The reports generated by CLUES provide the following information:
CLUES provides the reports in the form of web pages, so you will need a browser to open them. Once opened, the reports web page will look like the following one:
Refer to the Reports documentation to get more information about how to create the reports.
You can get information in the CLUES log file (i.e. /var/log/clues2/clues2.log). But you can also set LOG_FILE to an empty value in the /etc/clues2/clues2.cfg file and execute CLUES in the foreground as:
$ /usr/bin/python /usr/local/bin/cluesserver
In the logging information you can find useful messages to debug what is happening. Here we highlight some common issues.
Some messages like
[DEBUG] 2015-06-18 09:41:57,551 could not contact to the ONE server
[WARNING] 2015-06-18 09:41:57,551 an error occurred when monitoring hosts (could not get information from ONE; please check ONE_XMLRPC and ONE_AUTH vars)
usually mean that either the URL pointed to by ONE_XMLRPC is wrong (or not reachable) or that the credentials in ONE_AUTH do not have enough privileges.
In a distributed configuration, the ONE server may not be reachable from outside the localhost.
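A quick way to check basic connectivity is to verify that the ONE front-end is listening on the XML-RPC port (2633 is the default port used in the example configuration above; replace localhost with your front-end host if CLUES runs elsewhere):
$ nc -zv localhost 2633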
When using the client, a message like the following one
$ clues status
Could not get the status of CLUES (Error checking the secret key. Please check the configuration file and the CLUES_SECRET_TOKEN setting)
is usually a symptom that the CLUES command line does not have permission to read clues2.cfg. Please check that the users are able to read the CLUES configuration.
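One possible way to diagnose and fix this is to inspect the file permissions and grant read access to the affected users. Since clues2.cfg contains the CLUES_SECRET_TOKEN, a dedicated group (the clues group below is just an example) may be preferable to making the file world-readable:
$ ls -l /etc/clues2/clues2.cfg
$ chgrp clues /etc/clues2/clues2.cfg
$ chmod 640 /etc/clues2/clues2.cfg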