grycap / clues

CLUES: an energy management system for HPC Clusters and Cloud infrastructures.
http://www.grycap.upv.es/clues
GNU General Public License v3.0

CLUES

CLUES is an energy management system for High Performance Computing (HPC) Clusters and Cloud infrastructures. Its main function is to power off internal cluster nodes when they are not being used, and conversely to power them on when they are needed. The CLUES system integrates with the cluster management middleware, such as a batch-queuing system or a cloud infrastructure management system, by means of different connectors.

CLUES also integrates with the physical infrastructure by means of different plug-ins, so that nodes can be powered on/off using the techniques that best suit each particular infrastructure (e.g. Wake-on-LAN, the Intelligent Platform Management Interface (IPMI) or Power Distribution Units (PDUs)).

Although some batch-queuing systems provide energy saving mechanisms, some of the most popular choices, such as Torque/PBS, lack this capability. As for cloud infrastructure management middleware, none of the usual options for scientific environments provide similar features. The additional advantage of the approach taken by CLUES is that it can be integrated with virtually any resource manager, whether or not the manager provides energy saving features.

Installing

To install CLUES, follow these steps:

Prerequisites

You need a Python interpreter and the easy_install command-line tool. On Ubuntu, you can install them with:

$ apt-get -y install python python-setuptools

Git is also needed in order to get the source code:

$ apt-get -y install git

Now you need to install the cpyutils package:

$ git clone https://github.com/grycap/cpyutils
$ mv cpyutils /opt
$ cd /opt/cpyutils
$ python setup.py install --record installed-files.txt

If you want, you can safely remove the /opt/cpyutils folder afterwards, but it is recommended to keep the installed-files.txt file so that cpyutils can be uninstalled later.
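Because the install was recorded, uninstalling later amounts to deleting every recorded path. A minimal sketch (the helper name is ours, not part of cpyutils or CLUES):

```shell
# uninstall_recorded FILE: remove every path listed in FILE, i.e. the
# record produced by "python setup.py install --record FILE".
uninstall_recorded() {
  while IFS= read -r path; do
    rm -f "$path"        # -f: ignore paths that are already gone
  done < "$1"
}
```

For example, `uninstall_recorded /opt/cpyutils/installed-files.txt` would remove the files installed above.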

Finally, you need to install two Python modules, ply and web.py:

$ easy_install ply web.py

Installing CLUES

First of all, you need to get the CLUES source code and install it:

$ git clone https://github.com/grycap/clues
$ mv clues /opt
$ cd /opt/clues
$ python setup.py install --record installed-files.txt

If you want, you can safely remove the /opt/clues folder afterwards, but it is recommended to keep the installed-files.txt file so that CLUES can be uninstalled later.

Now you must configure CLUES, as it will not work without a valid configuration.

Configuring CLUES

You need an /etc/clues2/clues2.cfg file. You can start from the provided template:

$ cd /etc/clues2
$ cp clues2.cfg-example clues2.cfg

Now you can edit the /etc/clues2/clues2.cfg and adjust its parameters for your specific deployment.

The most important parameters that you MUST adjust are LRMS_CLASS, POWERMANAGER_CLASS and SCHEDULER_CLASSES.

For the LRMS_CLASS you have different options available (you MUST state one and only one of them):

For the POWERMANAGER_CLASS you have different options available (you MUST state one and only one of them):

Finally, you should state the CLUES schedulers that you want to use. SCHEDULER_CLASSES is a comma-separated, ordered list; the schedulers are called in the order in which they are stated.

For the SCHEDULER_CLASSES parameter you have the following options available:

Each of the LRMS, POWERMANAGER and SCHEDULER classes has its own options that should be properly configured.

Example configuration with SLURM

In this example we integrate CLUES into a working SLURM 16.05.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the next steps we configure CLUES to monitor the SLURM deployment and to intercept the requests for new jobs issued through sbatch.

First, we must set the proper values in /etc/clues2/clues2.cfg. The most important ones are:

[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.slurm
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60

Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the SLURM and IPMI plugins, creating the proper files:

$ cd /etc/clues2/conf.d/
$ cp plugin-slurm.cfg-example plugin-slurm.cfg         
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg         

You should check that the variables in the /etc/clues2/conf.d/plugin-slurm.cfg file match your platform, although the default values may suit you. The expected settings include the commands used to get information about the nodes, queues, jobs, etc.

In /etc/clues2/conf.d/plugin-ipmi.cfg we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.

[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off

The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the node names known to the LRMS, using the well-known /etc/hosts file format. In the following example, the first column is the IPMI IP address and the second column is the node name:

192.168.1.100   niebla01
192.168.1.102   niebla02
192.168.1.103   niebla03
192.168.1.104   niebla04
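For illustration, the IPMI address recorded for a given node can be looked up from such a file with a one-line awk filter (the helper name is ours):

```shell
# ipmi_addr FILE NAME: print the IPMI IP address recorded for NAME in an
# /etc/hosts-style file such as ipmi.hosts (first column IP, second name).
ipmi_addr() {
  awk -v node="$2" '$2 == node { print $1 }' "$1"
}
```

With the example file above, `ipmi_addr /etc/clues2/ipmi.hosts niebla02` prints 192.168.1.102.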

Then you should adjust the command lines for powering the working nodes on and off using IPMI. The default configuration uses the common ipmitool utility with a passwordless connection to the IPMI interface. In the command lines, %%a is substituted by the IP address and %%h by the hostname.
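For instance, if your BMCs require credentials, the command lines might look like the following; the interface name (lanplus) and the admin/secret credentials are placeholders to adapt to your hardware (ipmitool's -U and -P flags select the username and password):

```
[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lanplus -H %%a -U admin -P secret power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lanplus -H %%a -U admin -P secret power off
```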

The SLURM add-on is based on replacing the sbatch command with the CLUES sbatch wrapper, which checks whether new nodes are needed and then calls the original SLURM sbatch command to queue the job. To set it up, rename the original sbatch command to sbatch.o and then copy the CLUES wrapper in its place:

# In the case of debian based distributions (e.g. ubuntu)
mv /usr/local/bin/sbatch /usr/local/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/local/bin/sbatch

# In the case of red-hat based distributions (e.g. fedora, scientific linux)
mv /usr/bin/sbatch /usr/bin/sbatch.o
cp /usr/local/bin/clues-slurm-wrapper /usr/bin/sbatch

Take into account that the users who are allowed to run sbatch must also be able to read the CLUES configuration.
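The swap performed above can be expressed as a small helper, shown here only to make the pattern explicit for either distro layout (the function is ours, not shipped with CLUES):

```shell
# wrap_command ORIGINAL WRAPPER: keep ORIGINAL under a .o suffix and put
# WRAPPER in its place, as the sbatch instructions above do by hand.
wrap_command() {
  mv "$1" "$1.o"   # preserve the original command as NAME.o
  cp "$2" "$1"     # the wrapper now answers to the original name
}
```

For example: `wrap_command /usr/bin/sbatch /usr/local/bin/clues-slurm-wrapper`.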

Example configuration with ONE

In this example we integrate CLUES into an OpenNebula (ONE) 4.8 deployment, which is prepared to power the working nodes on and off using IPMI. In the next steps we configure CLUES to monitor the ONE deployment and to intercept the requests for new VMs.

First, we must set the proper values in /etc/clues2/clues2.cfg. The most important ones are:

[general]
CONFIG_DIR=conf.d
LRMS_CLASS=cluesplugins.one
POWERMANAGER_CLASS=cluesplugins.ipmi
MAX_WAIT_POWERON=300
...
[monitoring]
COOLDOWN_SERVED_REQUESTS=300
...
[scheduling]
SCHEDULER_CLASSES=clueslib.schedulers.CLUES_Scheduler_PowOn_Requests, clueslib.schedulers.CLUES_Scheduler_Reconsider_Jobs, clueslib.schedulers.CLUES_Scheduler_PowOff_IDLE, clueslib.schedulers.CLUES_Scheduler_PowOn_Free
IDLE_TIME=600
RECONSIDER_JOB_TIME=600
EXTRA_SLOTS_FREE=0
EXTRA_NODES_PERIOD=60

Once this file is configured, we can use the templates in the /etc/clues2/conf.d folder to configure the ONE and IPMI plugins, creating the proper files:

$ cd /etc/clues2/conf.d/
$ cp plugin-one.cfg-example plugin-one.cfg         
$ cp plugin-ipmi.cfg-example plugin-ipmi.cfg         

In /etc/clues2/conf.d/plugin-one.cfg we should check the variables ONE_XMLRPC and ONE_AUTH, and set them to the proper values for your deployment. The credentials in the ONE_AUTH variable should belong to a user in the oneadmin group (you can use the oneadmin user or create a new one in ONE).

[ONE LRMS]
ONE_XMLRPC=http://localhost:2633/RPC2
ONE_AUTH=clues:cluespass

In /etc/clues2/conf.d/plugin-ipmi.cfg we should check the variables IPMI_HOSTS_FILE, IPMI_CMDLINE_POWON and IPMI_CMDLINE_POWOFF, and set them to the proper values for your deployment.

[IPMI]
IPMI_HOSTS_FILE=ipmi.hosts
IPMI_CMDLINE_POWON=/usr/bin/ipmitool -I lan -H %%a -P "" power on
IPMI_CMDLINE_POWOFF=/usr/bin/ipmitool -I lan -H %%a -P "" power off

The ipmi.hosts file should be located in the /etc/clues2/ folder and contains the correspondence between the IPMI IP addresses and the names of the hosts as they appear in ONE, using the well-known /etc/hosts file format. In the following example, the first column is the IPMI IP address and the second column is the host name as it appears in ONE:

192.168.1.100   niebla01
192.168.1.102   niebla02
192.168.1.103   niebla03
192.168.1.104   niebla04

Then you should adjust the command lines for powering the working nodes on and off using IPMI. The default configuration uses the common ipmitool utility with a passwordless connection to the IPMI interface. In the command lines, %%a is substituted by the IP address and %%h by the hostname.

Hooks system

The hooks mechanism of CLUES makes it possible to call specific applications when different events happen in the system, e.g. when a node is powered on or off. One immediate application is to send an e-mail to the administrator when a node has failed to power on.
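As a sketch of that e-mail use case, a hook could compose an alert message like the one below. How the node name reaches the script and how the message is delivered depend on your hook configuration and mailer, so both are assumptions here:

```shell
# notify_poweron_failure NODE: build the alert text a power-on-failure
# hook might send to the administrator; actually mailing it (e.g. with
# mail(1) or sendmail) is left to the site.
notify_poweron_failure() {
  printf 'CLUES alert: node %s failed to power on\n' "$1"
}
```

A hook script could then pipe it to a mailer, e.g. `notify_poweron_failure niebla03 | mail -s "CLUES alert" admin@example.com` (assuming a working mail command; the address is a placeholder).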

Hooks are custom external scripts (or applications) that are executed when certain events happen. CLUES allows the following hooks to be defined:

Reports

CLUES includes a report generator created to help you monitor your infrastructure from the CLUES point of view.

The reports that CLUES generates provide the following information:

CLUES provides reports in the form of web pages, so you will need a browser to open them. Once opened, the reports web page looks like the following one:

The CLUES reports web page

Refer to the Reports documentation to get more information about how to create the reports.

Troubleshooting

You can find information in the CLUES log file (i.e. /var/log/clues2/clues2.log). You can also set LOG_FILE to an empty value in the /etc/clues2/clues2.cfg file and execute CLUES in the foreground as

$ /usr/bin/python /usr/local/bin/cluesserver

In the logging information you can find useful messages for debugging what is happening. Here we highlight some common issues.

Wrong ONE configuration

Some messages like

[DEBUG] 2015-06-18 09:41:57,551 could not contact to the ONE server
[WARNING] 2015-06-18 09:41:57,551 an error occurred when monitoring hosts (could not get information from ONE; please check ONE_XMLRPC and ONE_AUTH vars)

usually mean that either the URL pointed to by ONE_XMLRPC is wrong (or not reachable) or the credentials in ONE_AUTH do not have enough privileges.

In a distributed configuration, the ONE server may also not be reachable from outside the localhost.
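A quick way to tell these situations apart is to probe the endpoint from the host where CLUES runs. This sketch assumes curl is available and that the URL matches your ONE_XMLRPC value:

```shell
# probe_url URL: print "reachable" if any HTTP answer comes back from
# URL within 5 seconds, "unreachable" otherwise (curl exits non-zero on
# connection failures, but zero on any HTTP status).
probe_url() {
  if curl -s -o /dev/null --max-time 5 "$1"; then
    echo reachable
  else
    echo unreachable
  fi
}
```

For example: `probe_url http://localhost:2633/RPC2`. An "unreachable" result points at the URL or the network; a "reachable" one points at the ONE_AUTH credentials.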

Lack of permission

When using the client, a message like the following

$ clues status
Could not get the status of CLUES (Error checking the secret key. Please check the configuration file and the CLUES_SECRET_TOKEN setting)

is usually a symptom that the CLUES command line does not have permission to read clues2.cfg. Please check that the users are able to read the CLUES configuration.
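To check this usual culprit, verify that the configuration file grants read permission beyond root. This helper assumes GNU stat (Linux):

```shell
# readable_by_all FILE: print "yes" when FILE grants read permission to
# "other" users -- a quick proxy for "any local user can read it".
readable_by_all() {
  perms=$(stat -c %A "$1" 2>/dev/null) || { echo no; return; }
  case "$perms" in
    ???????r??) echo yes ;;   # the 8th character is the other-read bit
    *)          echo no  ;;
  esac
}
```

A "no" from `readable_by_all /etc/clues2/clues2.cfg` explains the error above; widening the permissions (e.g. with chmod) fixes it, but keep in mind any secrets the file holds, such as CLUES_SECRET_TOKEN.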