StackHPC Slurm Appliance

This repository contains playbooks and configuration to define a Slurm-based HPC environment.

The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us!

While it is tested on OpenStack it should work on any cloud, except that the node rebuild/reimaging features are currently OpenStack-specific.

Prerequisites

It is recommended to check the prerequisites for your deployment host and target cloud before starting.

Installation on deployment host

These instructions assume the deployment host is running Rocky Linux 8:

sudo yum install -y git python38
git clone https://github.com/stackhpc/ansible-slurm-appliance
cd ansible-slurm-appliance
/usr/bin/python3.8 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt
# Install ansible dependencies ...
ansible-galaxy role install -r requirements.yml -p ansible/roles
ansible-galaxy collection install -r requirements.yml -p ansible/collections # ignore the path warning here
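
If the installation succeeded, the required tools should now be available in the activated venv. A quick sanity check (this assumes cookiecutter is installed via requirements.txt, as used in the environment-creation steps below):

ansible --version
ansible-galaxy --version
cookiecutter --version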

Overview of directory structure

Environments

Overview

An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in environments/, containing the Ansible inventory for that environment plus any deployment automation and other configuration required.

All environments load the inventory from the common environment first, with the environment-specific inventory then overriding parts of this as required.
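
As an illustrative sketch of this layering (the exact path is an assumption - check your environment's generated ansible.cfg), the environment's ansible.cfg can list the common inventory before the environment-specific one, so the latter overrides where both define a variable:

# environments/<environment>/ansible.cfg - hypothetical sketch
[defaults]
inventory = ../common/inventory,inventory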

Creating a new environment

This repo contains a cookiecutter template which can be used to create a new environment from scratch. Run through the installation on deployment host instructions above, then from the repo root run:

. venv/bin/activate
cd environments
cookiecutter skeleton

and follow the prompts to supply the environment name and description.

Alternatively, you could copy an existing environment directory.

Now add deployment automation if required, and then complete the environment-specific inventory as described below.

Environment-specific inventory structure

The Ansible inventory for the environment is in environments/<environment>/inventory/. Although most of the inventory uses a group-based convention, there are a few special cases.
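
As a purely hypothetical illustration (the hostnames and group names here are invented, not prescribed by the appliance), a minimal hosts file grouping nodes by role might look like:

# environments/<environment>/inventory/hosts - hypothetical example
[control]
mycluster-control

[login]
mycluster-login-0

[compute]
mycluster-compute-[0:1]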

Creating a Slurm appliance

NB: This section describes generic instructions - check for any environment-specific instructions in environments/<environment>/README.md before starting.

  1. Activate the environment - this must be done before any other commands are run:

    source environments/<environment>/activate
  2. Deploy instances - see environment-specific instructions.

  3. Generate passwords:

    ansible-playbook ansible/adhoc/generate-passwords.yml

    This will write a set of passwords to environments/<environment>/inventory/group_vars/all/secrets.yml. It is recommended that these are encrypted and then committed to git using:

    ansible-vault encrypt inventory/group_vars/all/secrets.yml

    See the Ansible vault documentation for more details.

  4. Deploy the appliance:

    ansible-playbook ansible/site.yml

    or if you have encrypted secrets use:

    ansible-playbook ansible/site.yml --ask-vault-password

    Tags defined in the various sub-playbooks in ansible/ may be used to run only part of the site tasks - see the examples after this list.

  5. "Utility" playbooks for managing a running appliance are contained in ansible/adhoc - run these by activating the environment and using:

    ansible-playbook ansible/adhoc/<playbook name>

    Currently they include the following (see each playbook for links to documentation):

    • hpctests.yml: MPI-based cluster tests for latency, bandwidth and floating point performance.
    • rebuild.yml: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster - see ansible/roles/rebuild for that).
    • restart-slurm.yml: Restart all Slurm daemons in the correct order.
    • update-packages.yml: Update specified packages on cluster nodes.
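
For example, to see which tags site.yml offers, run a tagged subset (the tag name below is a placeholder, not necessarily one the appliance defines), or invoke one of the adhoc playbooks listed above:

ansible-playbook ansible/site.yml --list-tags       # list available tags without running anything
ansible-playbook ansible/site.yml --tags <some_tag> # run only the tasks matching that tag
ansible-playbook ansible/adhoc/hpctests.yml         # e.g. run the MPI-based cluster tests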

Adding new functionality

Please contact us for specific advice, but in outline this generally involves adding or extending roles and playbooks under ansible/ and mapping them to hosts via the environment inventory.

Monitoring and logging

Please see monitoring-and-logging.README.md for details.