League of Robots

CI status: the develop and master branches are built on CircleCI.

About this repo

This repository contains playbooks and documentation to deploy stacks of virtual machines working together. Most of these stacks are virtual Linux HPC clusters, which can be used as collaborative, analytical sandboxes. All production clusters were named after robots that appear in the animated sitcom Futurama. Test/development clusters were named after other robots.

Software/framework ingredients

The main ingredients for (deploying) these clusters:

Branches and Releases

The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests. Once in a while we create releases, which are versioned using the format YY.MM.v, where YY is the year, MM is the month and v is a sequence number for releases within that month.

E.g. 19.01.1 is the first release in January 2019.

Code style and naming conventions

We follow the Python PEP8 naming conventions for variable names, function names, etc.

Clusters

This repo currently contains code and configs for the following clusters:

Deployment and functional administration of all clusters is a joint effort of the Genomics Coordination Center (GCC) and the Center for Information Technology (CIT) from the University Medical Center Groningen and the University of Groningen, in collaboration with the ELIXIR compute platform, EXCELERATE, EU-Solve-RD, the European Joint Programme on Rare Diseases and the CORBEL projects.

Cluster components

The clusters are composed of the following types of machines:

The clusters use the following types of storage systems / folders:

| Filesystem/Folder | Shared/Local | Backups | Mounted on | Purpose/Features |
|---|---|---|---|---|
| /home/${home}/ | Shared | Yes | UIs, DAIs, SAIs, CNs | Only for personal preferences: small data == tiny quota. |
| /groups/${group}/prm[0-9]/ | Shared | Yes | UIs, DAIs | permanent storage folders: for rawdata or final results that need to be stored for the mid/long term. |
| /groups/${group}/tmp[0-9]/ | Shared | No | UIs, DAIs, CNs | temporary storage folders: for staged rawdata and intermediate results on compute nodes that only need to be stored for the short term. |
| /groups/${group}/scr[0-9]/ | Local | No | Some UIs | scratch storage folders: same as tmp, but local storage as opposed to shared storage. Optional and not available on all UIs. |
| /local/${slurm_job_id} | Local | No | CNs | Local storage on compute nodes, only available during job execution. Hence folders are automatically created when a job starts and deleted when it finishes. |
| /mnt/${complete_filesystem} | Shared | Mixed | SAIs | Complete file systems, which may contain various home, prm, tmp or scr dirs. |
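
As an illustration of how the shared tmp and node-local storage are typically combined, the sketch below shows a minimal Slurm job script; the group name umcg-example and the file names are hypothetical:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb
set -e
group='umcg-example'  # Hypothetical group name.
#
# Stage input from shared, temporary group storage to node-local scratch,
# which only exists for the duration of this job.
#
cp "/groups/${group}/tmp01/rawdata/sample.fastq.gz" "/local/${SLURM_JOB_ID}/"
#
# ... run the analysis on /local/${SLURM_JOB_ID}/ here ...
#
# Copy results back to shared tmp storage before the job ends,
# because /local/${SLURM_JOB_ID}/ is deleted automatically when the job finishes.
#
cp "/local/${SLURM_JOB_ID}/results.txt" "/groups/${group}/tmp01/results/"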

Other stacks

Some other stacks of related machines are:

Deployment phases

Deploying a fully functional stack of virtual machines from scratch involves the following steps:

  1. Configure physical machines
    • Off topic for this repo.
  2. Deploy OpenStack virtualization layer on physical machines to create an OpenStack cluster.
    • Off topic for this repo.
    • For the Shikra cloud, which hosts the Talos and Gearshift HPC clusters, we use the Ansible playbooks from the hpc-cloud repository to create the OpenStack cluster.
    • For other HPC clusters we use OpenStack clouds from other service providers as is.
  3. Create, start and configure virtual networks and machines on an OpenStack cluster.
    • This repo.
  4. Deploy bioinformatics software and reference datasets.
    • Off topic for this repo.
    • We use the Ansible playbook from the ansible-pipelines repository to deploy Lua + Lmod + EasyBuild. The latter is then used to install bioinformatics tools.

Details for phase 3. Create, start and configure virtual machines on an OpenStack cluster.

0. Clone this repo and configure Python virtual environment.

mkdir -p ${HOME}/git/
cd ${HOME}/git/
git clone https://github.com/rug-cit-hpc/league-of-robots.git
cd league-of-robots
#
# For older openstacksdk < 0.99 we need the ansible openstack collection 1.x.
# For newer openstacksdk > 1.00 we need the ansible openstack collection 2.x.
#
openstacksdk_major_version='3'  # Change for older OpenStack SDK.
#
# Create Python virtual environment (once)
#
python3 -m venv openstacksdk-${openstacksdk_major_version:-3}.venv
#
# Activate virtual environment.
#
source openstacksdk-${openstacksdk_major_version:-3}.venv/bin/activate
#
# Install OpenStack SDK (once) and other python packages.
#
pip3 install --upgrade pip
pip3 install wheel
pip3 install setuptools  # No longer part of default Python >= 3.12.x, but we need it.
if [[ "${openstacksdk_major_version:-3}" -eq 0 ]]; then
  pip3 install "openstacksdk<0.99"
else
  pip3 install "openstacksdk==${openstacksdk_major_version:-3}.*"
fi
pip3 install openstackclient
pip3 install ruamel.yaml
pip3 install netaddr
#
# Package dnspython is required for Ansible lookup plugin community.general.dig
#
pip3 install dnspython
#
# On macOS only to prevent this error:
# crypt.crypt not supported on Mac OS X/Darwin, install passlib python module.
#
pip3 install passlib
#
# Optional: install Ansible and the Ansible linter with pip.
# You may skip this step if you already installed Ansible by other means.
# E.g. with HomeBrew on macOS, with yum or dnf on Linux, etc.
#
# Ansible core 2.16 from Ansible 9.x is the latest version compatible with Mitogen.
# Install only one of the following two, depending on where you will run the playbooks:
#
pip3 install 'ansible<10' # For running playbooks on your local laptop as Ansible control host.
pip3 install 'ansible<6'  # For running playbooks directly on chaperone machines running RHEL8.
pip3 install ansible-lint
#
# Optional: install Mitogen with pip.
# Mitogen provides an optional strategy plugin that makes playbooks a lot (up to 7 times!) faster.
# See https://mitogen.networkgenomics.com/ansible_detailed.html
#
pip3 install mitogen
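
As an optional sanity check you can verify which versions ended up in the virtual environment (the exact output differs per installation):

#
# Optional sanity check: verify the versions installed in this venv.
#
openstack --version
pip3 list | grep -i -E 'openstacksdk|ansible|mitogen'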

1. Import the required roles and collections for the playbooks.

source openstacksdk-${openstacksdk_major_version:-3}.venv/bin/activate
export ANSIBLE_ROLES_PATH="${VIRTUAL_ENV}/ansible/ansible_roles/:"
export ANSIBLE_COLLECTIONS_PATH="${VIRTUAL_ENV}/ansible/:"
ansible-galaxy install -r requirements-${openstacksdk_major_version:-3}.yml

Note: the default location where these dependencies will get installed with the ansible-galaxy install command is ${HOME}/.ansible/, which may conflict with versions of roles and collections required for other repos. Therefore we set ANSIBLE_ROLES_PATH and ANSIBLE_COLLECTIONS_PATH to use a custom path for the dependencies inside the virtual environment we'll use for this repo.
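
To verify that the dependencies indeed ended up inside the virtual environment and not in ${HOME}/.ansible/, you can list them with the same environment variables still exported:

ansible-galaxy collection list
ansible-galaxy role list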

2. Create a vault_pass.txt.

The vault password is used to encrypt/decrypt the secrets.yml file per stack_name, which will be created in the next step if you do not already have one. In addition, a second vault password is used for various files in group_vars/all/, which contain settings that are the same for all stacks. If you have multiple stacks, each with their own vault password, you will have multiple vault password files. The pattern .vault* is part of .gitignore, so if you put the vault password files in the .vault/ subdir, they will not accidentally get committed to the repo.
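
For example (the .vault/ layout and file names below are only a suggestion; what matters is that each vault password ends up in its own file matching the .vault* ignore pattern):

mkdir -p .vault/
#
# One password file for the shared secrets in group_vars/all/
# and one per stack; replace the placeholders with long random passwords.
#
echo -n 'REPLACE-with-a-long-random-password' > .vault/vault_pass.txt.all
echo -n 'REPLACE-with-another-long-random-password' > .vault/vault_pass.txt.[stack_name]
chmod -R go-rwx .vault/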

3. Configure Ansible settings including the vault.

To create a new stack you will need group_vars and a static inventory for that stack:

To use an existing encrypted group_vars/[stack_name]/secrets.yml:
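
As an example of how a vault password file can then be used on the command line (file name as suggested in step 2; you can also configure this via vault_identity_list in ansible.cfg), viewing an existing encrypted secrets.yml looks like:

ansible-vault view \
    --vault-id [stack_name]@.vault/vault_pass.txt.[stack_name] \
    group_vars/[stack_name]/secrets.yml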

4. Configure the Certificate Authority (CA).

We use an SSH public-private key pair to sign the host keys of all the machines in a cluster. This way users only need the public key of the CA in their ~/.ssh/known_hosts file and will not get bothered by messages like this:

The authenticity of host '....' can't be established.
ED25519 key fingerprint is ....
Are you sure you want to continue connecting (yes/no)?
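
A minimal sketch of how such a CA key pair can be created and how clients trust it (the file locations and domain below are placeholders; the playbooks define where the CA key for a stack actually lives):

#
# Create the CA key pair; the path is just an example.
#
ssh-keygen -t ed25519 -f [path_to_ca_dir]/[stack_name]-ca -C 'SSH host CA for [stack_name]'
#
# Host keys signed with this CA are then trusted by adding a single line
# with the CA's *public* key to ~/.ssh/known_hosts, e.g.:
#
#   @cert-authority *.example.org ssh-ed25519 AAAA... SSH host CA for [stack_name]
#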

5. Build Prometheus Node Exporter.

6. Generate munge key and encrypt it using Ansible Vault.

Execute:

mkdir -p files/[stack_name]
dd if=/dev/urandom bs=1 count=1024 > files/[stack_name]/munge.key
ansible-vault encrypt --encrypt-vault-id [stack_name] files/[stack_name]/munge.key

The encrypted files/[stack_name]/munge.key can now be committed safely.

7. Generate TLS certificate, passwords & hashes for the LDAP server and encrypt it using Ansible Vault.

If you do not configure any LDAP domains using the ldap_domains variable (see ldap_server role for details) in group_vars/[stack_name]/vars.yml, then the machines for the [stack_name] stack will use local accounts created on each machine and this step can be skipped.

If you configured ldap_domains in group_vars/[stack_name]/vars.yml and all LDAP domains have create_ldap: false, then this stack will/must use an external LDAP, that was configured & hosted elsewhere, and this step can be skipped.

If you configured one or more LDAP domains with create_ldap: true, e.g.:

   ldap_domains:
     stack:
       create_ldap: true
       .....
     other_domain:
       some_config_option: anothervalue
       create_ldap: true
       .....

Then this stack will create and run its own LDAP server. You will need to create:

7a TLS certificate for LDAP server.

Create the key and CA certificate with one command:

   openssl req -x509 -nodes -days 1825 -newkey rsa:4096 -keyout files/[stack_name]/ldap.key -out files/[stack_name]/ldap.crt

where you must correctly provide the following values:

     Country Name (2 letter code) [XX]:NL
     State or Province Name (full name) []:Groningen
     Locality Name (eg, city) [Default City]:Groningen
     Organization Name (eg, company) [Default Company Ltd]:UMCG
     Organizational Unit Name (eg, section) []:GCC
     Common Name (eg, your name or your server's hostname) []:ladap
     Email Address []:hpc.helpdesk@umcg.nl

Note that the Common Name must be the address of the LDAP server, which depends on the type of network access to the machine:

7b Passwords and hashes for LDAP accounts.

When an OpenLDAP server is created, you will need passwords and corresponding hashes for the LDAP root account as well as for functional accounts for at least one LDAP domain. Therefore the minimal setup in group_vars/[stack_name]/secrets.yml is something like this:

openldap_root_pw: ''
openldap_root_hash: ''
ldap_credentials:
  stack:
    readonly:
      dn: 'cn=readonly,dc={{ use stack_name here }},dc=local'
      pw: ''
      hash: ''
    admin:
      dn: 'cn={{ use stack_prefix here }}-admin,dc={{ use stack_name here }},dc=local'
      pw: ''
      hash: ''

In this example the LDAP domain named stack is used for users & groups that were created for and are used only on this stack of infra. You may have additional LDAP domains serving as other sources for users and groups.

The pw values may already have been generated with the generate_secrets.py script in step 3. If you added additional LDAP domains later, you can decrypt group_vars/[stack_name]/secrets.yml with ansible-vault, rerun the generate_secrets.py script to generate the additional password values and re-encrypt secrets.yml with ansible-vault.
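
A sketch of that workflow, reusing the vault-id from step 6 (the exact invocation of generate_secrets.py is an assumption; check the script for its actual arguments):

ansible-vault decrypt group_vars/[stack_name]/secrets.yml
./generate_secrets.py group_vars/[stack_name]/secrets.yml  # Invocation is an assumption; check the script.
ansible-vault encrypt --encrypt-vault-id [stack_name] group_vars/[stack_name]/secrets.yml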

For each pw you will need to generate a corresponding hash. You cannot use generate_secrets.py for that, because it requires slappasswd. Therefore, you have to login on the OpenLDAP servers and use:

/usr/local/openldap/sbin/slappasswd \
    -o module-path='/usr/local/openldap/libexec/openldap' \
    -o module-load='argon2' -h '{ARGON2}' \
    -s 'pw_value'

The result is a string with six $-separated values like this:

'{ARGON2}$argon2id$v=19$m=65536,t=2,p=1$7+plp......nDs5J!dSpg$ywJt/ug9j.........qKcdfsgQwEI'

For the record:

  1. {ARGON2}: identifies which hashing schema was used.
  2. argon2id: lists which Argon 2 algorithm was used.
  3. v=19: version of the Argon 2 algorithm.
  4. m=65536,t=2,p=1: lists values used for arguments for the Argon 2 algorithm.
  5. 7+plp......nDs5J!dSpg: the base64 encoded random salt that was added by slappasswd.
  6. ywJt/ug9j.........qKcdfsgQwEI: the base64 encoded hash.

Use the entire strings as the hash values in group_vars/[stack_name]/secrets.yml.

8. Running playbooks.

There are two wrapper playbooks:

  1. openstack.yml:
    • Creates virtual resources in OpenStack: networks, subnets, routers, ports, volumes and finally the virtual machines.
    • Interacts with the OpenstackSDK / API on localhost.
    • Uses a static inventory from static_inventories/*.yaml parsed with our custom inventory plugin inventory_plugins/yaml_with_jumphost.py
  2. cluster.yml:
    • Configures the virtual machines created with the openstack.yml playbook.
    • Has no dependency on the OpenstackSDK / API.
    • Uses a static inventory from static_inventories/*.yaml parsed with our custom inventory plugin inventory_plugins/yaml_with_jumphost.py

The wrapper playbooks execute several roles in the right order to create the complete stack. Playbooks from the single_role_playbooks/ or single_group_playbooks/ subdirectories can be used to (re)deploy individual roles or all roles for only a certain type of machine (inventory group), respectively. These shorter subset playbooks can save a lot of time during development, testing or regular maintenance.
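
For example, after activating the virtual environment and exporting the Ansible paths from step 1, a full deployment of a stack could look like this (the inventory file name is an assumption; use the file for your stack from static_inventories/ and add --vault-id options if you did not configure them in ansible.cfg):

ansible-playbook -i static_inventories/[stack_name].yaml openstack.yml
ansible-playbook -i static_inventories/[stack_name].yaml cluster.yml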

openstack.yml
cluster.yml
Deployment order: local admin accounts and signed host keys must come first

Without local admin accounts we'll need to use the default user account that ships with the cloud image to login to freshly created machines.

In our case the CentOS cloud image comes with a default centos user.

Note that:

Therefore the first step is to create additional local admin accounts:

Without signed host keys, SSH host key checking must be disabled for this first step. The next step is to deploy the signed host keys. Once these first two steps have been deployed, the rest of the steps can be deployed with a local admin account and SSH host key checking enabled, which is the default.
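
A sketch of what that first run could look like (the playbook name in brackets is a placeholder for the relevant playbook from single_role_playbooks/; -u centos matches the default user of the CentOS cloud image mentioned above):

#
# First run: no local admin accounts and no signed host keys yet,
# so login as the cloud image's default user and disable host key checking.
#
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -i static_inventories/[stack_name].yaml -u centos single_role_playbooks/[admin_accounts_playbook].yml
#
# Once the local admin accounts and signed host keys are in place,
# re-enable host key checking (the default) and continue with your own admin account.
#
unset ANSIBLE_HOST_KEY_CHECKING
ansible-playbook -i static_inventories/[stack_name].yaml -u [your_admin_account] cluster.yml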

SSH client config: using the dynamic inventory and jumphosts

In order to reach machines behind the jumphost you will need to configure your SSH client. The templates for the documentation are located in this repo at:
roles/online_docs/templates/mkdocs/docs/
Deployed docs can currently be found at:
http://docs.gcc.rug.nl/
Once configured correctly you should be able to do a multi-hop SSH via a jumphost to a destination server using aliases like this:

Some examples for the Talos development cluster:
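
As a generic illustration only (host names, account and aliases below are hypothetical; the generated docs provide tested config blocks per cluster), a multi-hop setup via a jumphost boils down to something like this:

#
# Append a jumphost and a destination host to your SSH client config.
#
cat >> "${HOME}/.ssh/config" <<'EOF'
Host [jumphost_alias]
    HostName [jumphost_external_address]
    User [your_account]
Host [destination_alias]
    HostName [destination_internal_address]
    User [your_account]
    ProxyJump [jumphost_alias]
EOF
#
# After which a multi-hop login is simply:
#
ssh [destination_alias]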

9. Verify operation.

See the end user documentation, which was generated with the online_docs role, for instructions on how to submit a job to test the cluster.