
GitLab runner for HPC systems

In rootless mode, relying on ENROOT and SLURM.

  1. Overview
    1. Purpose and Features
    2. Dependencies
    3. Code Structure
    4. Configuration Variables
      1. Global Options
      2. SLURM Behavior
  2. Installation
    1. Installing a gitlab-runner
    2. Enroot and Cluster Setup
    3. Volume mounting and Ccache setup
  3. Usage Example
  4. License
  5. Links

Overview

Purpose and Features

This set of scripts aims to enable user-level (no root access required) Continuous Integration on HPC clusters by relying on GitLab runner's custom executors, ENROOT as a rootless container replacement for Docker, and the SLURM job scheduler when using compute nodes. It also optionally supports Ccache to speed up compilation times in CI jobs. This tool was inspired by the NHR@KIT Cx Project, which provides ENROOT and GitLab-runner on their clusters. It is used in production in some of the Ginkgo software's pipelines.

SLURM usage is optional in this set of scripts, since many simple CI steps, such as compilation, are expected to run on a login node to optimize computing time and resource sharing. Currently, the script uses non-interactive job submission and waiting loops to ensure the correct completion of the job on the cluster.
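As an illustration of that approach, here is a minimal sketch of a non-interactive submission followed by a waiting loop; the file name ci_job.sh, the polling interval, and the exact options are placeholders rather than the values used by these scripts.

# Submit the job script non-interactively; --parsable makes sbatch print
# only the job ID.
job_id=$(sbatch --parsable ci_job.sh)

# Wait until the job has left the queue, then check its final state.
while squeue --job "${job_id}" --noheader | grep -q .; do
    sleep 30
done
state=$(sacct -j "${job_id}" --format=State --noheader | head -n1 | awk '{print $1}')
[ "${state}" = "COMPLETED" ] || exit 1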

A typical use case for these scripts is to build the software in an ENROOT container on a login node and then reuse that container to run the tests inside a SLURM job on a compute node; see the usage example for concrete details.

Dependencies

Several standard Linux commands are used on top of the ENROOT and SLURM commands. For some of these, the script may rely on non-standard or GNU-only options; this has not been cleaned up yet.

Always required:

With SLURM:

Code Structure

The code structure is simple: it consists of the standard GitLab-runner custom executor scripts, namely config.sh, prepare.sh, run.sh, and cleanup.sh.

The main configuration variables and functions are defined in the following files:

Configuration Variables

The following variables control some aspects of the scripts' functionality. They can be set as job variables in the CI script or through the GitLab web interface. Inside the executor scripts, they need to be accessed as ${CUSTOM_ENV_<VARIABLE>}.
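For example, a variable set in the CI job configuration, such as SLURM_PARTITION from the usage example below, would be read inside an executor script roughly like this (the fallback value is only a placeholder):

# A job variable SLURM_PARTITION defined in .gitlab-ci.yml arrives in the
# executor scripts with the CUSTOM_ENV_ prefix.
partition="${CUSTOM_ENV_SLURM_PARTITION:-}"
# Fall back to a placeholder default when the job did not set it.
[ -n "${partition}" ] || partition="default_partition"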

Global Options

These variables are not SLURM-specific and can be used in the default, ENROOT-only mode.

Optional:

Volumes:

SLURM Behavior

When any of these variables is set, instead of running the container directly on the node where gitlab-runner is running, the scripts submit a SLURM job. These variables control the SLURM job submission and related behavior.
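As a rough, hypothetical illustration of this mapping (not the exact logic of run.sh), the SLURM variables from the usage example below could be turned into submission options along these lines:

# Collect SLURM options from the job's CUSTOM_ENV_ variables; each option
# is only added when the corresponding variable was set.
slurm_opts=""
[ -n "${CUSTOM_ENV_SLURM_PARTITION:-}" ] && slurm_opts+=" --partition=${CUSTOM_ENV_SLURM_PARTITION}"
[ -n "${CUSTOM_ENV_SLURM_GRES:-}" ] && slurm_opts+=" --gres=${CUSTOM_ENV_SLURM_GRES}"
[ -n "${CUSTOM_ENV_SLURM_TIME:-}" ] && slurm_opts+=" --time=${CUSTOM_ENV_SLURM_TIME}"
[ "${CUSTOM_ENV_SLURM_EXCLUSIVE:-}" = "ON" ] && slurm_opts+=" --exclusive"
# The collected options are then passed to the non-interactive submission,
# e.g. sbatch ${slurm_opts} ci_job.sh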

These variables control the SLURM job waiting loop behavior:

Installation

The instructions are for a standard Linux system that already supports user-mode GitLab runners and has enroot installed (see dependencies). Also refer to the NHR@KIT CI user documentation, which details this setup on their systems.

Installing a gitlab-runner

The standard gitlab-runner installation and registration commands can be used. Make sure to select the custom executor; see the gitlab runner registration documentation. Here is an example of what a runner configuration can look like, usually found in ~/.gitlab-runner/config.toml:

[[runners]]
  name = "enroot executor"
  url = "https://gitlab.com"
  token = "<token>"
  executor = "custom"
  builds_dir = "/workspace/scratch/my-ci-project/gitlab-runner/builds/"
  cache_dir = "/workspace/scratch/my-ci-project/gitlab-runner/cache/"
  environment = ["CI_WS=/workspace/scratch/my-ci-project", 
                 "VOL_1_SRC=/workspace/scratch/my-ci-project/ccache", "VOL_1_DST=/ccache",
                 "VOL_2_SRC=/workspace/scratch/my-ci-project/test_data", "VOL_2_DST=/test_data",
                 "NUM_VOL=2", "CCACHE_MAXSIZE=40G"]
  [runners.custom_build_dir]
    enabled = false
  [runners.custom]
    config_exec = "/<path_to>/gitlab-hpc-ci-cb/config.sh"
    prepare_exec = "/<path_to>/gitlab-hpc-ci-cb/prepare.sh"
    run_exec = "/<path_to>/gitlab-hpc-ci-cb/run.sh"
    cleanup_exec = "/<path_to>/gitlab-hpc-ci-cb/cleanup.sh"
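
As an alternative to editing config.toml by hand, a runner like the one above can usually be registered non-interactively. The sketch below assumes a recent gitlab-runner that provides the custom executor registration flags (flag names can differ slightly between versions) and reuses the placeholder paths and token from the example:

gitlab-runner register --non-interactive \
  --url "https://gitlab.com" \
  --token "<token>" \
  --name "enroot executor" \
  --executor "custom" \
  --builds-dir "/workspace/scratch/my-ci-project/gitlab-runner/builds/" \
  --cache-dir "/workspace/scratch/my-ci-project/gitlab-runner/cache/" \
  --custom-config-exec "/<path_to>/gitlab-hpc-ci-cb/config.sh" \
  --custom-prepare-exec "/<path_to>/gitlab-hpc-ci-cb/prepare.sh" \
  --custom-run-exec "/<path_to>/gitlab-hpc-ci-cb/run.sh" \
  --custom-cleanup-exec "/<path_to>/gitlab-hpc-ci-cb/cleanup.sh"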

Enroot and Cluster Setup

On machines using systemd and logind, enable lingering for your user so that the gitlab-runner daemon keeps running after you log out: loginctl enable-linger ${USER}. To check whether the property is active, use loginctl show-user ${USER} --property=Linger, which should output Linger=yes.

As detailed in global options, it is required to set the environment variable CI_WS either in the runner configuration or in the script to be used as a workspace for storing enroot containers, caching, and more.

After the new GitLab runner has been configured, lingering has been enabled, and the other cluster setup steps are finished, start your runner in user mode with the following commands on a systemd-based system:

# Enable your own gitlab-runner and start it up
systemctl --user enable --now gitlab-runner
# Check that the gitlab runner is running
systemctl --user status gitlab-runner
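
If the runner does not come up, the systemd user journal is a reasonable first place to look (assuming the unit is named gitlab-runner as above):

# Show recent log output of the user-level gitlab-runner service
journalctl --user -u gitlab-runner --since "1 hour ago"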

Volume mounting and Ccache setup

A generic volume mounting interface is provided. This is useful for Ccache support but can be used for other purposes as well. It is configured through the following environment variables:

  1. NUM_VOL specifies the number of volumes configured.
  2. VOL_1_SRC is the volume source (on the cluster), e.g. ${CI_WS}/ccache
  3. VOL_1_DST is the volume destination (in the container), e.g. /ccache

A full example is available in Installing a gitlab-runner.
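
Internally, such volume variables would typically end up as enroot mount options. The following is only a sketch, assuming enroot start's --mount option and the CUSTOM_ENV_ prefix described in Configuration Variables:

# Assemble one --mount option per configured volume.
mount_opts=()
for i in $(seq 1 "${CUSTOM_ENV_NUM_VOL:-0}"); do
    src_var="CUSTOM_ENV_VOL_${i}_SRC"
    dst_var="CUSTOM_ENV_VOL_${i}_DST"
    mount_opts+=( --mount "${!src_var}:${!dst_var}" )
done
# The options are then passed on when starting the container,
# e.g. enroot start "${mount_opts[@]}" <container_name>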

Usage Example

Assuming that the default_build anchor contains the commands for compiling your software in the required setting, and default_test contains the equivalent of make test, the following gitlab-ci YAML configuration will build the software in a persistent ENROOT container on a login node and then run the tests inside a SLURM job on a GPU node, reusing the same container.

Note that this works because both jobs use the same custom name simple_hpc_ci_job (set via USE_NAME), which needs to be unique to the pipeline but shared among its jobs.

# Placeholder anchor definitions for default_build and default_test; replace
# the script bodies with your project's actual build and test commands.
.default_build: &default_build
  script:
    - mkdir -p build && cd build && cmake .. && make

.default_test: &default_test
  script:
    - cd build && make test

stages:
  - build
  - test

my_build_job:
  image: ubuntu:xenial
  stage: build
  <<: *default_build
  variables:
    USE_NAME: "simple_hpc_ci_job"
    KEEP_CONTAINER: "ON"
    NVIDIA_VISIBLE_DEVICES: "void"
  tags:
    - my_enroot_runner

slurm_test_job:
  image: ubuntu:xenial
  stage: test
  <<: *default_test
  variables:
    USE_NAME: "simple_hpc_ci_job"
    SLURM_PARTITION: "gpu"
    SLURM_EXCLUSIVE: "ON"
    SLURM_GRES: "gpu:1"
    SLURM_TIME: "00:30:00"
  dependencies: [ "my_build_job" ]
  tags:
    - my_enroot_runner

after_script

The after_script step is never executed inside a SLURM job; it is always executed directly instead. It is assumed that this step is only used for cleanup or similar purposes.

License

Licensed under the BSD 3-Clause license.

Links