Job Defense Shield

The software in this repository, which runs on top of the Jobstats platform, can be used to send automated email alerts to users who are underutilizing cluster resources. It can also be used to generate reports for administrators. The software identifies problems such as low CPU/GPU efficiency, excess CPU memory allocation, zero CPU or GPU utilization, CPU/GPU fragmentation, excessive run time limits, serial jobs using multiple cores, and GPU jobs that should be using MIG.

New alerts are easy to write. Simply start from an existing alert and modify it.

Contact

As this package is under active development, feel free to write to Jonathan Halverson (halverson@princeton.edu) with comments or requests.

Installation

The requirements are:

- Python 3
- pandas
- pyarrow
- pyyaml

The jobstats module depends on requests and, optionally, blessed.
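
A quick sanity check that the dependencies are importable (this snippet is not part of the software):

import pandas
import pyarrow
import yaml  # provided by the pyyaml package

# requests and blessed are optional (live-job inspection only):
try:
    import requests, blessed
except ImportError:
    pass  # fine if you will not inspect actively running jobs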

Conda

A Conda environment can be created in this way:

$ conda create --name jds-env pandas pyarrow blessed requests pyyaml -c conda-forge -y

One can store the environment in a specific location by creating this file before running the command above:

$ cat /home/jdh4/.condarc
envs_dirs:
- /home/jdh4/bin

The Python executable will then be available here:

/home/jdh4/bin/jds-env/bin/python

After the environment has been created, one can remove or modify the .condarc file so that future installs go elsewhere. If you do not need to inspect actively running jobs, then you do not need requests or blessed.

Package Manager

One can also do something like:

$ apt-get install python3-pandas python3-requests python3-yaml python3-blessed

Editing the Configuration File

$ cat config.yaml
%YAML 1.1
---
############################
## LOW CPU/GPU EFFICIENCY ##
############################
low-xpu-efficiency-della-cpu:
  cluster: della
  cluster_name: "Della (cpu)"
  partitions:
    - cpu
  xpu: cpu
  eff_thres_pct: 60
  proportion_thres_pct: 2
  num_top_users: 15
  excluded_users:
    - aturing
    - einstein

low-xpu-efficiency-della-gpu:
  cluster: della
  cluster_name: "Della (gpu)"
  partitions:
    - gpu
  xpu: gpu
  eff_thres_pct: 15
  proportion_thres_pct: 2
  num_top_users: 15
  excluded_users:
    - aturing
    - einstein

#######################
## EXCESS CPU MEMORY ##
#######################
excess-cpu-memory-della-cpu:
  tb_hours_per_day: 10
  ratio_threshold: 0.35
  mean_ratio_threshold: 0.35
  median_ratio_threshold: 0.35
  num_top_users: 10
  clusters:
    - della
  partition:
    - cpu
  combine_partitions: False
  cores_per_node: 28
  excluded_users:
    - aturing
    - einstein

#########################
## SHOULD BE USING MIG ##
#########################
should-be-using-mig-della-gpu:
  cluster: della
  partition: gpu
  excluded_users:
    - aturing
    - einstein

Note that the name of each alert is significant: the alert type must appear somewhere in the name (e.g., "should-be-using-mig" must be part of the name of that alert).
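
Since the configuration is plain YAML, this naming convention presumably lets the software match each entry to an alert type by substring. The sketch below is illustrative only, not the actual implementation:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Each top-level key is one alert instance; its type is inferred
# from a substring of the key (e.g., "should-be-using-mig").
mig_alerts = {name: params for name, params in config.items()
              if "should-be-using-mig" in name}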

How to Use

To get started, look at the help menu:

$ git clone https://github.com/jdh4/job_defense_shield.git
$ cd job_defense_shield
$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --help

Here are some specific examples:

$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --zero-gpu-utilization \
                                                          --email \
                                                          --days=7 \
                                                          --files /tigress/jdh4/utilities/job_defense_shield/violations

$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --email \
                                                          --watch \
                                                          --zero-gpu-utilization \
                                                          --low-xpu-efficiencies \
                                                          --datascience \
                                                          --gpu-fragmentation

cron

The following is an example cron entry:

SHELL=/bin/bash
MAILTO=jdh4@princeton.edu
JDS=/tigress/jdh4/utilities/job_defense_shield
PY="/home/jdh4/bin/jds-env/bin/python -uB"
CFG=/tigress/jdh4/utilities/job_defense_shield/config.yaml

15  15 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --excess-cpu-memory -M della -r cpu --num-top-users=5 > ${JDS}/log/excess_memory.log 2>&1
20  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --low-xpu-efficiency   > ${JDS}/log/low_efficiency.log 2>&1
26  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=3  --email --zero-cpu-utilization > ${JDS}/log/zero_cpu.log 2>&1
29  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=10 --email --mig -M della -r gpu  > ${JDS}/log/mig.log 2>&1
10  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --serial-using-multiple -M della -r cpu > ${JDS}/log/serial_using_multiple.log 2>&1
40  11 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --excessive-time -M della -r cpu > ${JDS}/log/excessive_time.log 2>&1
30  13 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --cpu-fragmentation > ${JDS}/log/cpu_fragmentation.log 2>&1
 0  14 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=5  --email --gpu-fragmentation > ${JDS}/log/gpu_fragmentation.log 2>&1
 0 */4 * * *   ${JDS}/job_defense_shield.py --days=1  --active-cpu-memory -M della -r cpu --email > ${JDS}/log/active_cpu_memory.log 2>&1
15  15 * * 1-5 ${JDS}/job_defense_shield.py --days=7  --excess-cpu-memory --hard-warning-cpu-memory -M della -r cpu --num-top-users=5 --email > ${JDS}/log/excess_memory.log 2>&1
20   9 * * 1-5 ${JDS}/job_defense_shield.py --days=7  --datascience -M della -r datascience  --email > ${JDS}/log/datascience.log 2>&1
15   9 * * 1-5 /home/jdh4/bin/cluster_report.sh

Cancelling Jobs with 0% GPU Utilization

We do this by running the software on a node that is dedicated to Slurm for the given cluster. The code must be run as a privileged user in order to cancel jobs.
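
Cancellation ultimately comes down to invoking scancel, which is why elevated privileges are required. A minimal sketch (the function name is illustrative, not the actual implementation):

import subprocess

def cancel_job(jobid: str) -> None:
    """Cancel a Slurm job. This only succeeds when run as a
    privileged user (e.g., the slurm user or root)."""
    subprocess.run(["scancel", jobid], check=True)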

Here is an example configuration file:

%YAML 1.1
---
zero-gpu-utilization-della-gpu:
  first_warning_minutes: 60
  second_warning_minutes: 105
  cancel_minutes: 120
  sampling_period_minutes: 15
  min_previous_warnings: 1
  max_interactive_hours: 8
  jobids_file: "/var/spool/slurm/job_defense_shield/jobids.txt"
  clusters:
    - della
  partition:
    - gpu
  excluded_users:
    - aturing
    - einstein
  admin_emails:
    - jdh4@princeton.edu
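
To make the timing parameters concrete: with the values above, a job receives a first warning after 60 minutes of 0% GPU utilization, a second warning after 105 minutes, and is cancelled after 120 minutes, with utilization sampled every 15 minutes. The decision function below is an illustrative sketch of that logic, not the actual implementation:

def action_for(zero_util_minutes: int) -> str:
    """Map minutes of sustained 0% GPU utilization to an action,
    using the thresholds from the example configuration above."""
    if zero_util_minutes >= 120:  # cancel_minutes
        return "cancel"
    if zero_util_minutes >= 105:  # second_warning_minutes
        return "second warning"
    if zero_util_minutes >= 60:   # first_warning_minutes
        return "first warning"
    return "no action"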

Here is an example cron entry:

PY=/var/spool/slurm/cancel_zero_gpu_jobs/envs/jds-env/bin
JDS=/var/spool/slurm/job_defense_shield
MYLOG=/var/spool/slurm/cancel_zero_gpu_jobs/log
VIOLATION=/var/spool/slurm/job_defense_shield/violations
MAILTO=jdh4@princeton.edu

*/15 * * * * ${PY}/python -uB ${JDS}/job_defense_shield.py --zero-gpu-utilization --days=1 --email --files=${VIOLATION} -M della -r gpu > ${MYLOG}/zero_gpu_utilization.log 2>&1

Which users have received email alerts?

$ /home/jdh4/bin/jds-env/bin/python -uB /tigress/jdh4/utilities/job_defense_shield/job_defense_shield.py --check --zero-gpu-utilization --days=30

Notes for Developers

To run the unit tests:

$ module load anaconda3/2023.3
$ pytest --cov=. --capture=tee-sys tests
$ pytest -s tests  # the -s flag shows output from print statements

Be aware of the following:

  1. Some Traverse jobs are CPU only
  2. Pandas operations can fail on an empty DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = df[df.A > 10]
>>> df.empty
True
>>> df["C"] = df.apply(lambda row: row["A"] * row["B"], axis="columns")
ValueError: Wrong number of items passed 2, placement implies 1
>>> df["C"] = df.A.apply(round)  # this is okay
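
A simple guard avoids the failure above; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df[df.A > 10]  # df is now empty

# df.apply with a row lambda misbehaves on an empty DataFrame,
# so check df.empty before computing the new column.
if not df.empty:
    df["C"] = df.apply(lambda row: row["A"] * row["B"], axis="columns")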