Job Defense Shield

The software in this repository, which runs on top of the Jobstats platform, can be used to send automated email alerts to users who are underutilizing cluster resources. It can also be used to generate reports for administrators. The software identifies problems such as low CPU/GPU efficiency, excess CPU memory allocation, zero CPU or GPU utilization, CPU/GPU fragmentation, excessive run time limits, serial jobs using multiple cores, and GPU jobs that should be using MIG.

New alerts are easy to write. Simply start from an existing alert and modify it.

Contact

As this package is under active development, feel free to write to Jonathan Halverson (halverson@princeton.edu) with comments or requests.

Installation

The requirements are:

- Python 3
- pandas
- pyarrow
- pyyaml

The jobstats module depends on requests and, optionally, blessed.
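
A quick sanity check that the dependencies are importable (this snippet is not part of the software):

import pandas
import pyarrow
import yaml  # provided by the pyyaml package

# requests and blessed are optional (live-job inspection only):
try:
    import requests, blessed
except ImportError:
    pass  # fine if you will not inspect actively running jobs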

Conda

A Conda environment can be created in this way:

$ conda create --name jds-env pandas pyarrow blessed requests pyyaml -c conda-forge -y

One can store the environment in a specific location by creating this file before running the command above:

$ cat /home/jdh4/.condarc
envs_dirs:
- /home/jdh4/bin

The Python executable will then be available here:

/home/jdh4/bin/jds-env/bin/python

After the environment has been created, one can remove or modify the .condarc file so that future installs go elsewhere. If you do not need to inspect actively running jobs, then you do not need requests or blessed.

Package Manager

One can also do something like:

$ apt-get install python3-pandas python3-requests python3-yaml python3-blessed

Editing the Configuration File

$ cat config.yaml
%YAML 1.1
---
############################
## LOW CPU/GPU EFFICIENCY ##
############################
low-xpu-efficiency-della-cpu:
  cluster: della
  cluster_name: "Della (cpu)"
  partitions:
    - cpu
  xpu: cpu
  eff_thres_pct: 60
  proportion_thres_pct: 2
  num_top_users: 15
  excluded_users:
    - aturing
    - einstein

low-xpu-efficiency-della-gpu:
  cluster: della
  cluster_name: "Della (gpu)"
  partitions:
    - gpu
  xpu: gpu
  eff_thres_pct: 15
  proportion_thres_pct: 2
  num_top_users: 15
  excluded_users:
    - aturing
    - einstein

#######################
## EXCESS CPU MEMORY ##
#######################
excess-cpu-memory-della-cpu:
  tb_hours_per_day: 10
  ratio_threshold: 0.35
  mean_ratio_threshold: 0.35
  median_ratio_threshold: 0.35
  num_top_users: 10
  clusters:
    - della
  partition:
    - cpu
  combine_partitions: False
  cores_per_node: 28
  excluded_users:
    - aturing
    - einstein

#########################
## SHOULD BE USING MIG ##
#########################
should-be-using-mig-della-gpu:
  cluster: della
  partition: gpu
  excluded_users:
    - aturing
    - einstein

Note that the name of each alert is significant: the alert type must appear somewhere in the name (e.g., "should-be-using-mig" must be part of the name of that alert).
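
Since the configuration is plain YAML, this naming convention presumably lets the software match each entry to an alert type by substring. The sketch below is illustrative only, not the actual implementation:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Each top-level key is one alert instance; its type is inferred
# from a substring of the key (e.g., "should-be-using-mig").
mig_alerts = {name: params for name, params in config.items()
              if "should-be-using-mig" in name}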

How to Use

To get started, look at the help menu:

$ git clone https://github.com/jdh4/job_defense_shield.git
$ cd job_defense_shield
$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --help

Here are some specific examples:

$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --zero-gpu-utilization \
                                                          --email \
                                                          --days=7 \
                                                          --files /tigress/jdh4/utilities/job_defense_shield/violations

$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --email \
                                                          --watch \
                                                          --zero-gpu-utilization \
                                                          --low-xpu-efficiencies \
                                                          --datascience \
                                                          --gpu-fragmentation

cron

The following is an example cron entry:

SHELL=/bin/bash
MAILTO=jdh4@princeton.edu
JDS=/tigress/jdh4/utilities/job_defense_shield
PY="/home/jdh4/bin/jds-env/bin/python -uB"
CFG=/tigress/jdh4/utilities/job_defense_shield/config.yaml

15  15 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --excess-cpu-memory -M della -r cpu --num-top-users=5 > ${JDS}/log/excess_memory.log 2>&1
20  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --low-xpu-efficiency   > ${JDS}/log/low_efficiency.log 2>&1
26  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=3  --email --zero-cpu-utilization > ${JDS}/log/zero_cpu.log 2>&1
29  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=10 --email --mig -M della -r gpu  > ${JDS}/log/mig.log 2>&1
10  10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --serial-using-multiple -M della -r cpu > ${JDS}/log/serial_using_multiple.log 2>&1
40  11 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --excessive-time -M della -r cpu > ${JDS}/log/excessive_time.log 2>&1
30  13 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7  --email --cpu-fragmentation > ${JDS}/log/cpu_fragmentation.log 2>&1
 0  14 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=5  --email --gpu-fragmentation > ${JDS}/log/gpu_fragmentation.log 2>&1
 0 */4 * * *   ${JDS}/job_defense_shield.py --days=1  --active-cpu-memory -M della -r cpu --email > ${JDS}/log/active_cpu_memory.log 2>&1
15  15 * * 1-5 ${JDS}/job_defense_shield.py --days=7  --excess-cpu-memory --hard-warning-cpu-memory -M della -r cpu --num-top-users=5 --email > ${JDS}/log/excess_memory.log 2>&1
20   9 * * 1-5 ${JDS}/job_defense_shield.py --days=7  --datascience -M della -r datascience  --email > ${JDS}/log/datascience.log 2>&1
15   9 * * 1-5 /home/jdh4/bin/cluster_report.sh

Cancelling Jobs with 0% GPU Utilization

We do this by running the software on a node that is dedicated to Slurm for the given cluster. The code must be run as a privileged user in order to cancel jobs.
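
Cancellation ultimately comes down to invoking scancel, which is why elevated privileges are required. A minimal sketch (the function name is illustrative, not the actual implementation):

import subprocess

def cancel_job(jobid: str) -> None:
    """Cancel a Slurm job. This only succeeds when run as a
    privileged user (e.g., the slurm user or root)."""
    subprocess.run(["scancel", jobid], check=True)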

Here is an example configuration file:

%YAML 1.1
---
zero-gpu-utilization-della-gpu:
  first_warning_minutes: 60
  second_warning_minutes: 105
  cancel_minutes: 120
  sampling_period_minutes: 15
  min_previous_warnings: 1
  max_interactive_hours: 8
  jobids_file: "/var/spool/slurm/job_defense_shield/jobids.txt"
  clusters:
    - della
  partition:
    - gpu
  excluded_users:
    - aturing
    - einstein
  admin_emails:
    - jdh4@princeton.edu
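
To make the timing parameters concrete: with the values above, a job receives a first warning after 60 minutes of 0% GPU utilization, a second warning after 105 minutes, and is cancelled after 120 minutes, with utilization sampled every 15 minutes. The decision function below is an illustrative sketch of that logic, not the actual implementation:

def action_for(zero_util_minutes: int) -> str:
    """Map minutes of sustained 0% GPU utilization to an action,
    using the thresholds from the example configuration above."""
    if zero_util_minutes >= 120:  # cancel_minutes
        return "cancel"
    if zero_util_minutes >= 105:  # second_warning_minutes
        return "second warning"
    if zero_util_minutes >= 60:   # first_warning_minutes
        return "first warning"
    return "no action"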

Here is an example cron entry:

PY=/var/spool/slurm/cancel_zero_gpu_jobs/envs/jds-env/bin
JDS=/var/spool/slurm/job_defense_shield
MYLOG=/var/spool/slurm/cancel_zero_gpu_jobs/log
VIOLATION=/var/spool/slurm/job_defense_shield/violations
MAILTO=jdh4@princeton.edu

*/15 * * * * ${PY}/python -uB ${JDS}/job_defense_shield.py --zero-gpu-utilization --days=1 --email --files=${VIOLATION} -M della -r gpu > ${MYLOG}/zero_gpu_utilization.log 2>&1

Which users have received email alerts?

$ /home/jdh4/bin/jds-env/bin/python -uB /tigress/jdh4/utilities/job_defense_shield/job_defense_shield.py --check --zero-gpu-utilization --days=30

Notes for Developers

To run the unit tests:

$ module load anaconda3/2023.3
$ pytest --cov=. --capture=tee-sys tests
$ pytest -s tests  # the -s flag shows output from print statements

Be aware of the following:

  1. Some Traverse jobs are CPU only
  2. Pandas operations can fail on an empty DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = df[df.A > 10]
>>> df.empty
True
>>> df["C"] = df.apply(lambda row: row["A"] * row["B"], axis="columns")
ValueError: Wrong number of items passed 2, placement implies 1
>>> df["C"] = df.A.apply(round)  # this is okay
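
A simple guard avoids the failure above; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df[df.A > 10]  # df is now empty

# df.apply with a row lambda misbehaves on an empty DataFrame,
# so check df.empty before computing the new column.
if not df.empty:
    df["C"] = df.apply(lambda row: row["A"] * row["B"], axis="columns")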