The software in this repository, which runs on top of the Jobstats platform, can be used to send automated email alerts to users that are underutilizing the cluster resources. It can also be used for generating reports for administrators. The software identifies the following:
New alerts are easy to write. Simply start from an existing alert and modify it.
As this package is being developed, feel free to write to Jonathan Halverson (halverson@princeton.edu) with any comments/requests.
The requirements are:
The jobstats module depends on requests
and, optionally, blessed
.
A Conda environment can be created in this way:
$ conda create --name jds-env pandas pyarrow blessed requests pyyaml -c conda-forge -y
One can store the environment in a specific location by creating this file before running the command above:
$ cat /home/jdh4/.condarc
envs_dirs:
- /home/jdh4/bin
The Python executable will then be available here:
/home/jdh4/bin/jds-env/bin/python
After the environment is made, one can remove or modify the .condarc
file so that future installs go elsewhere. If you do not need to inspect actively running jobs then you do not need requests
or blessed
.
One can also do something like:
$ apt-get install python3-pandas python3-requests python3-yaml python3-blessed
$ cat config.yaml
%YAML 1.1
---
############################
## LOW CPU/GPU EFFICIENCY ##
############################
low-xpu-efficiency-della-cpu:
cluster: della
cluster_name: "Della (cpu)"
partitions:
- cpu
xpu: cpu
eff_thres_pct: 60
proportion_thres_pct: 2
num_top_users: 15
excluded_users:
- aturing
- einstein
low-xpu-efficiency-della-gpu:
cluster: della
cluster_name: "Della (gpu)"
partitions:
- gpu
xpu: gpu
eff_thres_pct: 15
proportion_thres_pct: 2
num_top_users: 15
excluded_users:
- aturing
- einstein
#######################
## EXCESS CPU MEMORY ##
#######################
excess-cpu-memory-della-cpu:
tb_hours_per_day: 10
ratio_threshold: 0.35
mean_ratio_threshold: 0.35
median_ratio_threshold: 0.35
num_top_users: 10
clusters:
- della
partition:
- cpu
combine_partitions: False
cores_per_node: 28
excluded_users:
- aturing
- einstein
#########################
## SHOULD BE USING MIG ##
#########################
should-be-using-mig-della-gpu:
cluster: della
partition: gpu
excluded_users:
- aturing
- einstein
Note that the name of each alert is important (i.e., "should-be-using-mig" must be in the name of the alert).
To get started, look at the help menu:
$ git clone https://github.com/jdh4/job_defense_shield.git
$ cd job_defense_shield
$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --help
Here are some specific examples:
$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --zero-gpu-utilization \
--email \
--days=7 \
--files /tigress/jdh4/utilities/job_defense_shield/violations
$ /home/jdh4/bin/jds-env/bin/python job_defense_shield.py --email \
--watch \
--zero-gpu-utilization \
--low-xpu-efficiencies \
--datascience \
--gpu-fragmentation
The following is an example cron entry:
SHELL=/bin/bash
MAILTO=jdh4@princeton.edu
JDS=/tigress/jdh4/utilities/job_defense_shield
PY="/home/jdh4/bin/jds-env/bin/python -uB"
CFG=/tigress/jdh4/utilities/job_defense_shield/config.yaml
15 15 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7 --email --excess-cpu-memory -M della -r cpu --num-top-users=5 > ${JDS}/log/excess_memory.log 2>&1
20 10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7 --email --low-xpu-efficiency > ${JDS}/log/low_efficiency.log 2>&1
26 10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=3 --email --zero-cpu-utilization > ${JDS}/log/zero_cpu.log 2>&1
29 10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=10 --email --mig -M della -r gpu > ${JDS}/log/mig.log 2>&1
10 10 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7 --email --serial-using-multiple -M della -r cpu > ${JDS}/log/serial_using_multiple.log 2>&1
40 11 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7 --email --excessive-time -M della -r cpu > ${JDS}/log/excessive_time.log 2>&1
30 13 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=7 --email --cpu-fragmentation > ${JDS}/log/cpu_fragmentation.log 2>&1
0 14 * * 1-5 ${PY} ${JDS}/job_defense_shield.py --config-file=${CFG} --days=5 --email --gpu-fragmentation > ${JDS}/log/gpu_fragmentation.log 2>&1
0 */4 * * * ${JDS}/job_defense_shield.py --days=1 --active-cpu-memory -M della -r cpu --email > ${JDS}/log/active_cpu_memory.log 2>&1
15 15 * * 1-5 ${JDS}/job_defense_shield.py --days=7 --excess-cpu-memory --hard-warning-cpu-memory -M della -r cpu --num-top-users=5 --email > ${JDS}/log/excess_memory.log 2>&1
20 9 * * 1-5 ${JDS}/job_defense_shield.py --days=7 --datascience -M della -r datascience --email > ${JDS}/log/datascience.log 2>&1
15 9 * * 1-5 /home/jdh4/bin/cluster_report.sh
We do this by running the software on a node that is dedicated to Slurm for a given cluster. The code must be ran as a priviledged user in order to cancel jobs.
Here is an example configuration file:
%YAML 1.1
---
zero-gpu-utilization-della-gpu:
first_warning_minutes: 60
second_warning_minutes: 105
cancel_minutes: 120
sampling_period_minutes: 15
min_previous_warnings: 1
max_interactive_hours: 8
jobids_file: "/var/spool/slurm/job_defense_shield/jobids.txt"
clusters:
- della
partition:
- gpu
excluded_users:
- aturing
- einstein
admin_emails:
- jdh4@princeton.edu
Here is an example cron entry:
PY=/var/spool/slurm/cancel_zero_gpu_jobs/envs/jds-env/bin
JDS=/var/spool/slurm/job_defense_shield
MYLOG=/var/spool/slurm/cancel_zero_gpu_jobs/log
VIOLATION=/var/spool/slurm/job_defense_shield/violations
MAILTO=jdh4@princeton.edu
*/15 * * * * ${PY}/python -uB ${JDS}/job_defense_shield.py --zero-gpu-utilization --days=1 --email --files=${VIOLATION} -M della -r gpu > ${MYLOG}/zero_gpu_utilization.log 2>&1
$ /home/jdh4/bin/jds-env/bin/python -uB /tigress/jdh4/utilities/job_defense_shield/job_defense_shield.py --check --zero-gpu-utilization --days=30
To run the unit tests:
$ module load anaconda3/2023.3
$ pytest --cov=. --capture=tee-sys tests
$ pytest -s tests # Hasling says -s to run print statements
Be aware of the following:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})
>>> df = df[df.A > 10]
>>> df.empty
True
>>> df["C"] = df.apply(lambda row: row["A"] * row["B"], axis="columns")
# ValueError: Wrong number of items passed 2, placement implies 1
df["C"] = df.A.apply(round) # this is okay
>>>