ngs-docs / 2021-august-remote-computing

Remote computing workshops in August 2021
https://ngs-docs.github.io/2021-august-remote-computing

workshop 10: Executing large analyses on HPC clusters with slurm #10

Open ctb opened 3 years ago

ctb commented 3 years ago

Thursday, August 26, from 9 am to 11:30 am PDT

Instructors:
Moderator: Marisa
Helpers:

Zoom link:

Description:

This two-hour workshop will introduce attendees to slurm, the system for queuing and scheduling analyses on high-performance compute clusters. We will also cover cluster computing concepts and talk about how to estimate the compute resources you need and measure how much you’ve used.

draft lesson: https://github.com/ngs-docs/2021-GGG298/tree/latest/Week9-Slurm_and_Farm_cluster_for_doing_analysis

owner: ???

ctb commented 3 years ago

ask sergey to delineate HPC organization for us on campus, along with who can get accounts on what cluster.

jeremywalter commented 3 years ago

Pre-Survey: https://forms.gle/dDi2CR4kET5Zf7eD7
Post-Survey: https://forms.gle/w2vEDChHZ8UP2Pke9

marisalim commented 3 years ago

From Sergey:

rack and node number in the rack

marisalim commented 3 years ago

maybe change the challenge to 1 minute instead of 5 seconds

CHALLENGE: on the farm head node, set yourself up for a 5 second session using srun. What happens when the five seconds are up?

output would look something like this:

(base) datalab-02@c6-94:~$ srun: Force Terminated job 37973767
srun: Job step aborted: Waiting up to 132 seconds for job step to finish.
exit

marisalim commented 3 years ago

make time same format for srun and sbatch

srun --partition high2 --time=00:10:00 --pty /bin/bash
sbatch -t 00-00:05:00 -p high2 HelloWorld.sh

s-canchi commented 3 years ago

Time format 00-00:05:00 is days-hours:minutes:seconds. Should clarify in the notes.
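
To make the format concrete, here is a minimal sketch that converts a number of seconds into the days-hours:minutes:seconds form. The function name `to_slurm_time` is hypothetical and not part of slurm or the lesson; the same string should work for both `srun --time` and `sbatch -t`.

```shell
# Hypothetical helper (not part of slurm): convert seconds to DD-HH:MM:SS.
to_slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local mins=$(( (total % 3600) / 60 ))
    local secs=$(( total % 60 ))
    printf '%02d-%02d:%02d:%02d\n' "$days" "$hours" "$mins" "$secs"
}

to_slurm_time 300    # prints 00-00:05:00 (five minutes)
to_slurm_time 600    # prints 00-00:10:00 (ten minutes)
```

With this, the two commands in the notes could both be written with the same style of value, e.g. `--time=00-00:10:00` for srun and `-t 00-00:05:00` for sbatch.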

marisalim commented 3 years ago

what is the maximum number of jobs/CPUs you can request for snakemake? If you have 100 jobs can you request 100 CPUs?

marisalim commented 3 years ago

add how to check on your job progress - check the slurm output file! Output is appended to it as the job runs, and you can use tail to check the end of the file.
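
A possible sketch for the notes, assuming sbatch's default output naming (`slurm-<jobid>.out` in the submission directory). The helper name `tail_latest_slurm_log` is hypothetical; it prints the last lines of the most recently modified slurm output file.

```shell
# Hypothetical helper: show the last lines of the newest slurm-*.out file.
# Assumes the default sbatch output name, slurm-<jobid>.out.
tail_latest_slurm_log() {
    local dir=${1:-.}       # directory to look in (default: current directory)
    local lines=${2:-10}    # number of lines to show (default: 10)
    local newest
    # ls -t sorts by modification time, newest first
    newest=$(ls -t "$dir"/slurm-*.out 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no slurm-*.out files in $dir" >&2
        return 1
    fi
    tail -n "$lines" "$newest"
}
```

For a job that is still running, `tail -f slurm-<jobid>.out` keeps printing new lines as they are appended.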

marisalim commented 3 years ago

command should be (from section 10.4.1)

/usr/bin/time -v <shell command>

# i.e.,
/usr/bin/time -v ls

marisalim commented 3 years ago

some issues with sstat (empty table?) and scontrol (not showing memory?) - maybe the demos are too small to show up?

expected output should look like this:

+ sstat --format JobID,MaxRSS,AveCPU -P 28300356.batch
JobID|MaxRSS|AveCPU
28300356.batch|10359448K|01:54:51
Name                : charcoal
User                : ctbrown
Partition           : med2
Nodes               : c6-92
Cores               : 32
GPUs                : 0
State               : COMPLETED
Submit              : 2020-12-05T07:51:06
Start               : 2020-12-05T07:51:06
End                 : 2020-12-05T08:17:26
Reserved walltime   : 02:00:00
Used walltime       : 00:26:20
Used CPU time       : 01:54:51
% User (Computation): 87.17%
% System (I/O)      : 12.83%
Mem reserved        : 120000M/node
Max Mem used        : 9.88G (c6-92)
Max Disk Write      : 446.00M (c6-92)
Max Disk Read       : 51.27G (c6-92)
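
To connect the two outputs above: the MaxRSS value from sstat is in kilobytes, and converting it to GiB gives the same figure as the "Max Mem used" line. This is a minimal sketch using awk; the helper name `maxrss_to_gib` is hypothetical, and the sstat output is pasted in as text rather than queried from a live job.

```shell
# Hypothetical helper: convert a MaxRSS value such as 10359448K to GiB.
maxrss_to_gib() {
    # strip the trailing K, then divide kilobytes by 1024 twice
    echo "$1" | awk '{ sub(/K$/, ""); printf "%.2fG\n", $1 / 1024 / 1024 }'
}

# Pull MaxRSS (second field) out of the pipe-delimited output, skipping the header.
printf 'JobID|MaxRSS|AveCPU\n28300356.batch|10359448K|01:54:51\n' |
    awk -F'|' 'NR > 1 { print $2 }' |
    while read -r rss; do maxrss_to_gib "$rss"; done
# prints 9.88G
```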

marisalim commented 3 years ago

fix typos in 10.4.3 / partitions