We plan to use UCSB's HPC clusters to run the analysis on the new species (and maybe all species if the workflow changes a lot). There are several options, some of which cost money. Put together a guide for the differences between them and a recommendation for which to use.
Considerations:
the data size is not very large (point data), so we do not necessarily need a ton of parallelization or cores
the analysis may take a long time to run, so check cluster run time limits
we want to improve the way we log, so we may need fast I/O
After meeting with the UCSB HPC team, Paul and Jay, here are some notes regarding our options for this specific workflow:
conda should be installed in my home dir; it is not available machine-wide, and it does not need to go in /scratch because there are technically no tight storage quotas to worry about in the home dirs
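a minimal sketch of installing Miniconda into the home dir (the installer URL is the standard Linux one; the $HOME/miniconda3 prefix is just a choice):

    # download the standard Linux Miniconda installer
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # batch install into the home directory
    bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
    # set up conda for future shells
    "$HOME/miniconda3/bin/conda" init bash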
the 2 standard free options are pod and knot. the paid options cost way too much (thousands of dollars) and would not improve our experience enough to be worth it, considering that we do not need a ton of parallelization or storage
pod and knot have similar software, but knot's hardware is older than pod's, so jobs on knot will run slower but face a much shorter queue, while pod will have a longer queue but shorter runtimes
researchers running data analyses should be in communication with Paul and Jay about their jobs. This is a small operation, and sometimes a researcher submits a large job that takes up a large proportion of the cores, so if the queue to launch a job is longer than 2 hours I should request that my job be moved up in the queue. I can even email them in advance of starting the job.
I can set an option to email me a notification when the job is launched from the queue
run time on these nodes defaults to 72 hours, but the runtime in the slurm script can be set to a max of 900:00:00 (900 hours, about 37 days), so effectively there is no limit because none of our jobs will take that long
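for reference, a minimal sketch of the relevant #SBATCH directives for the notification and time-limit notes above (the job name, time limit, and email address are placeholders):

    #!/bin/bash
    #SBATCH --job-name=species-analysis   # placeholder job name
    #SBATCH --time=120:00:00              # wall-clock limit; default is 72h, max 900:00:00
    #SBATCH --mail-type=BEGIN,END,FAIL    # email when the job launches, finishes, or fails
    #SBATCH --mail-user=me@ucsb.edu       # placeholder address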
by default, launching a slurm job on pod or knot will request a normal memory node, and if I experience memory limitations I simply need to request a high memory node in either the sbatch command or within the shell script itself
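a sketch of what that request might look like; the largemem partition name and the memory size are assumptions, so confirm the actual high-memory partition name with Paul and Jay:

    # in the job script:
    #SBATCH --partition=largemem   # assumed name of the high-memory partition
    #SBATCH --mem=128G             # example per-node memory request

    # or directly on the sbatch command line:
    sbatch --partition=largemem --mem=128G run_analysis_job.sh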
similarly, I can request an interactive session so I can run scripts by hand instead of launching them from the shell script
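for example, a standard way to get an interactive shell on a compute node with srun (add partition or memory flags as needed):

    # request a single-node interactive session from slurm
    srun --nodes=1 --ntasks-per-node=1 --pty bash -i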
by default, persistent storage in /scratch or /bigscratch is a bit under 1 TB, which is probably enough for our needs, so I don't need to worry about requesting more in advance of running a job until I know I need more
recommended to store input data in /scratch before launching the job
for logging, it is fastest to write to /tmp and transfer the file off the node after the workflow is done but before the job completes
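a minimal sketch of the staging and logging pattern from the last two notes, inside the job script (the input file, workflow script, and log paths are hypothetical):

    # input data staged in /scratch before the job was submitted
    INPUT=/scratch/$USER/species_points.csv
    # write the log to fast node-local /tmp while the job runs
    LOG=/tmp/analysis_${SLURM_JOB_ID}.log

    ./run_analysis.sh "$INPUT" > "$LOG" 2>&1   # hypothetical workflow script

    # /tmp is node-local, so copy the log back to persistent storage before the job ends
    mkdir -p /scratch/$USER/logs
    cp "$LOG" /scratch/$USER/logs/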