We plan to use UCSB's HPC clusters to run the analysis on the new species (and maybe all species if the workflow changes a lot). There are several options, some of which cost money. Put together a guide for the differences between them and a recommendation for which to use.
Considerations:
the data size is not very large (point data), so we do not necessarily need a ton of parallelization or cores
the analysis may take a long time to run, so check cluster run time limits
we want to improve the way we log, so we may need fast I/O
After meeting with the UCSB HPC team, Paul and Jay, here are some notes regarding our options for this specific workflow:
conda should be installed in my home dir; it is not available machine-wide, and it does not need to go in /scratch because there are technically no tight storage quotas to worry about in the home dirs
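a minimal sketch of installing Miniconda into the home dir (the installer URL is the standard Linux one; the $HOME/miniconda3 prefix is just a choice):

    # download the standard Linux Miniconda installer
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # batch install into the home directory
    bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
    # set up conda for future shells
    "$HOME/miniconda3/bin/conda" init bash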
the 2 standard free options are pod and knot. the paid options cost way too much (thousands of dollars) and would not improve our experience enough to be worth it, considering that we do not need a ton of parallelization or storage
pod and knot have similar software, but knot's hardware is older than pod's, so jobs on knot will run slower but face a much shorter queue, while pod will have a longer queue but shorter runtimes
researchers running data analyses should be in communication with Paul and Jay about their jobs. This is a small operation, and sometimes a researcher submits a large job that takes up a large proportion of the cores, so if the queue to launch a job is longer than 2 hours I should request that my job be moved up in the queue. I can even email them in advance of starting the job.
I can set an option to email me a notification when the job is launched from the queue
run time on these nodes defaults to 72 hours, but the runtime in the slurm script can be set to a max of 900:00:00 (900 hours, about 37 days), so effectively there is no limit because none of our jobs will take that long
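for reference, a minimal sketch of the relevant #SBATCH directives for the notification and time-limit notes above (the job name, time limit, and email address are placeholders):

    #!/bin/bash
    #SBATCH --job-name=species-analysis   # placeholder job name
    #SBATCH --time=120:00:00              # wall-clock limit; default is 72h, max 900:00:00
    #SBATCH --mail-type=BEGIN,END,FAIL    # email when the job launches, finishes, or fails
    #SBATCH --mail-user=me@ucsb.edu       # placeholder address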
by default, launching a slurm job on pod or knot will request a normal memory node, and if I experience memory limitations I simply need to request a high memory node in either the sbatch command or within the shell script itself
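a sketch of what that request might look like; the largemem partition name and the memory size are assumptions, so confirm the actual high-memory partition name with Paul and Jay:

    # in the job script:
    #SBATCH --partition=largemem   # assumed name of the high-memory partition
    #SBATCH --mem=128G             # example per-node memory request

    # or directly on the sbatch command line:
    sbatch --partition=largemem --mem=128G run_analysis_job.sh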
similarly, I can request an interactive session so I can run scripts by hand instead of launching them from the shell script
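for example, a standard way to get an interactive shell on a compute node with srun (add partition or memory flags as needed):

    # request a single-node interactive session from slurm
    srun --nodes=1 --ntasks-per-node=1 --pty bash -i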
by default, persistent storage in /scratch or /bigscratch is a bit under 1 TB, which is probably enough for our needs, so I don't need to worry about requesting more in advance of running a job until I know I need more
recommended to store input data in /scratch before launching the job
for logging, it is fastest to write to /tmp and transfer the file off the node after the workflow is done but before the job completes
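a minimal sketch of the staging and logging pattern from the last two notes, inside the job script (the input file, workflow script, and log paths are hypothetical):

    # input data staged in /scratch before the job was submitted
    INPUT=/scratch/$USER/species_points.csv
    # write the log to fast node-local /tmp while the job runs
    LOG=/tmp/analysis_${SLURM_JOB_ID}.log

    ./run_analysis.sh "$INPUT" > "$LOG" 2>&1   # hypothetical workflow script

    # /tmp is node-local, so copy the log back to persistent storage before the job ends
    mkdir -p /scratch/$USER/logs
    cp "$LOG" /scratch/$USER/logs/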