joaander opened this issue 1 year ago
With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs and use that information to provide the user with an error.
Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Users typically only need to set memory currently on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make that default be the maximum on systems where it is not.
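As a rough sketch of the validation idea (the helper and the partition.max_memory_per_cpu attribute below are hypothetical, not flow's actual API):

def check_memory_request(directives, partition):
    # Hypothetical check: partition.max_memory_per_cpu is assumed to hold the
    # largest per-CPU request that does not allocate extra CPUs, and both
    # values are assumed to be numbers in the same unit (e.g. MiB).
    requested = directives.get("memory_per_cpu")
    if requested is not None and requested > partition.max_memory_per_cpu:
        raise ValueError(
            f"memory_per_cpu={requested} exceeds the partition maximum "
            f"({partition.max_memory_per_cpu}) and would allocate extra CPUs."
        )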
Here are some example directives.
# serial
directives = {'processes': 1}
# multiprocessing and/or threaded app on a single node
directives = {'processes': 8}
directives = {'processes': 1, 'threads_per_process': 8}
# OpenMP application on a single node
directives = {'processes': 1, 'threads_per_process': 8}
# GPU application on a single node
directives = {'processes': 1, 'gpus_per_process': 1}
# MPI application on 1 or more nodes
directives = {'processes': 512, 'launcher': 'mpi'}
directives = {'processes': 512, 'gpus_per_process': 1, 'launcher': 'mpi'}
# Hybrid MPI/OpenMP application on 1 or more nodes
directives = {'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}
@joaander I want to give my vote of support for this idea. The landscape of HPC clusters has continued to evolve since I was last actively involved in signac-flow's cluster templates. It seems things have solidified a bit more around core concepts and "directives" that are aligned with the above proposal. I am also generally appreciative and supportive of you proposing and pursuing significant changes like this. 👍
@bdice Thank you for reviewing the proposal and your positive comments.
With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs and use that information to provide the user with an error.

To support this on GPU partitions we would also need a memory_per_gpu directive and to know the total memory per GPU on each GPU partition. The alternative would be to not attempt to warn users about memory usage and instead expect users to correctly set memory_per_cpu on GPU partitions in a way commensurate with their usage. For example, with 64 GB available per GPU:
directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_cpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_cpu': '8g'}
vs.
directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}
In SLURM, --mem-per-cpu and --mem-per-gpu are mutually exclusive.
@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!
For the Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process, but if we can just omit that when needed, that would also be great!
@bcrawford39GT gpus_per_process isn't expected to be a scheduler setting here. It is an abstract request that flow instantiates into the appropriate commands for the submission script.
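For illustration only (this helper and the flag choice are assumptions, not flow's actual template code), a SLURM template might expand the abstract request along these lines:

def slurm_gpu_options(directives):
    # Illustrative sketch: translate the abstract gpus_per_process request into
    # a --gpus-per-task option; other templates could use --gres=gpu:N instead.
    gpus_per_process = directives.get("gpus_per_process", 0)
    return [f"--gpus-per-task={gpus_per_process}"] if gpus_per_process else []

The point is that users state the abstract request once, and each cluster template decides how to spell it for its scheduler.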
Thank you! I was always confused by the differences in how flow does things and how user guides for SLURM etc. describe things... processes, threads, ranks, oh my!
Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Users typically only need to set memory currently on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make that default be the maximum on systems where it is not.
If there is usually no cost to the amount of memory requested, I highly support this change to make using flow easier for users. I know people who have had jobs confusingly canceled due to running out of memory.
Flow could print that it automatically selected the maximum allowed for the allocation, for instance:
Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.
@cbkerr Signac should likely support memory_per_cpu, as it is available in Slurm, and it will minimize maintenance later. Georgia Tech's system allows it. People may run heterogeneous processes with different CPU core counts and want to spec out their RAM on a per-CPU basis.
There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC system or cloud compute system you are using.
We discussed this offline. We plan to make memory_per_cpu and memory_per_gpu available BUT by default set them to the maximum allowable by the selected partition without incurring extra charges. This default behavior should suit the vast majority of users. Specific environments may choose to not set these defaults if desired (e.g. a memory request is unnecessary on whole-node jobs).
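A minimal sketch of that default behavior, assuming the environment records per-partition limits (the attribute names are hypothetical):

def apply_memory_defaults(directives, partition):
    # Default to the largest request that incurs no extra charge. Environments
    # where whole-node jobs make a memory request unnecessary could skip this.
    if directives.get("gpus_per_process"):
        directives.setdefault("memory_per_gpu", partition.max_memory_per_gpu)
    else:
        directives.setdefault("memory_per_cpu", partition.max_memory_per_cpu)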
Users that request more than the maximum will not only incur extra charges, but may also end up with broken SLURM scripts. For example, I recently tested Purdue Anvil with --ntasks=16, --mem-per-cpu=2g. It turns out that the maximum memory is 1918m, and SLURM thus assigned my job 18 cores. In this configuration both mpirun -n 16 and srun -n 16 were not able to bind ranks to the expected 16 cores and threw errors.
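The 18-core assignment is consistent with SLURM growing the CPU count until the total memory request fits under the per-CPU cap (my reading of the observed behavior, not a statement from the SLURM documentation):

import math

requested = 16 * 2048                       # --ntasks=16 at --mem-per-cpu=2g -> 32768 MiB
max_per_cpu = 1918                          # the partition's per-CPU limit in MiB
print(math.ceil(requested / max_per_cpu))   # 18 cores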
Note that because Anvil automatically scales the CPU request with the memory request, there is no reason to ever request anything less than the maximum. By doing so, you risk out-of-memory errors in your job. The same goes for Bridges-2, which errors out at submission time when you request more than the maximum.
On systems that both default to less than the maximum and allow users to oversubscribe memory and undersubscribe CPUs (Georgia Tech, UMich Great Lakes, Expanse shared queue), users may wish to request less than the maximum (without incurring extra charges). However, the best a user can ever hope to achieve by this is to gain some goodwill with the rest of the system's user community - especially those that request more than the maximum.
For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.
Yes, this is standard SLURM behavior. I do not recommend the use of --mem at all in signac-flow. It is a per-node quantity, and flow does not know (in all cases) at submission time exactly how many nodes SLURM will eventually schedule the job to.
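For instance (illustrative numbers only, not any specific cluster), the per-node value would differ with the node count that SLURM chooses:

import math

processes = 512
mem_per_cpu = 1918                          # MiB, illustrative value
for cores_per_node in (64, 128):            # two hypothetical partitions
    nodes = math.ceil(processes / cores_per_node)
    print(f"--mem={(processes // nodes) * mem_per_cpu}M across {nodes} nodes")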
Flow could print that it automatically selected the maximum allowed for the allocation, for instance:
Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.
Please, only when verbose output is requested - if at all. This information is in the --pretend output and can be verified there when needed.
@joaander Yeah, I think what you are saying makes sense. We just need to get rid of Signac's automatic printing of --mem, or provide a way to remove it from auto-printing, and it should be good.
Feature description
Proposed solution
- executable: same as the current executable.
- walltime: same as the current walltime.
- launcher: None (the default) or 'mpi'.
- processes: replaces np when launcher is None and nranks when launcher == 'mpi'.
- threads_per_process: replaces omp_num_threads with a more general term. Flow will always set OMP_NUM_THREADS when threads_per_process is greater than 1.
- gpus_per_process: replaces gpu.
- memory_per_cpu: replaces memory with a more naturally expressible quantity and one that is easier to set appropriately based on the machine configuration.
- processor_fraction is not present in the new schema. It is not implementable in any batch scheduler currently in production use. If users desire to oversubscribe resources with many short tasks, they can use an interactive job and run --parallel.
- fork should also be removed. Flow automatically decides to fork when needed.

Additional context
This design would solve #777, provide a more understandable schema for selecting resources, and reduce the effort needed to develop future cluster job templates.
When launcher is None: flow executes the operation's command directly on a single node, without a launcher prefix.
When launcher == 'mpi': flow will use srun, mpirun, or the appropriate machine-specific MPI launcher to distribute processes, threads, memory, and GPUs to the appropriate resources.

Bundling will need to account for launcher, processes, threads_per_process, gpus_per_process, and memory_per_cpu. Flow will raise an error for any invocation of --bundle --parallel.

launcher is a string to allow for potential future expansion to some non-MPI launcher capable of distributing processes to multiple nodes: see #220.

This refactor solves issues discussed in #777, #455, #115, #235.
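To make the replacements listed under Proposed solution concrete, here is the hybrid MPI/OpenMP example from the top of this thread written in both schemas (the old-schema spelling is inferred from those replacements, not copied from a tested script):

# Current schema (old directive names).
old_directives = {'nranks': 8, 'omp_num_threads': 64}
# Proposed schema: the same request.
new_directives = {'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}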