glotzerlab / signac-flow

Workflow management for signac-managed data spaces.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Refactor directives. #785

Open joaander opened 10 months ago

joaander commented 10 months ago

Feature description

Proposed solution

Replace directives with the new schema:

| New schema | Description |
| --- | --- |
| executable | Same as the previous executable. |
| walltime | Same as the previous walltime. |
| launcher | Selects which launcher to use: None (the default) or 'mpi'. |
| processes | The number of processes to execute. Equivalent to the previous np when launcher is None and to nranks when launcher == 'mpi'. |
| threads_per_process | Replaces the previous omp_num_threads with a more general term. Flow will always set OMP_NUM_THREADS when threads_per_process is greater than 1. |
| gpus_per_process | The number of GPUs to schedule per process. Replaces the previous aggregate gpu. |
| memory_per_cpu | The amount of memory to request per CPU thread. Replaces the previous aggregate memory with a quantity that is more naturally expressed and easier to set appropriately for a given machine configuration. |
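
For illustration only, here is a minimal sketch of how the new schema might look when attached to an operation, assuming flow keeps the current directives= keyword on the operation decorator (the decorator interface and the walltime units shown are assumptions, not part of this proposal):

from flow import FlowProject


class Project(FlowProject):
    pass


# Hypothetical usage of the proposed schema; current flow would reject these
# keys until the refactor is implemented.
@Project.operation(directives={
    'launcher': 'mpi',            # launch ranks with the system MPI launcher
    'processes': 16,              # 16 MPI ranks (previously nranks)
    'threads_per_process': 4,     # flow would set OMP_NUM_THREADS=4
    'gpus_per_process': 1,        # one GPU per rank
    'memory_per_cpu': '2g',       # per-CPU memory request
    'walltime': 12,               # same meaning as the previous walltime
})
def simulate(job):
    print(job)


if __name__ == '__main__':
    Project().main()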

processor_fraction is not present in the new schema. It is not implementable in any batch scheduler currently in production use. If users desire to oversubscribe resources with many short tasks, they can use an interactive job and run --parallel.

fork should also be removed. Flow automatically decides to fork when needed.

Additional context

This design would solve #777, provide a more understandable schema for selecting resources, and reduce the effort needed to develop future cluster job templates.

When launcher is None, processes are executed directly and the job is limited to a single node. When launcher == 'mpi', processes are launched through the MPI launcher and may span multiple nodes.

launcher is a string to allow for potential future expansion to non-MPI launchers capable of distributing processes across multiple nodes (see #220).
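
To make the two modes concrete, here is a hypothetical sketch of how a cluster job template could total up the scheduler request from the new directives; the function name, defaults, and return values are illustrative assumptions, not flow's actual implementation:

def total_request(directives):
    """Illustrative only: derive the aggregate resource request."""
    processes = directives.get('processes', 1)
    threads = directives.get('threads_per_process', 1)
    gpus = directives.get('gpus_per_process', 0)
    launcher = directives.get('launcher', None)

    request = {'cpus': processes * threads, 'gpus': processes * gpus}
    if launcher is None:
        # Processes are executed directly, so the request must fit on one node.
        request['nodes'] = 1
    elif launcher == 'mpi':
        # Ranks are distributed by the MPI launcher and may span nodes; the
        # template computes the node count from the cpu/gpu totals.
        pass
    else:
        raise ValueError(f'Unknown launcher: {launcher}')
    return request


print(total_request({'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}))
# {'cpus': 512, 'gpus': 0}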

This refactor solves issues discussed in #777, #455, #115, and #235.

joaander commented 10 months ago

With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs, and use that information to provide the user with an error when they request more.

Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically need to set memory only on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make the default be the maximum on systems where it is not.
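
As a concrete sketch of that alternative, an environment could carry a per-partition maximum and fall back to it when the user does not set memory_per_cpu. The partition names, values, and helper below are hypothetical:

# Hypothetical per-partition metadata; the values are invented for illustration.
MAX_MEMORY_PER_CPU = {
    'standard': '7g',
    'largemem': '41g',
}


def resolve_memory_per_cpu(directives, partition):
    """Default the memory request to the partition maximum when unset."""
    requested = directives.get('memory_per_cpu')
    return requested if requested is not None else MAX_MEMORY_PER_CPU[partition]


print(resolve_memory_per_cpu({'processes': 8}, 'standard'))  # '7g'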

joaander commented 10 months ago

Here are some example directives.

# serial
directives = {'processes': 1}

# multiprocessing and/or threaded application on a single node
directives = {'processes': 8}
directives = {'processes': 1, 'threads_per_process': 8}

# OpenMP application on a single node
directives = {'processes': 1, 'threads_per_process': 8}

# GPU application on a single node
directives = {'processes': 1, 'gpus_per_process': 1}

# MPI application on 1 or more nodes
directives = {'processes': 512, 'launcher': 'mpi'}
directives = {'processes': 512, 'gpus_per_process': 1, 'launcher': 'mpi'}

# Hybrid MPI/OpenMP application on 1 or more nodes
directives = {'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}

bdice commented 10 months ago

@joaander I want to give my vote of support for this idea. The landscape of HPC clusters has continued to evolve since I was last actively involved in signac-flow's cluster templates. It seems things have solidified a bit more around core concepts and "directives" that are aligned with the above proposal. I am also generally appreciative and supportive of you proposing and pursuing significant changes like this. 👍

joaander commented 10 months ago

@bdice Thank you for reviewing the proposal and your positive comments.

joaander commented 10 months ago

With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs, and use that information to provide the user with an error when they request more.

To support this on GPU partitions, we would also need a memory_per_gpu directive and to know the total memory per GPU on each GPU partition. The alternative would be to not attempt to warn users about memory usage and instead expect users to set memory_per_cpu on GPU partitions correctly, in a way commensurate with their usage. For example, with 64 GB available per GPU:

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_cpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_cpu': '8g'}

vs.

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}

In SLURM, --mem-per-cpu and --mem-per-gpu are mutually exclusive.
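
A hypothetical sketch of how a SLURM template could emit one of the two mutually exclusive options (the directive handling below is an assumption, not flow's current behavior):

def memory_option(directives):
    """Illustrative only: emit --mem-per-gpu or --mem-per-cpu, never both."""
    per_cpu = directives.get('memory_per_cpu')
    per_gpu = directives.get('memory_per_gpu')
    if per_cpu and per_gpu:
        raise ValueError('--mem-per-cpu and --mem-per-gpu are mutually exclusive')
    if per_gpu:
        return f'#SBATCH --mem-per-gpu={per_gpu}'
    if per_cpu:
        return f'#SBATCH --mem-per-cpu={per_cpu}'
    return ''  # let the partition default apply


print(memory_option({'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}))
# #SBATCH --mem-per-gpu=64g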

bcrawford39GT commented 10 months ago

@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if we can just omit it when needed, that would also be great!

b-butler commented 10 months ago

@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if we can just omit it when needed, that would also be great!

@bcrawford39GT gpus_per_process isn't expected to be a scheduler setting here. It is an abstract request that flow instantiates into the appropriate commands for the submission script.
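
For example, here is a hypothetical sketch of how that abstract request might be instantiated into an srun prefix on a SLURM machine (the exact flags a template would emit are an assumption):

def srun_prefix(directives):
    """Illustrative only: build an srun prefix from abstract directives."""
    parts = [f"srun --ntasks={directives.get('processes', 1)}"]
    threads = directives.get('threads_per_process', 1)
    if threads > 1:
        parts.append(f'--cpus-per-task={threads}')
    gpus = directives.get('gpus_per_process', 0)
    if gpus:
        parts.append(f'--gpus-per-task={gpus}')
    return ' '.join(parts)


print(srun_prefix({'processes': 4, 'gpus_per_process': 1, 'launcher': 'mpi'}))
# srun --ntasks=4 --gpus-per-task=1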

cbkerr commented 10 months ago

Thank you! I was always confused by the differences between how flow does things and how user guides for SLURM and other schedulers describe things... processes, threads, ranks, oh my!

Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically need to set memory only on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make the default be the maximum on systems where it is not.

If there is usually no cost to the amount of memory requested, I highly support this change to make flow easier to use. I know people who have had jobs confusingly canceled because they ran out of memory.

Flow could print that it automatically selected the maximum allowed for the allocation, for instance:

Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.

bcrawford39GT commented 10 months ago

@cbkerr Signac should likely still support the memory_per_cpu directive, as the equivalent (--mem-per-cpu) is available in SLURM, and supporting it will minimize maintenance later. Georgia Tech's system allows it. People may run heterogeneous processes with different CPU core counts and want to specify their RAM on a per-CPU basis.

There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC or cloud computing system you are using.

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

joaander commented 10 months ago

There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC or cloud computing system you are using.

We discussed this offline. We plan to make memory_per_cpu and memory_per_gpu available, but by default set them to the maximum the selected partition allows without incurring extra charges. This default behavior should suit the vast majority of users. Specific environments may choose not to set these defaults (e.g. a memory request is unnecessary for whole-node jobs).

Users that request more than the maximum not only incur extra charges, but may also end up with broken SLURM scripts. For example, I recently tested Purdue Anvil with --ntasks=16 and --mem-per-cpu=2g. It turns out that the maximum is 1918m per core, so SLURM assigned my job 18 cores. In this configuration, neither mpirun -n 16 nor srun -n 16 was able to bind ranks to the expected 16 cores, and both threw errors.
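
The Anvil numbers work out as follows, assuming SLURM rounds the core count up to cover the total memory request:

import math

requested_mib = 16 * 2 * 1024      # --ntasks=16 at --mem-per-cpu=2g -> 32768 MiB
max_per_core_mib = 1918            # Anvil's per-core maximum
print(math.ceil(requested_mib / max_per_core_mib))  # 18 cores, as observed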

Note that because Anvil automatically scales the CPU request with the memory request, there is never a reason to request less than the maximum; by doing so, you only risk out-of-memory errors in your job. The same goes for Bridges-2, which errors at submission time when you request more than the maximum.

On systems that both default to less than the maximum and allow users to oversubscribe memory and undersubscribe CPUs (Georgia Tech, UMich Great Lakes, the Expanse shared queue), users may wish to request less than the maximum (without incurring extra charges). However, the best a user can hope to achieve by doing so is to gain some goodwill with the rest of the system's user community, especially with those who request more than the maximum.

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

Yes, this is standard SLURM behavior. I do not recommend using --mem at all in signac-flow. It is a per-node quantity, and flow does not always know at submission time exactly how many nodes SLURM will eventually schedule the job across.

joaander commented 10 months ago

Flow could print that it automatically selected the maximum allowed for the allocation, for instance:

Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.

Please, only when verbose output is requested, if at all. This information is already in the --pretend output and can be verified there when needed.

bcrawford39GT commented 10 months ago

@joaander Yeah, I think what you are saying makes sense. We just need to get rid of Signac's automatic printing of --mem, or provide a way to disable it, and it should be good.

joaander commented 3 months ago

The syntax described here informed the design of row.

Work on implementing this in flow has started in #819. I do not plan to finish that work myself.