glotzerlab / signac-flow

Workflow management for signac-managed data spaces.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Refactor directives. #785

Open joaander opened 10 months ago

joaander commented 10 months ago

Feature description

Proposed solution

Replace directives with the new schema:

| New schema | Description |
| --- | --- |
| executable | Same as the previous executable. |
| walltime | Same as the previous walltime. |
| launcher | Selects which launcher to use: None (the default) or 'mpi'. |
| processes | The number of processes to execute. Equivalent to the previous np when launcher is None and to nranks when launcher == 'mpi'. |
| threads_per_process | Replaces the previous omp_num_threads with a more general term. Flow will always set OMP_NUM_THREADS when threads_per_process is greater than 1. |
| gpus_per_process | The number of GPUs to schedule per process. Replaces the previous aggregate gpu. |
| memory_per_cpu | The amount of memory to request per CPU thread. Replaces the previous aggregate memory with a quantity that is more naturally expressed and easier to set appropriately for a given machine configuration. |
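
For illustration only, here is a minimal sketch of how the new schema might look when attached to an operation, assuming flow keeps the current directives= keyword on the operation decorator (the decorator interface and the walltime units shown are assumptions, not part of this proposal):

from flow import FlowProject


class Project(FlowProject):
    pass


# Hypothetical usage of the proposed schema; current flow would reject these
# keys until the refactor is implemented.
@Project.operation(directives={
    'launcher': 'mpi',            # launch ranks with the system MPI launcher
    'processes': 16,              # 16 MPI ranks (previously nranks)
    'threads_per_process': 4,     # flow would set OMP_NUM_THREADS=4
    'gpus_per_process': 1,        # one GPU per rank
    'memory_per_cpu': '2g',       # per-CPU memory request
    'walltime': 12,               # same meaning as the previous walltime
})
def simulate(job):
    print(job)


if __name__ == '__main__':
    Project().main()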

processor_fraction is not present in the new schema. It is not implementable in any batch scheduler currently in production use. If users desire to oversubscribe resources with many short tasks, they can use an interactive job and run --parallel.

fork should also be removed. Flow automatically decides to fork when needed.

Additional context

This design would solve #777, provide a more understandable schema for selecting resources, and reduce the effort needed to develop future cluster job templates.

When launcher is None, processes are executed directly and the job is limited to a single node. When launcher == 'mpi', processes are launched through the MPI launcher and may span multiple nodes.

launcher is a string to allow for potential future expansion to non-MPI launchers capable of distributing processes across multiple nodes (see #220).
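
To make the two modes concrete, here is a hypothetical sketch of how a cluster job template could total up the scheduler request from the new directives; the function name, defaults, and return values are illustrative assumptions, not flow's actual implementation:

def total_request(directives):
    """Illustrative only: derive the aggregate resource request."""
    processes = directives.get('processes', 1)
    threads = directives.get('threads_per_process', 1)
    gpus = directives.get('gpus_per_process', 0)
    launcher = directives.get('launcher', None)

    request = {'cpus': processes * threads, 'gpus': processes * gpus}
    if launcher is None:
        # Processes are executed directly, so the request must fit on one node.
        request['nodes'] = 1
    elif launcher == 'mpi':
        # Ranks are distributed by the MPI launcher and may span nodes; the
        # template computes the node count from the cpu/gpu totals.
        pass
    else:
        raise ValueError(f'Unknown launcher: {launcher}')
    return request


print(total_request({'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}))
# {'cpus': 512, 'gpus': 0}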

This refactor solves issues discussed in #777, #455, #115, and #235.

joaander commented 10 months ago

With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs, and use that information to provide the user with an error when they request more.

Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically need to set memory only on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make the default be the maximum on systems where it is not.
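
As a concrete sketch of that alternative, an environment could carry a per-partition maximum and fall back to it when the user does not set memory_per_cpu. The partition names, values, and helper below are hypothetical:

# Hypothetical per-partition metadata; the values are invented for illustration.
MAX_MEMORY_PER_CPU = {
    'standard': '7g',
    'largemem': '41g',
}


def resolve_memory_per_cpu(directives, partition):
    """Default the memory request to the partition maximum when unset."""
    requested = directives.get('memory_per_cpu')
    return requested if requested is not None else MAX_MEMORY_PER_CPU[partition]


print(resolve_memory_per_cpu({'processes': 8}, 'standard'))  # '7g'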

joaander commented 10 months ago

Here are some example directives.

# serial
directives = {'processes': 1}

# multiprocessing and/or threaded application on a single node
directives = {'processes': 8}
directives = {'processes': 1, 'threads_per_process': 8}

# OpenMP application on a single node
directives = {'processes': 1, 'threads_per_process': 8}

# GPU application on a single node
directives = {'processes': 1, 'gpus_per_process': 1}

# MPI application on 1 or more nodes
directives = {'processes': 512, 'launcher': 'mpi'}
directives = {'processes': 512, 'gpus_per_process': 1, 'launcher': 'mpi'}

# Hybrid MPI/OpenMP application on 1 or more nodes
directives = {'processes': 8, 'threads_per_process': 64, 'launcher': 'mpi'}

bdice commented 10 months ago

@joaander I want to give my vote of support for this idea. The landscape of HPC clusters has continued to evolve since I was last actively involved in signac-flow's cluster templates. It seems things have solidified a bit more around core concepts and "directives" that are aligned with the above proposal. I am also generally appreciative and supportive of you proposing and pursuing significant changes like this. 👍

joaander commented 10 months ago

@bdice Thank you for reviewing the proposal and your positive comments.

joaander commented 10 months ago

With #784, we could store the maximum memory per CPU that a partition allows without allocating extra CPUs, and use that information to provide the user with an error when they request more.

To support this on GPU partitions, we would also need a memory_per_gpu directive and to know the total memory per GPU on each GPU partition. The alternative would be to not attempt to warn users about memory usage and instead expect users to set memory_per_cpu on GPU partitions correctly, in a way commensurate with their usage. For example, with 64 GB available per GPU:

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_cpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_cpu': '8g'}

vs.

directives = {'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}
directives = {'processes': 1, 'threads_per_process': 8, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}

In SLURM, --mem-per-cpu and --mem-per-gpu are mutually exclusive.
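
A hypothetical sketch of how a SLURM template could emit one of the two mutually exclusive options (the directive handling below is an assumption, not flow's current behavior):

def memory_option(directives):
    """Illustrative only: emit --mem-per-gpu or --mem-per-cpu, never both."""
    per_cpu = directives.get('memory_per_cpu')
    per_gpu = directives.get('memory_per_gpu')
    if per_cpu and per_gpu:
        raise ValueError('--mem-per-cpu and --mem-per-gpu are mutually exclusive')
    if per_gpu:
        return f'#SBATCH --mem-per-gpu={per_gpu}'
    if per_cpu:
        return f'#SBATCH --mem-per-cpu={per_cpu}'
    return ''  # let the partition default apply


print(memory_option({'processes': 1, 'gpus_per_process': 1, 'memory_per_gpu': '64g'}))
# #SBATCH --mem-per-gpu=64g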

bcrawford39GT commented 10 months ago

@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if we can just omit it when needed, that would also be great!

b-butler commented 10 months ago

@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs!

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive. I am not aware that we have a --gpus_per_process option, but if we can just omit it when needed, that would also be great!

@bcrawford39GT gpus_per_process isn't expected to be a scheduler setting here. It is an abstract request that flow instantiates into the appropriate commands for the submission script.
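
For example, here is a hypothetical sketch of how that abstract request might be instantiated into an srun prefix on a SLURM machine (the exact flags a template would emit are an assumption):

def srun_prefix(directives):
    """Illustrative only: build an srun prefix from abstract directives."""
    parts = [f"srun --ntasks={directives.get('processes', 1)}"]
    threads = directives.get('threads_per_process', 1)
    if threads > 1:
        parts.append(f'--cpus-per-task={threads}')
    gpus = directives.get('gpus_per_process', 0)
    if gpus:
        parts.append(f'--gpus-per-task={gpus}')
    return ' '.join(parts)


print(srun_prefix({'processes': 4, 'gpus_per_process': 1, 'launcher': 'mpi'}))
# srun --ntasks=4 --gpus-per-task=1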

cbkerr commented 10 months ago

Thank you! I was always confused by the differences between how flow does things and how user guides for SLURM and other schedulers describe things... processes, threads, ranks, oh my!

Alternatively, we could remove memory_per_cpu from this proposal and replace it with an automatic request for the maximum allowed on that partition. I can think of no use case where it is practical to request less than the maximum. Currently, users typically need to set memory only on systems such as Great Lakes, where the default is significantly smaller than the maximum. We could make the default be the maximum on systems where it is not.

If there is usually no cost to the amount of memory requested, I highly support this change to make flow easier to use. I know people who have had jobs confusingly canceled because they ran out of memory.

Flow could print that it automatically selected the maximum allowed for the allocation, for instance:

Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.

bcrawford39GT commented 10 months ago

@cbkerr Signac should likely still support the memory_per_cpu directive, as the equivalent (--mem-per-cpu) is available in SLURM, and supporting it will minimize maintenance later. Georgia Tech's system allows it. People may run heterogeneous processes with different CPU core counts and want to specify their RAM on a per-CPU basis.

There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC or cloud computing system you are using.

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

joaander commented 10 months ago

There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC or cloud computing system you are using.

We discussed this offline. We plan to make memory_per_cpu and memory_per_gpu available, but by default set them to the maximum the selected partition allows without incurring extra charges. This default behavior should suit the vast majority of users. Specific environments may choose not to set these defaults (e.g. a memory request is unnecessary for whole-node jobs).

Users that request more than the maximum not only incur extra charges, but may also end up with broken SLURM scripts. For example, I recently tested Purdue Anvil with --ntasks=16 and --mem-per-cpu=2g. It turns out that the maximum is 1918m per core, so SLURM assigned my job 18 cores. In this configuration, neither mpirun -n 16 nor srun -n 16 was able to bind ranks to the expected 16 cores, and both threw errors.
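
The Anvil numbers work out as follows, assuming SLURM rounds the core count up to cover the total memory request:

import math

requested_mib = 16 * 2 * 1024      # --ntasks=16 at --mem-per-cpu=2g -> 32768 MiB
max_per_core_mib = 1918            # Anvil's per-core maximum
print(math.ceil(requested_mib / max_per_core_mib))  # 18 cores, as observed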

Note that because Anvil automatically scales the CPU request with the memory request, there is never a reason to request less than the maximum; by doing so, you only risk out-of-memory errors in your job. The same goes for Bridges-2, which errors at submission time when you request more than the maximum.

On systems that both default to less than the maximum and allow users to oversubscribe memory and undersubscribe CPUs (Georgia Tech, UMich Great Lakes, the Expanse shared queue), users may wish to request less than the maximum (without incurring extra charges). However, the best a user can hope to achieve by doing so is to gain some goodwill with the rest of the system's user community, especially with those who request more than the maximum.

For Georgia Tech HPCs, --mem-per-cpu, --mem-per-gpu, and --mem are mutually exclusive.

Yes, this is standard SLURM behavior. I do not recommend using --mem at all in signac-flow. It is a per-node quantity, and flow does not always know at submission time exactly how many nodes SLURM will eventually schedule the job across.

joaander commented 10 months ago

Flow could print that it automatically selected the maximum allowed for the allocation, for instance:

Using environment configuration: Bridges2Environment
Selected max allowable memory $N GB per CPU.

Please, only when verbose output is requested, if at all. This information is already in the --pretend output and can be verified there when needed.

bcrawford39GT commented 10 months ago

@joaander Yeah, I think what you are saying makes sense. We just need to get rid of Signac's automatic printing of --mem, or provide a way to disable it, and it should be good.

joaander commented 3 months ago

The syntax described here informed the design of row.

Work on implementing this in flow has started in #819. I do not plan to finish that work myself.