radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

How to split a limited number of GPUs on a large node? #2825

Open · eirrgang opened this issue 1 year ago

eirrgang commented 1 year ago

gpus_per_rank is an integer value, which assumes that a job uses a whole number of GPUs per rank - i.e. that the number of GPUs is a clean multiple of the number of ranks.

This does not fit well when the ratio of GPUs to cores is awkward, or when an application makes the best use of compute resources by running several processes per GPU.

For example, GROMACS MD simulations can distribute work across heterogeneous compute hardware, and typically reach optimal CPU utilization with a relatively small number of OpenMP threads per MPI rank. For instance, I might have 1, 2, or 4 jobs splitting 128 cores and 4 GPUs, with fewer than 32 OpenMP threads per rank.

Can GPUs be reserved independently of the number of cores?
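For concreteness, here is a minimal sketch of the scenario with the current integer attribute, assuming the present TaskDescription attribute names (`ranks`, `cores_per_rank`, `gpus_per_rank`) and a GROMACS command line chosen purely for illustration:

```python
import radical.pilot as rp

# One GROMACS-like task: 4 MPI ranks x 32 OpenMP threads = 128 cores.
# The intent is for all 4 ranks to share a single GPU, but an integer
# gpus_per_rank can only request 0 or >= 1 GPU *per rank*.
td = rp.TaskDescription()
td.executable     = 'gmx_mpi'
td.arguments      = ['mdrun', '-ntomp', '32']
td.ranks          = 4
td.cores_per_rank = 32
td.gpus_per_rank  = 1   # requests 4 GPUs in total; "1 GPU shared by the
                        # whole 4-rank task" is not expressible here
```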

andre-merzky commented 1 year ago

We plan to change gpus_per_rank to a float value so that, for example, 3 ranks could share a GPU. It is not yet fully clear how we will realize backend support for this, though - only LSF supports that natively at the moment. So it might take a while until this becomes useful for tasks on non-LSF resources.
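To make the intended semantics concrete, a minimal sketch of what such a fractional value might look like (hypothetical at the time of this comment; only the proposed float semantics differ from the current API):

```python
import radical.pilot as rp

# Sketch of the *proposed* fractional semantics: 3 ranks share 1 GPU.
# Not supported yet; shown only to make the intent concrete.
td = rp.TaskDescription()
td.ranks         = 3
td.gpus_per_rank = 1.0 / 3   # each rank gets a third of a GPU
```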

eirrgang commented 1 year ago

> We plan to change gpus_per_rank to a float value so that, for example, 3 ranks could share a GPU. It is not yet fully clear how we will realize backend support for this, though - only LSF supports that natively at the moment. So it might take a while until this becomes useful for tasks on non-LSF resources.

Ah. I hadn't thought about the degree or type of sharing. There may need to be an additional attribute or level of granularity; in the non-LSF case, a distinction may need to be made about which rank "owns" the GPU.

I don't think sharing a device between multiple processes is necessary for the feature to be useful. I think there is already sufficient application code in the wild that pins device usage to specific ranks or to rank 0. (I expect a lot of software evolves its multiprocessing and multi-GPU acceleration code paths somewhat independently.)

E.g., a single-node simulation with 1 GPU and MPI-based parallelism for the non-GPU part of the workload.
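A minimal sketch of that pattern, using mpi4py and CUDA_VISIBLE_DEVICES purely for illustration (GROMACS and similar codes handle this internally):

```python
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # rank 0 "owns" the node's single GPU
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    # ... GPU-offloaded part of the workload runs here ...
else:
    # all other ranks hide the GPU and contribute CPU-only work
    os.environ['CUDA_VISIBLE_DEVICES'] = ''
    # ... CPU-only part of the workload runs here ...

comm.Barrier()
```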