open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Oversubscribe via MCA configuration file or environment variable #8955

Closed dalcinl closed 3 years ago

dalcinl commented 3 years ago

I'm using ompi/master and cannot figure out how to turn on oversubscription using mca-params.conf or environment variables. Is this supposed to work? I tried the old config variable names, as well as the new ones reported by prte_info, and none of them work; the variables appear to be simply ignored.

jjhursey commented 3 years ago

This OpenPMIx/PRRTE PR might be related: https://github.com/openpmix/openpmix/issues/2192

rhc54 commented 3 years ago

When you say "turn on oversubscription", what precisely do you mean? Are you trying to "allow oversubscribing of nodes"? Or something else?

dalcinl commented 3 years ago

@rhc54 Yes, I'm trying to find a way to permanently configure a node to allow oversubscription, without having to pass special flags to mpiexec.

With Open MPI v4.x, I simply set rmaps_base_oversubscribe=true via config file or environment variable, for example:

rhc54 commented 3 years ago

OMPI master (and the v5.0 branch) has moved to PRRTE as its runtime. Thus:

The format of the param file remains the same.

dalcinl commented 3 years ago

I tried that before opening this issue. Just to confirm, I just ran git pull and rebuilt ompi/master.

The environment variable does not work, see below. As for the config file, you did not specify where it should go, so I created it in $HOME/.openmpi/ and set rmaps_base_oversubscribe=true; it does not work. I also tried creating both prte-mca-params.conf and mca-params.conf in $HOME/.prte/ and setting rmaps_base_oversubscribe=true; that does not work either.

$ PRTE_MCA_rmaps_base_oversubscribe=1 mpiexec -n 32 /usr/bin/true
$ PRTE_MCA_rmaps_base_oversubscribe=1 mpiexec -n 33 /usr/bin/true
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 33
slots that were requested by the application:

  /usr/bin/true

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRTE defaults to the number of processor cores

In all the above cases, if you want PRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
rhc54 commented 3 years ago

Sorry - the parameter is wrong:

PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

Note the colon needs to be in the front of the qualifier - ":oversubscribe"
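
For reference, a minimal sketch of applying this parameter to a whole shell session instead of a single command. It assumes a PRRTE-based build (OMPI master or the v5.0 branch) and that mpiexec is on PATH; the 33-process launch simply mirrors the failing command shown earlier in this thread.

```shell
# Export the default mapping policy so every later mpiexec started from
# this shell inherits it. The leading colon marks "oversubscribe" as a
# qualifier rather than a mapping policy name.
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

# Only attempt a launch if mpiexec is actually installed.
if command -v mpiexec >/dev/null 2>&1; then
    mpiexec -n 33 /usr/bin/true \
        || echo "launch failed; this build may not honor the variable"
fi
```

Whether the variable is honored depends on the PRRTE revision bundled with the build, as the rest of this thread shows.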

dalcinl commented 3 years ago

OK, I kind of got it working. Just to clarify, the location of the config file should be $HOME/.prte/prte-mca-params.conf. I tried putting config files in $HOME/.openmpi/ and/or naming them without the initial prte- prefix, and they seem to be ignored, despite PRRTE's documentation.

However, I'm seeing huge performance degradation when using oversubscription in GitHub Actions workers. @rhc54 I guess this is what #8998 attempts to fix. Am I right?

rhc54 commented 3 years ago

> the location of the config file should be $HOME/.prte/prte-mca-params.conf

I corrected the name in the code so it matches the documentation - i.e., "mca-params.conf". I'm not sure why you expected the file to be found in $HOME/.openmpi - the documentation doesn't mention that directory. The PRRTE file has to be in $HOME/.prte. The system-level default is in the $sysconf directory and named prte-mca-params.conf in case other packages share that directory.
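
A minimal sketch of the per-user setup described in this thread, assuming the `name = value` param-file format carried over from Open MPI. Because the file name that is actually read changed during this discussion (prte-mca-params.conf before the fix, mca-params.conf after it), the sketch writes the setting under both names; that choice is an assumption for illustration, not documented behavior.

```shell
# Per-user PRRTE configuration directory named in this thread.
conf_dir="$HOME/.prte"
mkdir -p "$conf_dir"

# Write the setting under both candidate file names, since which one is
# read depends on the PRRTE revision in use.
for name in mca-params.conf prte-mca-params.conf; do
    conf_file="$conf_dir/$name"
    # Append only if the file does not already define the parameter
    # (-s keeps grep quiet when the file does not exist yet).
    if ! grep -qs '^rmaps_default_mapping_policy' "$conf_file"; then
        echo 'rmaps_default_mapping_policy = :oversubscribe' >> "$conf_file"
    fi
done
```

Running it twice leaves a single entry per file, so it is safe to keep in a provisioning script.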

I suspect you are right about the degradation, but the best way to find out would be for you to try that PR and see if it helps.

dalcinl commented 3 years ago

I had no expectations. I'm not sure why you seem to blame me for trying alternatives when the documented one does not work.

rhc54 commented 3 years ago

Who said anything about blame??? Ease up a little, dude.

gonzalobg commented 2 years ago

@rhc54 Hi Ralph, I think I am running into this same issue with the latest Open MPI.

The following works:

$ mpirun  --oversubscribe -np 200 ./example

The following fails:

$ mpirun  -np 200 ./example

There are not enough slots available in the system to satisfy the 200
slots that were requested by the application:

  ./example

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.

According to the comment thread on this issue, I'd expect the following to work:

$ PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe mpirun  -np 200 ./example

but it fails with the same error as above.

How can I avoid having to pass mpirun the --oversubscribe option via an environment variable?

Thanks Gonzalo

gonzalobg commented 2 years ago

It seems that OMPI_MCA_rmaps_base_oversubscribe=true works.

rhc54 commented 2 years ago

$ PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe mpirun -np 200 ./example

It works for me on the head of PRRTE master and v3.0 branches. Not sure what version of OMPI you are using, or what hash of PRRTE it is using.
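
Since the working variable appears to depend on the version in use, here is a rough sketch for checking it. The helper name `pick_oversubscribe_var` is hypothetical, and the mapping (v4.x and older use rmaps_base_oversubscribe, v5.0 and newer use the PRRTE mapping policy) is only what this thread suggests. It also assumes the first line of `mpirun --version` contains an X.Y.Z version number.

```shell
# Hypothetical helper: map a version banner to the setting named in this thread.
pick_oversubscribe_var() {
    # Pull the first X.Y or X.Y.Z token out of the banner text.
    version=$(printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
    major=${version%%.*}
    # Bail out if no numeric major version was found.
    case "$major" in
        ''|*[!0-9]*) return 1 ;;
    esac
    if [ "$major" -ge 5 ]; then
        echo 'PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe'
    else
        echo 'OMPI_MCA_rmaps_base_oversubscribe=true'
    fi
}

# Report the installed version and the matching setting, if mpirun exists.
if command -v mpirun >/dev/null 2>&1; then
    banner=$(mpirun --version 2>&1 | head -n 1)
    echo "detected: $banner"
    pick_oversubscribe_var "$banner" || echo "could not parse a version"
fi
```

This does not identify the PRRTE hash a build was made from; it only narrows down which of the two variables from this thread is worth trying first.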