open-mpi / mtt

MPI Testing Tool
https://open-mpi.github.io/mtt

Add LoadLeveler Support #65

Closed ompiteam closed 10 years ago

ompiteam commented 10 years ago

IU's BigRed supercomputer uses LoadLeveler as the scheduler. The admins for that cluster have given us 1/4 of the machine to test on nightly. That's 128 dual-processor machines with Myrinet. {{{ http://rac.uits.iu.edu/rats/research/bigred/hardware.shtml }}}

We would like to run MTT on this machine, and I think the only stumbling block is that MTT doesn't support LoadLeveler. To support this scheduler you just need to look for the following environment variable.

{{{
# Of the form "hostA hostA hostB hostB hostC"
LOADL_PROCESSOR_LIST=hostA hostA hostB hostB hostC
}}}
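Since the MTT client is written in Perl, a minimal sketch of how a Perl client could turn this variable into a host list and slot count might look like the following (hypothetical helper for illustration, not the actual MTT code):

{{{
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: derive the slot list from LoadLeveler's environment.
# Each hostname appears once per processor allocated on that node.
sub loadleveler_slots {
    my $list = $ENV{LOADL_PROCESSOR_LIST};
    return () unless defined $list && length $list;
    return split ' ', $list;    # e.g. ("hostA", "hostA", "hostB", ...)
}

my @slots = loadleveler_slots();
my %per_host;
$per_host{$_}++ for @slots;

printf "Total slots: %d\n", scalar @slots;
printf "  %s: %d slot(s)\n", $_, $per_host{$_} for sort keys %per_host;
}}}

Counting the duplicate hostnames gives a per-node slot count, which is what the client ultimately needs when deciding how many processes to launch.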

Open MPI can run on this cluster with the POE RAS (pending a configuration commit this evening) and RSH PLS.

ompiteam commented 10 years ago

Imported from trac issue 64. Created by jjhursey on 2006-09-08T12:39:39, last modified: 2006-09-12T12:48:31

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-08 12:43:11:

That's relatively easy to do.

As a sidenote -- are there plans to add real LL support to OMPI? (e.g., to have a LL PLS)

ompiteam commented 10 years ago

Trac comment by jjhursey on 2006-09-08 12:46:05:

On BigRed we only need the POE RAS (which is essentially LoadLeveler) and the RSH PLS. At the moment I don't know if there is another process launch mechanism for LoadLeveler, but it seems that BigRed is set up to use the RSH PLS correctly.

We were able to run some simple hello world tests on it yesterday using just the POE RAS and RSH PLS. I'm running the IBM test suite now to make sure there aren't any other missing bits.

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-12 09:17:55:

This may be impacted by an IU discovery yesterday that the LOADL_PROCESSOR_LIST variable is only loaded for "small" allocations. For larger allocations, you have to use a C API to get the node list. Not sure what MTT will do in this situation (because the MTT client is 100% Perl). More details to come...

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-12 09:22:15:

(In [320]) Refs #64.

This will work for "small" loadleveler allocations. Need to figure out what to do for "large" loadleveler allocations (where the list of nodes is not available in the $LOADL_PROCESSOR_LIST environment variable).

ompiteam commented 10 years ago

Trac comment by jjhursey on 2006-09-12 09:31:17:

There is a LoadLeveler C library that you all may be able to use. This is how we will likely get around this in Open MPI.

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-12 09:34:07:

Right, but the MTT client also needs to know how many processes it can launch, and the MTT client is in perl.

Is there an environment variable in LoadLeveler that tells us how many processes we can launch (regardless of the allocation size)? That would fix this problem.

ompiteam commented 10 years ago

Trac comment by jjhursey on 2006-09-12 09:48:43:

It doesn't seem to export such a variable. Below is a list of all the LoadLeveler variables that are exported into the environment:

{{{
[jjhursey@s10c2b2 misc] cat ~/env.txt | grep LOAD
LOADL_STEP_CLASS=FAST
LOADL_STEP_NICE=0
LOADL_PROCESSOR_LIST=s10c1b4.dim s10c1b4.dim s10c1b3.dim s10c1b3.dim s10c1b2.dim s10c1b2.dim
LOADL_STEP_NAME=0
LOADL_ACTIVE=3.3.2.5
LOADL_STEP_ID=s10c2b5.dim.2134.0
LOADLBATCH=yes
LOADL_STEP_INITDIR=/N/u/jjhursey/BigRed/src/mpi/misc
LOADL_STEP_IN=/dev/null
LOADL_PID=20466
LOADL_STEP_TYPE=PARALLEL
LOADL_STEP_OUT=/N/u/jjhursey/BigRed/tmp/ll-run-output.stdout
LOADL_STEP_OWNER=jjhursey
LOADL_STEP_ARGS=
LOADL_STEP_GROUP=No_Group
LOADL_COMM_DIR=/tmp
LOADL_STEP_ERR=/N/u/jjhursey/BigRed/tmp/ll-run-output.stderr
LOADL_STARTD_PORT=9611
LOADL_STEP_ACCT=
LOADL_STEP_COMMAND=/N/u/jjhursey/BigRed/svn/cloud9/src/mpi/misc/test-script.sh
LOADL_JOB_NAME=Simple_mpi_test
}}}

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-12 10:07:40:

(In [321]) Refs #64

Make the LoadLeveler support a little bit more robust -- check for LOADLBATCH to see if we're in a LL job, not LOADL_PROCESSOR_LIST, because PROCESSOR_LIST may be empty/not there. If PROCESSOR_LIST is empty/not there, default to 2.
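In Perl terms, the check described above amounts to something like this sketch (illustrative only; the real change is the r321 commit in the MTT client):

{{{
use strict;
use warnings;

# Sketch of the r321 logic: use LOADLBATCH to detect a LoadLeveler job,
# then fall back to 2 slots when LOADL_PROCESSOR_LIST is empty or unset.
my $in_ll_job = defined $ENV{LOADLBATCH} && $ENV{LOADLBATCH} eq "yes";

my $max_procs = 2;    # conservative default
if ($in_ll_job && defined $ENV{LOADL_PROCESSOR_LIST}
        && $ENV{LOADL_PROCESSOR_LIST} =~ /\S/) {
    my @slots = split ' ', $ENV{LOADL_PROCESSOR_LIST};
    $max_procs = scalar @slots;
}

print "max_procs = $max_procs\n";
}}}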

ompiteam commented 10 years ago

Trac comment by jsquyres on 2006-09-12 12:48:31:

So I think r321 brought in "good enough" support for small LL runs. Josh is looking into a possible CPAN module for LL support (http://search.cpan.org/~hawkinsm/IBM-LoadLeveler-1.05/LoadLeveler.pod).

I'm going to close out this bug for the moment because it should be working for small allocations. Let's open a different bug for integrating the CPAN module.