open-mpi / ompi-ci-tests

Tests to run in the Open MPI CI

Level Setting: What are we running now? #1

jjhursey commented 4 years ago

During the Open MPI face-to-face we discussed moving some of the CI checks to AWS which can harness some parallel instances to speed things up. Then each organization can focus on testing special configurations in their environments.

To start this discussion I'd like to see what tests the various organizations are running in their CI. Once we have the list, we can work on removing duplicate effort. We can use this repo as needed to help facilitate this coordination.

Please reply with a comment listing what you are testing now.

jjhursey commented 4 years ago

What I'm looking for is:

jjhursey commented 4 years ago

IBM CI

IBM CI machines are located behind a firewall. As a result, our backend Jenkins service polls GitHub about every 5 minutes for changes to PRs. After a test finishes running we have custom scripts that push the results back as a Gist for the community to see.
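
For reference, the general shape of such a Gist push (this is a hedged sketch, not IBM's actual scripts; the token, file name, and payload are placeholders) against the GitHub Gist REST API is roughly:

shell$ curl -s -X POST https://api.github.com/gists \
           -H "Authorization: token $GITHUB_TOKEN" \
           -d '{"public": false, "description": "CI results", "files": {"ci-results.txt": {"content": "summary of build and test output"}}}'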

Our Open MPI testing is not too specialized at this point. We do have a "virtual cluster" capability available to the community that can scale to 254 nodes on demand. We currently limit the community to 160 nodes, but that can be adjusted.

Platform

Configure options

We run three concurrent builds. Currently, PGI is disabled but will be re-enabled soon.

Build options

Other build types

Tests run

We run across 10 machines with the GNU build, and 2 machines with the other builds. The goals of this testing are to:

  1. Verify that the build is functional and can pass messages, so we test C and Fortran with both a non-communicating and a communicating program (see the sketch just below this list).
  2. Verify that the multi-host launch works correctly.
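
As a rough sketch of that style of test (a hedged illustration using the hello_c.c and hello_usempi.f90 sources that ship in Open MPI's examples/ directory; the install path mirrors the mpirun example further below):

shell$ /workspace/exports/ompi/bin/mpicc examples/hello_c.c -o hello_c
shell$ /workspace/exports/ompi/bin/mpif90 examples/hello_usempi.f90 -o hello_usempi
shell$ /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node ./hello_c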

We run the following tests:

All run like:

shell$ /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node  --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c
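
The hostfile referenced above follows the usual Open MPI format, along these lines (hostnames and slot counts are placeholders, not the actual cluster layout):

node01 slots=4
node02 slots=4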

Timing

A successful run through CI will take about 12-15 minutes. Most of that is building OMPI.

GNU:
autogen      : (0:03:54)
configure    : (0:02:22)
make         : (0:03:22)
make install : (0:00:58)

XL:
autogen      : (0:03:52)
configure    : (0:03:41)
make         : (0:05:29)
make install : (0:00:35)

PGI:
autogen      : (0:03:33)
configure    : (0:06:05)
make         : (0:10:01)
make install : (0:01:08)

artemry-nv commented 4 years ago

Mellanox Open MPI CI

Scope

Mellanox Open MPI CI is intended to verify Open MPI with recent Mellanox SW components (Mellanox OFED, UCX and other HPC-X components) in the Mellanox lab environment.

CI is managed by the Azure Pipelines service.

Mellanox Open MPI CI includes:

Related materials:

Platform

CI Scenarios

Configure options

Specific configure options (combinations may be used):

--with-platform=contrib/platform/mellanox/optimized
--with-ompi-param-check
--enable-picky
--enable-mpi-thread-multiple
--enable-opal-multi-threads

Build options

make -j$(nproc)

Build scenario:

./autogen.sh
./configure ...
make -j$(nproc) install
make -j$(nproc) check
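
A concrete (hypothetical) instance of that scenario, assuming a local install prefix and a UCX install under /opt/ucx, could look like:

shell$ ./autogen.sh
shell$ ./configure --prefix=$PWD/install \
           --with-platform=contrib/platform/mellanox/optimized \
           --with-ucx=/opt/ucx
shell$ make -j$(nproc) install
shell$ make -j$(nproc) check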

Tests run

Sanity tests (over UCX/HCOLL):
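
As an illustration only (the exact binaries and parameters used in this CI are not listed here), a sanity run over UCX with HCOLL enabled could be launched along these lines:

shell$ mpirun -np 8 --map-by node --mca pml ucx --mca coll_hcoll_enable 1 ./ring_c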

Timing

CI takes ~18-20 min. (mostly Open MPI building).

jjhursey commented 4 years ago

Thanks @artemry-mlnx for that information. Do you test with oshmem as well?

Should we be running make distcheck in CI? Are there other OMPI integrity checks that we should be running on a routine basis?

I'm going to be out for a week, but don't let that stop progress.

jsquyres commented 4 years ago

Should we be running make distcheck in CI?

Yes! But run it in parallel with the other CI jobs, because distcheck takes a while. Also make sure to use the appropriate AM mechanism to pass a good -j value down to the make invocations inside distcheck, so that the multiple builds distcheck performs aren't serialized; this can significantly speed up overall distcheck time.
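
For example (a hedged sketch, not the exact AM incantation): with GNU make, the jobserver propagates -j into the nested builds that distcheck runs, and the standard Automake variable DISTCHECK_CONFIGURE_FLAGS can forward configure options to the inner build:

shell$ make -j 8 distcheck
shell$ make -j 8 distcheck DISTCHECK_CONFIGURE_FLAGS="--disable-dlopen"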

Are there other OMPI integrity checks that we should be running on a routine basis?

make check?

wckzhang commented 4 years ago

Currently, our testing includes MTT runs with EFA and TCP. These cover v2.x, v3.0.x, v3.1.x, v4.0.x, and master. These are the configure options:

--oversubscribe --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio --enable-mca-no-build=io-ompio,common-ompio,sharedfp,fbtl,fcoll CC=xlc_r CXX=xlC_r FC=xlf_r --with-ofi=/opt/amazon/efa/

CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/

--enable-static --disable-shared
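
A hedged sketch of exercising one of these builds over EFA via libfabric (the MCA selections and test binary here are assumptions for illustration, not taken from the configuration above):

shell$ mpirun -np 2 --mca pml cm --mca mtl ofi ./ring_c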

In our nightly, canary, and CI tests for libfabric, we only use Open MPI 4.0.2 (soon to be switched to 4.0.3). We use the release versions rather than pulling from the GitHub branch directly. These tests mainly run on our network-optimized instance types, such as the c5n family: https://aws.amazon.com/ec2/instance-types/